Scale delivers proven data, evaluations, and outcomes to AI labs, governments, and the Fortune 500.
While there are few direct user reviews available for "Scale AI", the presence of multiple social mentions, particularly on Reddit and YouTube, indicates a level of engagement and interest in its capabilities. The primary strength appears to be its reputation for facilitating advanced AI developments and integrations, which suggests a robust toolset for AI deployment. There are no explicit complaints or pricing details cited in the mentions, leaving some uncertainty about its affordability or cost-effectiveness. Overall, Scale AI seems to have a solid reputation in the AI community as a valuable asset for complex AI projects, but more detailed user feedback would help clarify its user satisfaction and areas for improvement.
Mentions (30d): 19
Reviews: 0
Platforms: 2
Sentiment: 0% (0 positive)
Industry: information technology & services
Employees: 1,000
Funding Stage: Merger / Acquisition
Total Funding: $16.9B
SpaceXAI locked Anthropic into paying them $1.25 billion per MONTH for compute
Claude makes documents into apps
# Any document can become an app

I’ve been working on an open-source document format and viewer called **Adaptive Markdown**.

The basic idea is simple: A document should not have to stay static. It should be something a coding agent can extend, reshape, and turn into an interactive workspace.

This is not just a canvas you edit with a chatbot. The bigger idea is that the document becomes both:

1. the source of truth
2. the programmable interface

In other words, the document becomes a living app.

You write notes, collect data, draft text, or import files. Then a coding agent can directly modify the document surface: add charts, create calculators, build filters, restyle sections, generate summaries, export views, or turn rough notes into an interactive tool.

So instead of having:

* a document
* a spreadsheet
* a dashboard
* an app
* a changelog
* a separate AI chat about all of it

You can have one living `.md` file that contains those layers together.

# Example

A fitness log might start as a plain Markdown journal. Then the agent adds charts. Then it pulls in device data. Then it adds weekly summaries, rolling averages, goal tracking, export options, and a dashboard view.

The document did not move into an app. The document became the app.

# Other use cases

* A billable time log that computes subtotals and rewrites rough notes into polished narratives
* A research notebook with experiment parameters, runnable code, outputs, and methodology notes
* A recipe book that scales servings and generates shopping lists
* A math textbook that can explain a theorem at different levels
* A project README that explains the system, demonstrates the system, and lets the agent modify it from inside the document
* A small data report with embedded CSV data, live charts, filters, and exportable views

The thing I’m most interested in is not "Can Markdown support more widgets?" It is: **What happens when the document itself becomes the programmable, agent-editable interface?**

# Demos

I made a few short video demos:

* Turn your document into a snake game: [https://youtu.be/l-I2UiZd-Jw](https://youtu.be/l-I2UiZd-Jw)
* Basic Adaptive Markdown features: [https://youtu.be/cLdzvZAL96I](https://youtu.be/cLdzvZAL96I)
* Import CSV, create tables, edit and format them: [https://youtu.be/XKh9D3BlTCg](https://youtu.be/XKh9D3BlTCg)
* Import MusicXML and transpose sheet music: [https://youtu.be/8YV3zjMLvA8](https://youtu.be/8YV3zjMLvA8)

# Why I’m excited about this

The biggest use case I’m excited about is academic and technical reading. In a few years, I don’t think people will just read papers passively. I think they’ll translate passages, ask questions, generate examples, explore alternate proofs, run code, attach notes, convert math to Lean where possible, and keep all of that inside the document instead of scattered across chats and notebooks.

This is already pretty natural inside a browser when a coding agent has access to JS, CSS, and the document structure.

It’s very early, but the workflow already feels useful to me. I’m using it for my own notes and documents. Right now it is configured for the Anthropic coding-agent SDK and experimentally for Codex. The longer-term goal is to make it run entirely locally.

GitHub: [https://github.com/SemiSimpleMath/Adaptive-Markdown](https://github.com/SemiSimpleMath/Adaptive-Markdown)

I recently added per-document skills, so agents can automatically know how to style or transform the text or data inside a specific document.

Curious whether this seems useful to anyone else, or whether I’m just overexcited because I built it. Feature requests welcome.
Current Gen-AI is like a sophisticated parrot. Here's what happened when I gave one server access.
https://preview.redd.it/elfctxuffh3h1.png?width=3496&format=png&auto=webp&s=05dbe41eab29a5d694dd197a3547f25ab729726a

I’ve been using LLMs since they became publicly available. Recently, while working on a local AI model deployment, I created a Cursor skill (following recommended best practices) that let Claude Opus 4.6 SSH into our development VM for deployment and debugging.

The first POC went perfectly. For the second, I asked Claude to help deploy to a new directory. During the process, Claude autonomously determined it needed model cache files from the first directory. Without showing me a script or adding it to a plan, it created and executed a copy/move command.

# The Incident

The script it generated relied on `$DST` and `$SRC` bash variables. Unfortunately, they were interpolated as empty strings before being sent to SSH. The result? It evaluated to `rm -rf /*` and executed instantly on the VM.

By the time I realized what was happening, SSH access was lost. The POC was gone. Claude then calmly monitored background tasks, ran state checks, killed stale sessions, and cheerfully delivered this post-mortem to me:

> Good news.

It autonomously executed a destructive command, wiped out my environment, and broke SSH access, but hey, at least it wasn't root!

# The Reality Check

This exposed a few harsh realities about the current "agentic" hype that I think get glossed over:

* **Rules Don’t Guarantee Safety:** Even with tight rules, explicit skills, and guardrails, you cannot rely on an agent to automate critical tasks. By the time you realize something is wrong, the files are gone and 23 stale sessions are hanging.
* **The Review Paradox:** The industry tells us to "just review the AI's code." But modern LLMs write/refactor thousands of lines across multiple files in seconds. If we need to meticulously review every generated line and validate every autonomous choice to prevent disaster, the entire value proposition of "speed and scale" is broken. We might as well write it ourselves.
* **Pattern Matching vs. Comprehension:** AI completes patterns; it doesn’t comprehend outcomes. It can write `rm -rf /*` without understanding what a blast radius is, or why you'd want to stop it.

**TL;DR:** AI as an assistant (boilerplate, prototyping, docs) = perfect. AI as an autonomous agent = it's a very sophisticated parrot. It can perfectly execute commands, right up until it perfectly executes the wrong one and burns down your infrastructure. Keep your hands on the wheel.

(If you're interested in the full details and lessons learned, I wrote a deeper dive here: [Medium](https://medium.com/@abhishekbhardwajca/the-ai-hype-cycle-a-software-engineers-reality-check-2c094ef4938f))
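
The failure described in this post, path variables that expand to empty strings before the command reaches the remote shell, can be caught locally before anything is sent over SSH. The following is a minimal sketch under stated assumptions, not the script from the incident: the helper name `build_remote_copy` and the example paths are hypothetical. It rejects empty, relative, or root-level paths and quotes each argument with the standard library's `shlex.quote`.

```python
import shlex
from pathlib import PurePosixPath


def build_remote_copy(src: str, dst: str) -> str:
    """Return a shell command string that copies src to dst on a remote host.

    Hypothetical helper: both paths are validated locally, so an empty or
    root-level value raises here instead of reaching the remote shell.
    """
    for name, value in (("src", src), ("dst", dst)):
        if not value or not value.strip():
            raise ValueError(f"{name} is empty; refusing to build a command")
        path = PurePosixPath(value)
        if not path.is_absolute():
            raise ValueError(f"{name} must be an absolute path: {value!r}")
        # A root path has a single component ("/"); anything usable has more.
        if len(path.parts) < 2:
            raise ValueError(f"{name} points at the filesystem root")
    # shlex.quote turns each path into one literal shell word, so the
    # remote shell does not expand or split it again.
    return f"cp -a -- {shlex.quote(src)} {shlex.quote(dst)}"
```

A caller would pass the returned string as the remote command only after this function returns without raising. The check is deliberately narrow: it does not decide whether the agent should be allowed to run the command at all, only that the command is the one that was intended.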
AI solves 80-year-old math conjecture for under $1000
GPT-next solved an 80-year-old Erdős combinatorics conjecture for under $1,000 in compute. That single fact reframes everything else happening this week. The [Erdős unit distance problem](https://www.latent.space/p/ainews-openai-gpt-next-disproves) resisted human mathematicians since 1946. A frontier model closed it at a cost lower than a mid-tier SaaS subscription, which means the boundary between "AI as tool" and "AI as independent discoverer" is no longer theoretical. [Lilian Weng's new deep dive](https://lilianweng.github.io/posts/2025-05-01-thinking/) on test-time compute and chain-of-thought reasoning explains the underlying mechanism: reasoning models are not retrieving known proofs, they are generating novel inference chains at scale. The infrastructure layer is pricing this in faster than most observers realize. [Railway reports $200K+ monthly coding agent spend](https://www.latent.space/p/railway) and 100K signups per week, and is now building own-metal data centers to absorb the load. Daytona hit 850K daily sandbox runs with 74% month-over-month growth, confirming that isolated compute environments are now a first-class primitive, not a niche DevOps concern. Three specialized infrastructure companies, Exa, Modal, and TurboPuffer, reached unicorn valuations simultaneously this week, covering retrieval, serverless GPU, and vector search. When picks-and-shovels companies price in sustained demand at the same moment, it is not coincidence. Every major lab has now repositioned as an agent lab, not a model lab. [ClickUp replacing hundreds of employees with thousands of AI agents](https://techcrunch.com/2026/05/25/what-clickups-mass-layoff-tells-us-about-the-future-of-work/) is the first established tech company to execute that repositioning at the labor level rather than just the product level. 
The counterweight is that [Salesforce customers remain locked in](https://www.theregister.com/saas/2026/05/26/the-saas-pocalypse-can-wait-salesforce-still-has-customers-where-it-wants-them/5245228) despite the theoretical ability to rebuild on AI-native stacks cheaply. Data gravity and switching costs are buying incumbents time, but ClickUp's move suggests that time is measured in quarters, not years. The governance conversation caught up this week in an unexpected place. [Pope Leo XIV's 42,000-word encyclical](https://simonwillison.net/2026/May/25/encyclical-on-ai/#atom-everything) names specific failure modes including algorithmic control, surveillance capitalism, and autonomous weapons, and will directly shape EU and Latin American regulatory debates. [TechCrunch's read](https://techcrunch.com/2026/05/25/the-popes-ai-encyclical-isnt-really-about-ai/) is that the document's real target is the tech elite's capacity to reshape society outside democratic accountability, a framing that lands harder alongside [new UK research](https://www.theregister.com/off-prem/2026/05/26/big-tech-extracts-retirement-scale-wealth-from-uk-internet-users-research-shows/5246048) quantifying data extraction from consumers as equivalent in value to retirement savings. The Vatican and the empiricists arrived at the same diagnosis from opposite directions. Two structural forces will shape AI infrastructure economics over the next 90 days in ways most deployment teams are not modeling. China flooding global markets with DRAM and NAND will compress inference cluster costs faster than US export controls intended. The EU's sovereign cloud setback has paradoxically clarified the build-domestic mandate, accelerating European AI infrastructure investment independent of US hyperscalers. Security remains the open variable: even Google has no established playbook for prompt injection, model supply chain risk, or agentic authorization at production scale. 
A second Fortune 500 company will publicly attribute a reduction of more than 500 knowledge-worker roles directly to agentic AI systems before Q3 earnings season, making ClickUp's announcement the start of a visible series rather than an isolated case.
AI Infrastructure Has a Physical Weak Spot Nobody Talks About Enough - Copper Supply Shocks
Something interesting happened this week that barely crossed into mainstream AI discussion. A strong earthquake in Chile disrupted copper ore production and pushed copper prices higher again.

Chile matters because it produces roughly 24% of the world’s copper supply, and a huge part of global AI infrastructure indirectly depends on that metal. That connection is becoming impossible to ignore.

Everyone talks about GPUs, compute scaling, inference costs, and power demand. But very few people talk about the raw materials underneath the entire AI stack.

Copper is everywhere inside AI infrastructure:

* data center power systems
* transformers
* cooling systems
* switchgear
* high-voltage cabling
* backup energy systems
* grid expansion
* GPU interconnect infrastructure

A single hyperscale AI data center can reportedly consume tens of thousands of tonnes of copper depending on scale and power architecture.

At the same time, global copper supply is getting tighter:

* new mines can take 15-20+ years to develop
* major deposits are aging
* permitting remains difficult globally
* geopolitical risk keeps increasing
* now even earthquakes are disrupting supply chains

This is where the story becomes interesting from an AI perspective. AI demand growth is exponential. Copper supply growth is not.

That mismatch is why more people are suddenly watching early-stage copper exploration companies again. One example is NovaRed Mining Inc. and its Wilmac Copper-Gold Project in British Columbia. Not because it is producing copper today - it is not. But because markets are starting to realize future AI infrastructure may require entirely new copper discoveries.

Some interesting details about Wilmac:

* 16,078 hectares in BC’s Quesnel porphyry belt
* located near Hudbay’s Copper Mountain Mine
* soil results up to 1,125 ppm copper
* interpreted intrusive centers identified
* recent IP/AMT geophysics added deeper targeting data
* company also pushing an AI-assisted targeting platform called MetalCore

The bigger point is not "this stock goes up." The bigger point is that AI is no longer just a software story. It is becoming a materials story. And every supply disruption - whether geopolitical, regulatory, or seismic - reminds the market that physical infrastructure still matters.

The AI boom may eventually depend just as much on copper supply chains as on semiconductor innovation itself.

NFA.
AI Doesn't Exist, and Poop Proves It
[robot](https://preview.redd.it/w44kmovo1h3h1.png?width=1448&format=png&auto=webp&s=786825279828a5650259aa1376698133a1aa4c66)

*Maybe we should have called it accumulated intelligence.*

There is no artificial intelligence. Or at least, I don't think the word "artificial" is as clean as we pretend it is. I know this blog smells funny. Let me decompose it.

What do we even mean when we say something is artificial? Usually we mean man-made. Something humans made. Something that would not exist without humans, but after humans, it exists because humans made it happen.

That definition is useful. I understand why we use it. Even the original 1955 Dartmouth proposal, the document that helped name the field of "artificial intelligence," used the phrase in a practical way: a machine could be made to simulate parts of learning or intelligence. As a scientific label, the word has a job.

So I am not really arguing with the dictionary. I know artificial can simply mean human-made. That is not the part I have a problem with. I am arguing with the feeling the word creates.

But there is another meaning hiding inside it. Artificial starts to feel like separate. Fake. Unnatural. Something that does not really belong to this world. And that is where I think the word starts confusing us.

Because humans are not outside nature. The brain is natural. It is part of this earth. Biology produces a thought. That thought becomes an action. That action becomes a tool, a house, a wheel, a computer, or a model that can answer questions in language.

So where exactly does the artificial part begin?

# Human-made does not automatically mean unnatural

If I take a seed and plant it, and then a plant grows, is that plant artificial? It happened because of human action. I moved the seed. I changed the situation. Maybe without me, that plant would not have grown there. But we still do not call the plant artificial. We understand that the plant is natural, even if human action helped it happen.

Now take a wheel. A human thought about how to make travel easier. How to cover distance more efficiently. That thought became a shape. That shape became an object. That object changed how humans moved through the world.

We call the wheel artificial because it was made by humans. But the human who imagined it was not artificial. The brain that produced the thought was not artificial. The need to move, carry, build, survive, and improve was not artificial.

So again: where did the artificial part enter?

Maybe we say "artificial" because it separates what existed before humans from what humans transformed. That is fine for communication. A tree and a wooden table are not the same thing. Designed things, synthetic things, industrial things, and harmful things can still be meaningfully different from a tree in a forest.

But also, humans never really make anything from nothing. We transform what is already here. We take energy, matter, language, memory, need, and imagination, and we rearrange them. It is never fully made from nowhere. It is transformed.

So I am not trying to erase all distinctions by calling everything natural. Natural does not mean harmless. Natural does not mean good. Natural does not mean morally excused. I am only saying that human-made things are not outside nature just because humans made them.

# Poop and thoughts are the same, in one simple way

I know this is a strange example. Sometimes I have this itch to say the first thought that comes into my head. Unfortunately, this was the first thought. But maybe that is why it works. It is funny because it is too human. Also, it makes the point clearly.

Why isn't poop artificial? Poop is a product of a human being. It comes from the body. It is produced by biology. We do not call it artificial, even though it is made by a human in the most literal way.

A thought is also a product of a human being. It comes from the brain. It is produced by biology too.

Poop and thoughts are the same in one simple way: both are products of a human. We treat one as biology. We treat the other as invention. But why? Why does one product of the human body feel natural, while another product of the human body becomes artificial the moment it turns into a tool?

A thought does not stop being natural just because it becomes useful. A thought does not become unnatural just because it becomes a wheel, a house, a car, a computer, or a machine that can respond to language. It is still a product of the same earth. The same biology. The same human need to survive, organize, create, and understand.

# We don't call a beehive artificial

Think about ants building a colony. They create a structure that is safer and more efficient for them. They organize themselves. They transform the environment around them. They make something that was not there before. But we do not look at an ant colony and say, "This is artificial."

Same with bees making a hive. A beehive is
AI has just solved not one, but nine novel math problems, and proved 44 new conjectures. Some of these problems had been unsolved for 50 years.
thoughts on why AI agents are starting to look like SaaS billing systems
Came across this pattern while writing about enterprise AI infra recently. A lot of teams think the hard part is model quality, but once agents hit production scale the real problems become orchestration, retries, entitlements, rate limits, and auditability. Pretty much the same operational mess SaaS billing teams dealt with years ago. The line we ended up linking back to a lot was “[agents in 2026 are the billing systems of 2017](https://thefinancialengineer.substack.com/p/agents-in-2026-are-billing-in-2017?r=7fu7t6)”.
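
Two of the concerns listed above, retries and auditability, can be illustrated in a few lines. This is a minimal sketch, assuming a tool call that accepts an idempotency key; the names `call_with_retries` and `TransientError` are hypothetical and do not come from the linked article. The same key is reused across attempts so a downstream system could deduplicate, and every attempt is appended to an audit log.

```python
import time
import uuid
from typing import Callable, Dict, List, Optional


class TransientError(Exception):
    """A failure that is worth retrying (timeout, rate limit, and so on)."""


def call_with_retries(
    action: Callable[[str], str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    audit_log: Optional[List[Dict[str, object]]] = None,
) -> str:
    """Run action(idempotency_key) with exponential backoff and an audit trail."""
    log = audit_log if audit_log is not None else []
    # One key for all attempts, so the receiving system can deduplicate.
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            result = action(key)
        except TransientError as exc:
            log.append({"key": key, "attempt": attempt, "ok": False, "error": str(exc)})
            if attempt == max_attempts:
                raise
            # Delay doubles after each failed attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
        else:
            log.append({"key": key, "attempt": attempt, "ok": True})
            return result
    raise RuntimeError("max_attempts must be at least 1")
```

Entitlements and rate limits would sit in front of a function like this one; they are left out here to keep the sketch short.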
Cerebras Chip Sets Appear to be Optimized for LLM Use Cases
One distinction I think is getting lost in the [Cerebras hype cycle](https://finance.yahoo.com/sectors/technology/articles/cerebras-challenges-nvidia-chip-dominance-040100169.html?guccounter=1) is that Cerebras is primarily an LLM / generative AI infrastructure story, not a universal “all AI” chip story.

That is not necessarily a criticism of Cerebras. Their wafer-scale approach is genuinely interesting, and for large model training and inference the design is compelling. [Cerebras’ own public inference materials](https://inference-docs.cerebras.ai/models/overview) discuss applications mostly centered on open [LLMs such as Llama, Qwen, GLM, and GPT-OSS](https://www.cerebras.ai/infcamp). The inference metrics are [expressed in tokens per second](https://www.cerebras.ai/press-release/cerebras-launches-the-worlds-fastest-ai-inference), which is fundamentally a language-model / generative inference framing rather than a robotics or industrial-control framing.

**What Kind of AI Compute?**

But “AI compute” is not one undifferentiated market. LLM inference is one class of AI compute. Robotics, autonomous vehicles, drones, industrial controls, real-time vision, embedded perception, video pipelines, and sensor-fusion systems are very different classes of AI compute.

Thus, it appears from Cerebras’ own materials that their chip sets are not optimized for what comes after LLMs, such as JEPA-style World Models or other post-transformer architectures. Those systems are not merely asking, “How fast can I generate tokens?” They often care about power envelope, edge deployment, ruggedization, latency determinism, camera/radar/lidar integration, feedback loops, safety certification, and real-time physical control.

[Cerebras’ own CS-3 messaging](https://www.cerebras.ai/blog/cerebras-cs3), by contrast, frames the system around accelerating “the latest large AI models,” and the testing data is from the likes of Llama 2, Falcon 40B, MPT-30B, and multimodal models, again measured through tokens/second style throughput.

**The Chip Hierarchy**

This is also where the hardware distinction matters. Specialized ASICs are [usually the narrowest bet](https://www.hilscher.com/service-support/glossary/application-specific-integrated-circuit): if the workload matches the chip, they can be extremely efficient, but that [efficiency comes from specialization](https://www.synopsys.com/glossary/what-is-asic-design.html). Cerebras [appears broader than a narrow single-use ASIC](https://inference-docs.cerebras.ai/models/overview), but still much more concentrated around datacenter large-model training and inference. NVIDIA GPUs, by contrast, [are less specialized](https://www.nvidia.com/en-us/) but much [more broadly useful](https://developer.nvidia.com/cuda) across AI workloads, including LLMs, vision, robotics, simulation, [autonomous systems](https://www.nvidia.com/en-us/industries/robotics/), edge AI, and industrial applications.

So the question is not merely whether Cerebras is “better” or “worse” than NVIDIA. The question is what part of the AI hardware market we are talking about.

**Challenge NVIDIA?**

This is why I think people should be careful when saying Cerebras is going to “challenge Nvidia” without specifying the battlefield. Challenge Nvidia in what? High-speed LLM inference? Large model training? Datacenter generative AI workloads? That is a much more plausible and specific claim. Cerebras has [even published and promoted work](https://www.cerebras.ai/whitepapers) specifically on training large language models, and [independent benchmarking literature](https://arxiv.org/abs/2409.00287) also evaluates Cerebras WSE in terms of LLM training and inference performance.

**The Distinction that's Necessary**

The point is not that Cerebras is overhyped. The point is that it is important in a specific part of AI and that distinction should be made clear. Cerebras may become a very serious player in LLM infrastructure, especially if the market continues to reward faster and cheaper LLM inference. But that does not mean it is positioned the same way across non-LLM AI.

The current hype cycle tends to conflate “LLMs” with general “AI” compute, and that makes the hardware discussion less useful and less clear. So ultimately, an investment in Cerebras looks more like a bet on current LLM infrastructure than a broad bet on the future form of AI. It may be a good bet, but people should understand what kind of bet it is.
Folder structure of the AI agent - after 6 weeks
# The folder structure is not admin. It's the nervous system.

When people imagine an AI agent, they picture the model, the prompts, maybe the tool calls. Almost nobody pictures the folders. That is exactly why most home-grown agents stall around month two.

An agent's filesystem is where its **identity, memory, work, and history physically live**. A messy filesystem produces a confused agent: not metaphorically, literally. The model reads paths. The model picks files by name. The model writes new files based on patterns it sees in old ones. If your directory tree is chaos, every output drifts a little further from coherent.

agentmia.beehiiv.com - newsletter about building agents

Below is the layout I converged on after nine months and roughly four refactors. Steal the parts that fit; the principles matter more than the exact names.

# The numbering convention

Folders are prefixed with a two-digit number: `01_`, `02_`, `09_`, `99_`. Two reasons:

1. **Sort order is meaning.** Anything starting with `0` lives near the top. `99_` falls to the bottom. The most important directories are visually first; archives are visually last. You read the agent's brain top-to-bottom.
2. **Gaps are intentional.** I jump from `04_` to `06_`, from `09_` to `11_`. The gaps are reserved insertion points. When a new domain emerges, it slots in without renaming everything.

Two folders deliberately skip the prefix: `Inbox/` and `Outbox/`. They are operational, not structural. They live above the numbered set because they are touched dozens of times a day.

/mapped on desktop/

# Inbox/ - the unprocessed pile

Anything dropped into the agent's world starts here. Files I want it to ingest. Screenshots. Exports from other systems. PDFs that need parsing, Gmail attachments, all downloads from Chrome.

The rule: **nothing stays in Inbox.** A dedicated processing routine classifies, routes, and deletes. If Inbox is non-empty for more than a day, the system is failing.

Treat this like a real-world physical inbox tray. The point of a tray is that it gets emptied.

# Outbox/ - what the agent produced for you

Every file the agent writes anywhere in the tree gets a copy here, simultaneously. When I open `Outbox/`, I see exactly what was generated this session - no spelunking through twelve subdirectories.

This sounds redundant. It is not. Without it, "what did the agent do today?" becomes a hunt. With it, the answer is one click.

`Outbox` is wiped during the next Inbox processing run. It is a viewing surface, not storage.

# .auto-memory/ - the hot memory

The single most important directory in the system. Hidden by default because you should not be editing it manually.

It holds the agent's working memory: user preferences, feedback rules, entity facts (people, companies, deals), active hypotheses, project pointers, session hot context. Roughly 400–500 small markdown files, each one a single topic.

**Why hidden?** Because it is the agent's hot path. It loads from here every session. If I open the folder and start manually rearranging it, I am racing the agent. Treat it like a database, not a notebook.

**Why so many small files?** Because the agent greps by topic. One monolithic memory file becomes unreadable to the model around 50 KB. Many small files are easier to load partially, easier to index, easier to expire.

# 01_IDENTITY/ - who the agent is

The constitutional layer. Name, role, voice rules, principle stack, visual system, behavioral defaults. This rarely changes. When it does change, everything downstream changes with it.

I keep it as folder `01_` because every other folder is downstream of it. If you do not know who the agent is, you cannot know what its workflows should look like, or what it should remember, or how it should respond.

# 02_MEMORY/ - governance, not data

A subtle but critical distinction: `.auto-memory/` holds the *data*, `02_MEMORY/` holds the *rules about data*.

In `02_MEMORY/` live the constitution, the boot protocol, the naming protocol, the decision protocol, the profile standards (what a "supplier profile" must contain, what a "customer profile" must contain), the capability map.

The agent reads these documents to know *how to remember*, *how to name new files*, *how to decide what is reversible*. Without this folder, every memory write is improvised.

# 03_PROJECTS/ - the active work

Real work happens here. Sub-organized by goal area, then by project slug: `03_PROJECTS/areas/{goal}/{slug}/`

Each project gets its own folder with a standard skeleton: `README.md`, `TASKS.md`, `CHANGELOG.md`, `BRIEF.md`, plus working files. There is a project registry at the top that the agent reads to know what is active versus dormant versus archived.

The biggest discipline issue here: **do not let projects sprawl outside their folder.** When working on Project X, every file related to Project X goes inside Proj
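
Two of the rules described in this post are mechanical enough to sketch in code: the Outbox rule (every file the agent writes also gets a copy in `Outbox/`) and the standard project skeleton under `03_PROJECTS/areas/{goal}/{slug}/`. This is an illustrative sketch only, not the author's tooling; the function names `write_with_mirror` and `scaffold_project` are hypothetical, and the flat Outbox layout is an assumption.

```python
import shutil
from pathlib import Path

# File names taken from the project skeleton described in the post.
SKELETON = ("README.md", "TASKS.md", "CHANGELOG.md", "BRIEF.md")


def write_with_mirror(root: Path, relative_path: str, content: str) -> Path:
    """Write a file under root and copy it into root/Outbox.

    Assumption: the mirror is flat, so two files with the same name in
    different folders would overwrite each other in Outbox.
    """
    target = root / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    outbox = root / "Outbox"
    outbox.mkdir(exist_ok=True)
    shutil.copy2(target, outbox / target.name)
    return target


def scaffold_project(root: Path, goal: str, slug: str) -> Path:
    """Create 03_PROJECTS/areas/{goal}/{slug}/ with the standard skeleton."""
    project_dir = Path("03_PROJECTS") / "areas" / goal / slug
    for name in SKELETON:
        write_with_mirror(root, str(project_dir / name), f"# {slug}: {name}\n")
    return root / project_dir
```

Routing every write through one function is what makes the "what did the agent do today?" question answerable from a single folder; a real setup would also need the wipe step that the post ties to the next Inbox processing run.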
The famous METR AI time horizons graph contains numerous severe errors [D]
Nathan Witkin, a research writer at NYU Stern’s Tech and Society Lab, [writes](https://www.transformernews.ai/p/against-the-metr-graph-coding-capabilities-software-jobs-task-ai) damningly about the famous METR AI time horizons graph in the Substack publication Transformer:

> It is impossible to draw meaningful conclusions from METR’s Long Tasks benchmark - in particular once one realizes that its numerous flaws are probably compounding in unpredictable ways. The appropriate response to a study of this kind is not to assume it can be saved via back-of-the-envelope adjustments, or to comfort oneself that other anecdotal evidence implies that it is probably correct anyway. It is to cut one’s losses and move on in search of higher-quality information.

> … The METR graph cannot be saved. For all its sleekness and complexity, it contains far too many compounding errors to excuse. Among them is generalizing to the entire species data collected from a small group of the authors’ peers. Coming up with ever more dramatic ways to make this mistake has become a kind of sport among AI researchers. If the field has a central pathology, it is to aggressively overindex on a mix of anecdotal data from power-users, alongside a long list of benchmarks [even more compromised](https://benchrisk.ai/score) than METR’s. One hopes that as the field matures, its participants will learn to stop making these mistakes.

The errors include:

* Some of the human baselines data is not actually measured or collected from any empirical source; rather, it is just guesstimated by the authors
* A key variable in the data is how long it takes humans to complete certain tasks, but when METR did actually measure this, it paid its human benchmarkers hourly, meaning they were incentivized with cash to take longer
* The sample of human benchmarkers was biased toward METR employees’ friends, acquaintances, and former colleagues (who are likely unrepresentative and possibly biased)
* Humans familiar with a codebase and a specific coding task were 5-18x faster at completing it, but METR used data from humans who were much slower because they had to spend time familiarizing themselves with the codebase and the task at hand
* Test-training data contamination occurred because some of the tasks had published solutions online, which most likely would have been included in LLMs’ training datasets
* And many more

Please read the [full post](https://www.transformernews.ai/p/against-the-metr-graph-coding-capabilities-software-jobs-task-ai). It’s not too long and it’s accessible to a general audience. It’s worthwhile to read the whole post and see how many errors were made in the creation of the METR graph and just how bad they are.

If you want to read about *even more* errors in the METR graph not covered in Nathan Witkin’s post, read [this post](https://garymarcus.substack.com/p/the-latest-ai-scaling-graph-and-why) by the AI researchers Gary Marcus and Ernest Davis.

The METR graph is a great example of why scientific standards and best practices are so important, and why enforcing them through processes like peer review is necessary to prevent us from drowning in bad information. It’s extremely dangerous to rely on information that only superficially appears scientific but wasn’t actually conducted with the rigour normally required of scientific research.
Why is there a sudden demand for a bunch of data centers?
I live in Pennsylvania, and in just the past year there have been about a dozen data centers proposed within a 30-mile radius of me, all pretty large-scale projects. I’m confused because we have a bunch of AI now that’s working without all these newly proposed data centers. I understand it continues to advance and grow, but why is there such a significant spike? Is there actually demand, or are these going to be mostly unused?
A CEO built his own AI agent with Claude MCP + NetSuite. It worked. Then it didn't scale.
How many of you have a prototype that demos great and then falls apart the moment real users touch it? Yeah. This is that story, except the person who built the prototype was the CEO himself.

S&B Filters, a U.S. manufacturer with 700+ employees, runs its entire operation on NetSuite. Their CEO wired up Claude's MCP connector to NetSuite, wrote his own prompts, and got an internal AI assistant working for order status lookups. Legit impressive for a solo build.

Then the fun part: 4–6 minute response times, a 40-page prompt holding the whole thing together, PO numbers coming in different formats from Shopify, phone, and email, and zero path to putting this in front of actual customers. He came to us basically saying, "I proved it works, now make it work for real."

We didn't patch the prototype. Our team at BotsCrew rebuilt the whole stack around NetSuite as the source of truth. We built an input normalization layer that validates across formats, falls back across identifiers (Sales Order > PO > customer reference), and uses conversation context when the input is garbage. This was 80% of the engineering challenge.

Then: two interfaces off one backend, an internal assistant for the support team and a customer-facing one on the website. Same AI layer, different access controls. Beyond order lookups, it covers installation guides, compatibility checks, and technical inquiries with images and videos. Dynamic knowledge base via OneDrive, updated by the client without redeployment.

Results:

* \~50% of support requests are fully automated
* 24x faster first response
* \~$140K/year in savings
* \~250% ROI in Year 1

Now they're expanding into full order management, dealer identification, and personalized discounts through the same system. One prototype turned into a full AI program.

If you want to read the full case study with screenshots and more technical details, I'll drop the link in the comments.
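To make the normalization-and-fallback idea concrete, here is a minimal sketch in Python. It is not BotsCrew's implementation: the identifier formats, the `PO`/`SO` prefixes, the `resolve_order` name, and the dictionary-backed lookups are all assumptions made up for illustration. Only the fallback order (Sales Order > PO > customer reference) and the use of conversation context come from the post.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Fallback order described in the post: Sales Order > PO > customer reference.
ID_KINDS = ("sales_order", "po", "customer_ref")

Lookup = Callable[[str], Optional[dict]]


def normalize(raw: str) -> str:
    """Uppercase the input and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", raw.upper())


def strip_prefix(token: str) -> str:
    """Remove a leading 'PO' or 'SO' label when digits follow (assumed formats)."""
    return re.sub(r"^(PO|SO)(?=\d)", "", token)


def resolve_order(
    raw: str,
    lookups: Dict[str, Lookup],
    context: Optional[str] = None,
) -> Optional[Tuple[str, dict]]:
    """Try the user's input first, then an identifier remembered from the conversation.

    Each candidate is checked against the identifier kinds in priority order.
    Returns (kind, record) for the first hit, or None if nothing matches.
    """
    candidates = []
    for text in (raw, context or ""):
        token = strip_prefix(normalize(text))
        if token and token not in candidates:
            candidates.append(token)

    for token in candidates:
        for kind in ID_KINDS:
            record = lookups[kind](token)
            if record is not None:
                return kind, record
    return None
```

With in-memory dictionaries standing in for the NetSuite queries, `resolve_order("PO #12-345", lookups)` and `resolve_order("po12345", lookups)` resolve to the same record, and an unusable input falls back to whatever identifier was passed as `context`.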
AI Whistleblower: We Are Being Gaslit By AI Companies, They’re Hiding The Truth! - Karen Hao
Here is a recent interview with technology journalist Karen Hao (author of Empire of AI). She provides a highly critical look at how major AI companies, specifically OpenAI, operate and the narratives they use to maintain control.

To help spark the conversation, here are 5 critical points from the interview. I'm curious what you all think about her assessment.

* [00:10:05] Shaping the Narrative: Hao argues that executives intentionally fabricate existential risk narratives to secure immense funding and maintain exclusive control over the technology's development, framing themselves as the only ones capable of managing it.
* [00:42:11] Internal Instability: Sam Altman was temporarily fired in 2023 because key OpenAI board members and executives felt his leadership style was dangerously chaotic for a company building such consequential technology.
* [01:23:35] Labor Exploitation: The push for AI is already displacing middle-tier jobs, pushing professionals into low-paying, highly stressful data annotation work required to train the very models replacing them.
* [01:49:25] Environmental Crisis: The massive supercomputers required to scale AI are creating severe environmental strains, heavily polluting the air and draining water resources in vulnerable communities.
* [01:55:04] Bicycles vs. Rockets: Instead of building massive, resource-heavy generalized language models ("rockets"), Hao argues we should focus on highly specialized, low-cost AI tools ("bicycles") like AlphaFold that offer immense public benefit with minimal harm.
Claude Token Optimisation - 70% reduction doing this.
Hitting your Claude subscription limit too often? Try this...

Your Claude bill isn't too high; the problem is that you're running the wrong model on the wrong tasks. Like taking a Ferrari to do the grocery run.

Instead of everyone running their own skills, build an environment where every skill your team runs gets logged centrally. Everyone accesses the same library of prompts, workflows, and model calls. No duplicated work and no siloed setups.

The model routing is where 70% of token savings comes from, because not every task needs Opus 4.7. Data lookups run on Haiku. The analysis layer runs on Sonnet. Opus earns its cost only on work that genuinely requires it.

Whilst tokens feel cheap right now, that won't stay the case as your team scales. Building this routing infrastructure today is how you avoid an AI bill that surprises you 12 months from now.

Here's one example of what a production-grade Claude setup looks like when you're running it across a whole business of 12 staff.
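As a rough illustration of the routing-plus-central-logging idea, here is a small Python sketch. It is not the setup from the post: the task categories, the `route` function, the in-memory `CALL_LOG`, and the choice to send unknown tasks to the middle tier are all assumptions. The tier names are just the labels used above, not real model identifiers, and no API is called.

```python
from dataclasses import dataclass
from typing import Dict, List

# Task kind -> model tier, following the split described in the post.
# The task kinds themselves are hypothetical labels.
ROUTES: Dict[str, str] = {
    "lookup": "haiku",     # simple data lookups
    "analysis": "sonnet",  # the analysis layer
    "complex": "opus",     # work that genuinely needs the largest model
}
DEFAULT_TIER = "sonnet"  # assumption: unknown task kinds go to the middle tier


@dataclass
class CallRecord:
    user: str
    task_kind: str
    tier: str


# Stand-in for a shared, central log; a real team would persist this somewhere.
CALL_LOG: List[CallRecord] = []


def route(user: str, task_kind: str) -> str:
    """Pick a tier for the task and record the decision in the shared log."""
    tier = ROUTES.get(task_kind, DEFAULT_TIER)
    CALL_LOG.append(CallRecord(user, task_kind, tier))
    return tier


def tier_counts() -> Dict[str, int]:
    """Summarize how many logged calls went to each tier."""
    counts: Dict[str, int] = {}
    for record in CALL_LOG:
        counts[record.tier] = counts.get(record.tier, 0) + 1
    return counts
```

Calling `route("ana", "lookup")` returns `"haiku"` and appends one record to the log, so `tier_counts()` shows how much work actually lands on each tier. The savings depend entirely on that mix and on per-tier pricing, neither of which this sketch models.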
EdgeModel
**The idea:**

**A platform where:**

1. Businesses can find specialized AI models (not general ChatGPT-style APIs)
2. Developers can train and sell AI models optimized for specific business use cases
3. Models are designed for edge deployment (low cost, offline, fast inference)
4. Everything is focused on reducing AI API costs and improving performance for real business workflows

**Think:** Instead of paying high API costs for generic AI, businesses use smaller, optimized models tailored to their exact use case (OCR, surveillance, retail analytics, automation, etc.).

**And developers earn money by:**

1. Selling trained models
2. Offering optimized deployments
3. Customizing models for businesses

**The problem I’m trying to solve:**

**A lot of companies are:**

* burning money on AI API calls
* struggling with latency and scaling costs
* unable to deploy AI models locally or efficiently
* relying on generic models that are not optimized for their workflows

My question to you:

* **Would businesses actually use something like this instead of just using OpenAI / APIs?**
* **If you are a developer, would you bother uploading/selling models like this?**
* **What would stop you from trusting or using a platform like this?**
* **Is this solving a real problem or does it sound unnecessary?**
* **Most importantly, would you personally sign up for something like this?**

I would really appreciate some honest feedback from you all! I’m not looking for validation; I want to know if this is actually needed in the market or just sounds good but won’t get real adoption. Appreciate any insights, especially from people who’ve built or used AI products in production.
Key features include: We set the benchmark for what’s possible with AI, Introducing Scale Labs, Scale AI and BAE Systems Combine Forces to Modernize the Tactical Edge.
Scale AI is commonly used for: Image classification for computer vision, Natural language processing for sentiment analysis, Object detection in autonomous vehicles, Speech recognition model training, Medical image analysis, Content moderation for social media platforms.
Scale AI integrates with: Amazon S3, Google Cloud Storage, Kubernetes, Slack, Jupyter Notebooks, TensorFlow, PyTorch, Microsoft Azure, DataRobot, Apache Airflow.
Based on user reviews and social mentions, the most common pain points are: API costs, token usage, cost tracking, and OpenAI bills.
Based on 126 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.
Robert Nishihara
Co-founder at Anyscale / Ray
3 mentions