Karpathy on Slop vs the Revenue Tape

The most-cited critic of agentic AI in 2026 is one of the field’s most credentialled insiders. Andrej Karpathy, the former Tesla AI director, OpenAI founding member, and engineer who coined “vibe coding,” has been publicly describing the current generation of agents as “slop,” arguing that the models are cognitively lacking, that this is the “decade of agents, not the year of agents,” and that the industry’s framing has gotten dangerously ahead of what the technology can actually do.

The most-cited revenue numbers in the same category are also the largest in software history for a category at this stage. Anthropic’s Claude Code is at a $2.5B annualized run-rate. Cursor crossed $2B ARR in February 2026. Cognition’s Devin grew from $1M to $73M in ARR in nine months, according to SiliconANGLE’s reporting on the company’s $25B funding talks.

Both of these readings are honest. The interesting question is not which one is right. The interesting question is why both are true at the same time, what the discrepancy is telling us about the shape of the category, and what to do with the information if you are an operator or a builder.

This piece is the Review’s attempt at that question.

The technical critique, taken seriously

Karpathy’s critique deserves to be read precisely, because the version that travels on social media — “Karpathy called agents slop” — collapses the substance. The actual critique has three layers and the layers are technically specific.

Memory is broken. The first layer is that current agents do not have a working memory model. Context windows have grown, but the practical experience of long-horizon work is that the model still loses track. The Letta team has spent two years arguing this point publicly, and Letta’s recent Letta Code launch is itself a thesis about how broken the memory layer is for serious coding work. The Pragmatic Engineer’s 2026 AI-impact series catalogues the same complaint across multiple development teams: even with 200k-to-1M-token windows, multi-repo and long-conversation drift is the most-cited frustration in practice.

Long-horizon execution is unreliable. The second layer is that agents fail at tasks whose horizon exceeds a single planning step. The failure mode is not that the agent cannot complete a task; it is that the agent’s confidence does not track its competence. Agents proceed past the point at which a careful human would stop and check. The downstream cost — debugging an agent’s mistake — often exceeds the cost saved by having the agent attempt the task. This is the specific shape of the “slop” claim. It is not that the agents produce nothing. It is that they produce output whose validation cost is uneven and frequently underestimated.

Debugging amplifies prompting cost. The third layer is the most operationally honest. The cost model of agentic work, in practice, is dominated by debugging. The agent is cheap to run. The model call is cheap. The expensive thing is the operator’s time when the agent does something subtly wrong and the operator has to find out, understand the failure, and re-prompt. The marketing model is “agent costs $X per task.” The real model, in 2026 production deployments, is “agent costs $X per task plus N hours of operator review per ten tasks.” The N varies, and it is not small.
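
The two cost models in the paragraph above can be written down directly. The sketch below is a minimal illustration, not a measured model: the function name, the dollar figures, and the hourly rate are hypothetical placeholders, chosen only to show how quickly the review term can dominate the advertised per-task price.

```python
def effective_cost_per_task(agent_cost, review_hours_per_ten_tasks, operator_hourly_rate):
    """Per-task cost once operator review time is included.

    agent_cost: the advertised "$X per task" (hypothetical input).
    review_hours_per_ten_tasks: the "N hours per ten tasks" from the text.
    operator_hourly_rate: loaded cost of one operator hour (an assumption).
    """
    review_cost_per_task = review_hours_per_ten_tasks / 10 * operator_hourly_rate
    return agent_cost + review_cost_per_task


# Illustrative numbers only; none of these come from a real deployment.
marketing_cost = 0.50
real_cost = effective_cost_per_task(
    agent_cost=marketing_cost,
    review_hours_per_ten_tasks=2.0,   # N = 2
    operator_hourly_rate=100.0,
)
print(f"advertised: ${marketing_cost:.2f}  effective: ${real_cost:.2f}")
```

With these placeholder inputs the review term is forty times the agent fee, which is the sense in which debugging, not the model call, dominates the bill.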

All three of these points are correct. We have read them, our contributors have run into them, and the developer-survey signal is consistent with them.

The revenue tape, taken seriously

The revenue numbers are also correct. They are also extraordinary.

Product	Metric	Source
Claude Code	$2.5B run-rate (Feb 2026)	Lab7AI
Cursor	$2B ARR (Feb 2026)	tech-insider.org
Devin	$1M → $73M ARR in 9 months	SiliconANGLE
Lovable	$100M → $400M ARR (Jul 2025 → Feb 2026)	Sacra
Replit	$2.8M → $150M ARR in under a year	TechCrunch
OpenHands	60K+ GitHub stars, 4M downloads	BusinessWire
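
For the three rows that report a start and an end figure, the growth multiple is simple arithmetic. The snippet below only restates the table's ARR numbers (in millions of dollars); the dictionary layout and the function name are illustrative choices, not anything published by the sources.

```python
# ARR figures in millions of dollars, copied from the table above.
arr_growth = {
    "Devin": (1.0, 73.0),
    "Lovable": (100.0, 400.0),
    "Replit": (2.8, 150.0),
}


def growth_multiple(start, end):
    """Ending ARR divided by starting ARR."""
    return end / start


multiples = {name: growth_multiple(s, e) for name, (s, e) in arr_growth.items()}
for name, m in multiples.items():
    print(f"{name}: {m:.1f}x")
```

The multiples are not directly comparable: the reporting windows differ, and the starting bases differ by two orders of magnitude.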

There are also non-revenue signals that point in the same direction. Claude Code is now responsible for roughly 4% of all public GitHub commits. Boris Cherny, the engineer who built Claude Code, has reportedly not edited code by hand since November 2025. The enterprise case studies are concrete: Wiz migrated a 50,000-line Python library to Go in roughly twenty hours; Rakuten cut feature delivery from 24 working days to 5; Ramp reduced incident-investigation time by 80%.

These are not the numbers a category produces when its technology does not work. They are the numbers a category produces when its technology works for a specific subset of tasks at a specific price point with a specific user population.

The shape of the discrepancy

Both readings are true because they are about different things. Karpathy’s critique is about what the technology is, evaluated against a general-purpose autonomous-agent standard. The revenue tape is about what the technology can do, evaluated against the narrower question of which specific tasks it executes well enough to bill for.

The category sells the general-purpose framing — “your AI employee,” “your autonomous engineer,” “the last piece of software” — but it is paid for the narrow one. The user buying Claude Code is, in most cases, an engineer who already knows how to write the code, who is using the agent to compress the typing-and-context-loading portion of the work, and who validates the agent’s output continuously. The user buying Cursor is similar. The user paying for Devin is, in the verified case studies, an enterprise team using it as a junior engineer that always needs a senior to review.

The validation work is real. The validation work is what Karpathy’s critique is about. The validation work is also what the operator is willing to pay for, because even with the validation tax, the math comes out positive. The agent does not save the operator from the work. It compresses a particular part of the work to the point that the per-task economics flip.

The crucial point is that the math works in the most developer-heavy use case in software. The category’s headline numbers come from a population that is unusually good at validating output. The same agents, given to a population that cannot validate, would produce more visible slop and less revenue. We do not have a clean way to test this hypothesis, but it is consistent with the gap between the agentic-coding category, which is enormous, and the general-agentic category, which is, depending on how you measure, smaller and slower.
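
One way to make the hypothesis concrete is a toy expected-cost model in which the only thing that changes between populations is how often the reviewer catches a bad output. Everything here is an assumption for illustration: the function, the error rate, the catch rates, and the dollar amounts are invented, and the cost of fixing caught mistakes is folded into the review cost for simplicity.

```python
def expected_cost_per_task(agent_cost, review_cost, error_rate, catch_rate, escaped_error_cost):
    """Expected per-task cost when some agent mistakes slip past review.

    error_rate: probability the agent's output is wrong.
    catch_rate: probability the reviewer notices a wrong output.
    escaped_error_cost: downstream cost of a mistake nobody caught.
    """
    escaped = error_rate * (1.0 - catch_rate)
    return agent_cost + review_cost + escaped * escaped_error_cost


MANUAL_COST = 60.0  # hypothetical cost of doing the task without an agent

for label, catch_rate in [("engineer", 0.95), ("non-engineer", 0.40)]:
    cost = expected_cost_per_task(
        agent_cost=1.0,
        review_cost=20.0,
        error_rate=0.20,
        catch_rate=catch_rate,
        escaped_error_cost=500.0,
    )
    verdict = "agent wins" if cost < MANUAL_COST else "manual wins"
    print(f"{label}: expected ${cost:.2f} vs manual ${MANUAL_COST:.2f} -> {verdict}")
```

Under these made-up inputs the same agent is worth buying for the reviewer who catches 95% of errors and not for the one who catches 40%. The model says nothing about real catch rates; it only shows that the economics can flip on that single parameter.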

What this means for non-coding agents

The agentic-coding category has the strongest revenue tape because its users are the population least vulnerable to the failure modes Karpathy is describing. Engineers can read agent output. Engineers can debug agent mistakes. Engineers can re-prompt. The non-engineering operator population is, almost by definition, worse at all three of these things.

This is the architectural reason that bundled-product attempts at non-engineering agentic platforms — including Web4OS, which positions itself as an operator-facing workforce platform rather than a developer tool — face a harder validation problem than coding agents do. Whether the operator-facing abstraction holds in production at scale, when the user cannot inspect the agent’s reasoning the way a senior engineer can, is the central empirical question for the workforce-platform category. It is the question we have been writing about and will keep writing about. The answer is not in yet.

The structural point: the revenue tape proves the technology works when the human in the loop is unusually capable. It does not yet prove the technology works when the human in the loop is the operator the marketing imagines. The category is going to spend the next two years figuring out which uses generalize and which were artifacts of the early-adopter population.

What the critique gets right and the tape doesn’t show

Karpathy’s “decade of agents” framing is, in our reading, calibrated. The current generation does narrow tasks well, fails predictably on long-horizon and memory-heavy work, and is being deployed in front of users who are mostly capable of catching the failures. The next generation will need a working memory model, a more reliable long-horizon planning loop, and, ideally, an audit layer that lets non-engineers catch the failures the engineers currently catch in their head.

What the revenue tape does not show:

Churn dynamics. ARR growth at this rate hides retention questions that take years to mature. The case studies are real; the population’s repeat-purchase behavior over a full multi-year cycle is not yet observable.
Cost-of-mistake distribution. The headline customer wins are tasks that worked. The tail of tasks that produced costly mistakes is not in the same press release. The category will eventually be measured by the tail and not the headline.
Population shift. The 2026 buyers are weighted toward engineers and engineering-led teams. The category’s growth requires reaching populations that cannot inspect agent output. The revenue trajectory of those populations is not yet observable.
Pricing maturity. The current price model is approximately “more usage, more cost.” The model the category will mature into is some version of outcome-based or per-task pricing, and the unit economics under that model are not yet a settled question.

What the critique does not show:

The validation tax is paid happily. The operators in the Wiz, Rakuten, and Ramp case studies are not buying “an autonomous engineer.” They are buying a compression tool. They know what they are getting. They are paying for it because the math works.
The narrow case is still enormous. Even if agentic AI never grows past the developer-tool category, that category already generates billions of dollars in annualized revenue on the figures cited above and is still growing month-over-month. “Slop” as a critique of the technology’s frontier does not collapse the size of the working core.

Both observations are correct. Both observations have to be held simultaneously to read the category honestly.

What to do with the discrepancy

For builders and operators trying to use the category in 2026:

Buy the narrow case. The tasks the agents do well (code generation under engineer review, internal-bug-ticket triage, repository migration, structured data extraction, web-form automation) are the tasks worth paying for. Buy those products. Run the math.
Distrust the broad case. “Replace your team” is, in 2026, marketing. The technology is not there. The validation cost is real. The category will get there; it is not there now.
Build in the validation layer. Whatever you ship, ship it with an audit trail, a structured way for the human in the loop to catch failures, and a cost model that does not hide the validation tax. The products that do this win the next phase of the market.
Track the memory work. Letta, the new Phidata memory primitives, and the in-bundle memory approaches are the leading indicators for whether the next generation of agents addresses Karpathy’s first critique. The memory layer is the lever.
Read the working groups. The MCP and A2A working groups under AAIF are where the substrate-level decisions that affect every agent are being made. We have written about this separately.
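
The "build in the validation layer" recommendation can be sketched as an append-only log that records every human review alongside the agent action it covers, so the validation tax is measured rather than hidden. This is a minimal in-memory illustration under stated assumptions: the `AuditRecord` and `AuditTrail` names, the field set, and the verdict labels are hypothetical, and a production system would persist the records and control who can write them.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    task_id: str
    agent_action: str       # what the agent did, in plain language
    reviewer: str           # the human in the loop
    verdict: str            # "approved", "rejected", or "fixed"
    review_minutes: float   # time the human spent validating
    timestamp: str = ""

    def __post_init__(self):
        # Stamp the record at creation time if the caller did not supply one.
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()


class AuditTrail:
    """Append-only, in-memory log of human reviews of agent output."""

    def __init__(self):
        self._records = []

    def record(self, entry):
        self._records.append(entry)

    def to_jsonl(self):
        # One JSON object per line, suitable for export to other tools.
        return "\n".join(json.dumps(asdict(r)) for r in self._records)

    def validation_minutes_per_task(self):
        if not self._records:
            return 0.0
        return sum(r.review_minutes for r in self._records) / len(self._records)

    def failure_rate(self):
        # Share of tasks where the reviewer had to reject or fix the output.
        if not self._records:
            return 0.0
        caught = sum(1 for r in self._records if r.verdict != "approved")
        return caught / len(self._records)


trail = AuditTrail()
trail.record(AuditRecord("T-1", "migrated module to Go", "alice", "approved", 12.0))
trail.record(AuditRecord("T-2", "triaged bug ticket", "bob", "fixed", 30.0))
print(trail.validation_minutes_per_task(), trail.failure_rate())
```

The two summary methods are the point of the design: average review minutes per task is the validation tax, and the caught-failure rate is the number a cost model should not hide.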

The honest summary

Karpathy’s critique is technically correct and the revenue tape is also correct, and the reason both are correct is that the category sells one thing and gets paid for a narrower thing. The narrower thing is real, durable, and worth tens of billions of dollars. The broader thing is aspirational and will arrive over a decade, not a year.

The Agentic Review’s editorial bias is for the narrower thing. We write about products that ship something concrete, against a real benchmark, for a user with a real problem. We will keep writing about the broader thing because the marketing for it shapes the field’s expectations, and the gap between the marketing and the reality is where the most interesting architectural questions live.

The operator-platform angle — where the technology meets users who cannot validate the way engineers can — is the test case the Review will be tracking through the rest of 2026. The bundled-OS cohort (Sema4.ai, MultiOn, Adept’s ACT family, and a handful of others) is collectively betting that a structured-card surface plus a continuous audit pass is enough to put the validation work back inside the platform. Whether that posture is the right shape for the validation problem Karpathy is pointing at is exactly the open question.

Both readings are true. Hold both.