Breaking · June 4, 2026

Cognition launches Devin Productivity Guarantee with $10M cap

Cognition said it will fund Devin usage up to $10 million when measured engineering value falls below cost, and published a technical writeup estimating productive engineering hours per session. It matters because the company is shifting agent pricing from tokens to claimed output and extending coding evaluation toward much longer task horizons.

TL;DR

Cognition's launch post introduced an "AI Productivity Guarantee" that covers Devin usage with free credits when measured engineering value comes in below cost, capped at $10 million.
The linked Devin guarantee page, also surfaced in Cognition's eligibility post, limits the program to enterprise customers deploying Devin Cloud at meaningful scale and meeting technical and engagement requirements.
In Cognition's technical-writeup post, the company said it runs an estimator on every completed Devin task, first classifying whether the work was useful, then estimating equivalent human engineering hours.
Cognition's methodology post, echoed by swyx's breakdown, says the estimator reached an rlog of 0.74 on a held-out set of 233 sessions after being trained on 258 sessions from 126 users across enterprise customers.
The same writeup is also a benchmark story: swyx framed it as a move past METR's roughly 16-hour ceiling, while scaling01 argued Cognition's new benchmark could still saturate quickly if model time horizons keep doubling.

You can read the guarantee terms, the full measurement writeup, and the launch framing in Scott Wu's post. The oddest detail is how much of the announcement is really about evaluation plumbing: a productivity classifier, a calibrated hour estimator, and a conservative payout scheme that only kicks in for large Devin Cloud rollouts.

Guarantee terms

The commercial hook is simple: Cognition compares Devin's output against usage, and if the measured engineering value falls short, it issues free usage credits to close the gap, up to $10 million. The company is explicitly pitching that as a break from token and activity metrics, with ScottWu46 arguing the industry should move from measuring sessions and spend to measuring output.

The official guarantee page adds the main constraint missing from the headline posts. The offer is for enterprise customers running Devin Cloud at meaningful scale, not for every self-serve seat, and enrollment still depends on technical and engagement requirements confirmed by Cognition's account team.

Measurement stack

The guarantee depends on a two-stage estimator described in Cognition's methodology post, followed by a dollar conversion and a comparison against usage:

Productivity filter: each completed Devin session is classified as productive or not.
Hour estimator: productive sessions are converted into equivalent human engineering hours.
Dollar conversion: those hours are translated into dollar value using engineering salaries.
Guarantee comparison: delivered value is compared against usage, and any shortfall is returned as credits.
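
The four steps above can be sketched as a small calculation. This is only an illustration of the arithmetic as the article describes it, not Cognition's implementation: the `Session` type, the function names, and the flat `hourly_rate_usd` are hypothetical, and the sample numbers are made up. Only the $10 million cap comes from the announcement.

```python
from dataclasses import dataclass

GUARANTEE_CAP_USD = 10_000_000  # cap stated in the announcement


@dataclass
class Session:
    productive: bool        # hypothetical: output of the productivity filter
    estimated_hours: float  # hypothetical: output of the hour estimator


def delivered_value(sessions, hourly_rate_usd):
    """Dollar value of productive sessions only (flat hourly rate is an assumption)."""
    return sum(s.estimated_hours * hourly_rate_usd for s in sessions if s.productive)


def credits_owed(sessions, usage_cost_usd, hourly_rate_usd):
    """Shortfall of delivered value against usage cost, floored at zero and capped."""
    shortfall = usage_cost_usd - delivered_value(sessions, hourly_rate_usd)
    return min(max(shortfall, 0.0), GUARANTEE_CAP_USD)


# Made-up example: two productive sessions (7.5 hours total) and one unproductive one.
sessions = [Session(True, 6.0), Session(False, 3.0), Session(True, 1.5)]
print(credits_owed(sessions, usage_cost_usd=1_000.0, hourly_rate_usd=100.0))  # 250.0
```

Flooring at zero reflects that credits are only issued when value falls short of cost. How Cognition actually handles surpluses, billing periods, or salary bands is not described in the article.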

Cognition says the ground-truth set came from 258 sessions reviewed by 126 users across enterprise customers. For sessions with PRs, merged PRs are counted and closed PRs are dropped; for sessions without PRs, a classifier keeps things like bug triage, dependency cleanup, security scans, analytics queries, and code review, while filtering cases where Devin lacked access or never got the clarification it needed.
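
As a rough illustration of that labeling rule, the sketch below uses a lookup table as a stand-in for the non-PR classifier, which in the writeup is a model rather than a fixed list. The function name, the string labels, and the choice to treat a "dropped" closed PR as simply not counted are all assumptions made for the example.

```python
from typing import Optional

# Non-PR work types the article lists as kept (labels are hypothetical).
KEPT_NON_PR_WORK = {
    "bug triage",
    "dependency cleanup",
    "security scan",
    "analytics query",
    "code review",
}
# Failure modes the article lists as filtered out (labels are hypothetical).
FILTERED_BLOCKERS = {"lacked access", "missing clarification"}


def is_productive(pr_status: Optional[str], work_type: str = "", blocker: str = "") -> bool:
    """Toy version of the rule: PR sessions count only when merged."""
    if pr_status is not None:
        # Assumption: anything other than "merged" (closed, still open) is not counted.
        return pr_status == "merged"
    if blocker in FILTERED_BLOCKERS:
        return False
    return work_type in KEPT_NON_PR_WORK


print(is_productive("merged"))                         # True
print(is_productive("closed"))                         # False
print(is_productive(None, work_type="bug triage"))     # True
print(is_productive(None, blocker="lacked access"))    # False
```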

The estimator itself uses more than the final diff. According to the writeup, it looks at user messages, produced PRs, the full agent trace, and extra codebase context from DeepWiki, then tries to reason about the human path to the same outcome rather than Devin's own detours.

Calibration limits

Cognition's most concrete metric is an rlog of 0.74 on a held-out set of 233 sessions, with about half of sessions landing within a factor of two of the human estimate, per the technical writeup. The company also says the model is good enough for aggregate planning, not for trusting every single session-level estimate.
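
For readers who want to see what those two numbers measure, here is a minimal sketch. It assumes "rlog" denotes a Pearson correlation computed on log-hours, which the article does not spell out, and the sample data is invented purely to exercise the functions.

```python
import math


def log_pearson(predicted, actual):
    """Pearson correlation between log(predicted) and log(actual)."""
    xs = [math.log(p) for p in predicted]
    ys = [math.log(a) for a in actual]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def share_within_factor(predicted, actual, factor=2.0):
    """Fraction of predictions within `factor` of the actual value, either way."""
    hits = sum(1 for p, a in zip(predicted, actual) if 1 / factor <= p / a <= factor)
    return hits / len(predicted)


# Invented hours, not Cognition's data.
predicted = [1.0, 3.0, 8.0, 2.0]
actual = [2.0, 2.0, 30.0, 9.0]

print(share_within_factor(predicted, actual))  # 0.5
print(log_pearson(predicted, actual))
```

Working in logs means a 2-hour task estimated at 4 hours counts as the same size of miss as a 20-hour task estimated at 40, which suits data that spans minutes to days.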

The writeup is unusually explicit about where the noise comes from. It says individual predictions regularly miss by a factor of 2 to 3 in either direction, and that roughly half the residual disagreement is between users rather than within one user's own sessions. It also says the summed estimates still undercount the human total, because an estimator that is unbiased in log space tracks a typical (geometric-mean) value rather than the arithmetic average, and so stays conservative when totals are added in linear space.
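
The log-space point can be checked with a toy case. The sketch below uses three invented task durations and the simplest possible log-unbiased estimator, a constant equal to their geometric mean. It is not Cognition's estimator, only a demonstration that zero average error in log space can coexist with an undercounted total.

```python
import math

actual_hours = [1.0, 4.0, 16.0]  # invented durations

# The geometric mean is the constant guess with zero mean error in log space.
log_mean = sum(math.log(h) for h in actual_hours) / len(actual_hours)
prediction = math.exp(log_mean)  # close to 4.0
predicted_hours = [prediction] * len(actual_hours)

mean_log_error = sum(
    math.log(p) - math.log(a) for p, a in zip(predicted_hours, actual_hours)
) / len(actual_hours)

print(round(mean_log_error, 9))                       # essentially zero
print(round(sum(predicted_hours), 6), sum(actual_hours))  # about 12 versus 21.0
```

The gap comes from the long right tail: the 16-hour task dominates the real total, but in log space it is offset by the 1-hour task.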

It also reports what did not work well. A simple predictor based only on lines changed reached an Rlog² of 0.27, far below the full estimator, which is a direct argument against treating diff size as a proxy for engineering value.
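
One plausible way to build such a baseline is a straight-line fit of log-hours on log-lines-changed, scored by R² in log space. The article does not say how Cognition set up its version, so the functional form, the function name, and the data below are all assumptions.

```python
import math


def log_r_squared(lines_changed, hours):
    """R^2 of an ordinary least-squares fit of log(hours) on log(lines_changed)."""
    xs = [math.log(v) for v in lines_changed]
    ys = [math.log(v) for v in hours]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Invented sessions where diff size says little about effort.
lines_changed = [10, 200, 50, 1000]
hours = [4.0, 1.0, 8.0, 3.0]
print(log_r_squared(lines_changed, hours))
```

A low score here means most of the variation in effort is left unexplained by diff size, which is the point the writeup makes with its 0.27 figure.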

Benchmark horizon

Buried inside the pricing announcement is a second claim: Cognition is trying to stretch coding evaluation toward longer real-world tasks. swyx contrasted the company's private enterprise evals, which he said run up to 100 hours, with METR's public work that tops out around 16 hours.

The official writeup is narrower than the tweet gloss. In its comparison section, Cognition's methodology post says METR estimated human-equivalent times from 34 Claude Code transcripts labeled by seven technical staff and reached an rlog of 0.83, while Cognition's own lower 0.74 likely reflects a more diverse enterprise dataset. That makes this launch less about beating METR on correlation and more about moving the measurement target into production Devin sessions.

The pushback arrived immediately. scaling01 argued that a benchmark good up to about 64 hours could still be obsolete before year end if coding-agent time horizons keep doubling, which turns the benchmark itself into a moving target rather than a fixed scoreboard.
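
The arithmetic behind that objection is short. In the sketch below, the 64-hour ceiling and the roughly 16-hour starting point are figures quoted in this article, while the 4-month doubling period is a hypothetical input chosen only to show the calculation. The function name is also an assumption.

```python
import math


def months_until_saturation(current_horizon_hours, ceiling_hours, doubling_months):
    """Months until a doubling time horizon reaches the benchmark ceiling."""
    if current_horizon_hours >= ceiling_hours:
        return 0.0
    doublings = math.log2(ceiling_hours / current_horizon_hours)
    return doublings * doubling_months


# Two doublings separate 16 hours from 64 hours.
print(months_until_saturation(16, 64, doubling_months=4))  # 8.0
```

Under those inputs the ceiling is reached in eight months. A slower doubling period stretches that out proportionally, so the conclusion depends entirely on which growth rate one believes.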