Release · June 30, 2026

Anthropic launches Claude Sonnet 5 with 1M context and adaptive thinking

Anthropic launched Claude Sonnet 5 across Claude, the API, and Claude Code with 1M context, adaptive thinking, and $2/$10 intro pricing through Aug. 31. Independent evals place it near Opus 4.8 on coding and tool use, so teams should benchmark it against Opus before switching.

TL;DR

Anthropic shipped Claude Sonnet 5 as the new default across Claude, the API, and Claude Code, with a 1M context window, adaptive thinking by default, and introductory $2 per million input tokens and $10 per million output tokens through Aug. 31, according to claudeai's rollout post, ClaudeDevs' product thread, and ClaudeCodeLog's changelog.
Anthropic's own chart puts Sonnet 5 well above Sonnet 4.6 on SWE-bench Pro, Terminal-Bench 2.1, Humanity's Last Exam, and OSWorld-Verified, while ClaudeDevs' benchmark table still shows Opus 4.8 ahead on every official benchmark except GDPval-AA v2, where Sonnet 5 is up by 3 points.
The fine print is the real story: footnote 2 of the launch announcement says Sonnet 5 uses a new tokenizer that can map the same input to roughly 1.0x to 1.35x as many tokens, and Artificial Analysis found the model cost more per task than Opus 4.8 at standard pricing ($2.29 versus $1.80).
The model also changes behavior, not just scores: ClaudeDevs' adaptive thinking note says extended thinking is gone, ClaudeDevs' agent note says unattended runs recover better and self-check more often, and ClaudeDevs' migration post points to a built-in /claude-api skill for prompt tuning and advisor mode.
Day-one rollout was wide. Cursor, Cognition, Vercel, Perplexity, and Warp all shipped Sonnet 5 support on launch day.

You can open Anthropic's announcement, skim the prompting guide, and compare both against Simon Willison's notes on the tokenizer. The ecosystem rollout was fast enough that Perplexity added it as a Computer orchestrator, Hyperbrowser plugged it into browser agents, and Conductor exposed a 1M-context Sonnet 5 selector the same day.

What shipped

Default surfaces:
API and model IDs:
Context and outputs:
Pricing:
New defaults:

Benchmarks that moved

First-party

SWE-bench Pro: 58.1% → 63.2%, +5.1 points, per ClaudeDevs' chart.
Terminal-Bench 2.1: 67.0% → 80.4%, +13.4 points, per the same chart.
Humanity's Last Exam, no tools: 34.6% → 43.2%, +8.6 points, per the same chart.
Humanity's Last Exam, with tools: 46.8% → 57.4%, +10.6 points, per the same chart.
OSWorld-Verified: 78.5% → 81.2%, +2.7 points, per the same chart.
GDPval-AA v2: 1395 → 1618, +223 points, per the same chart.
BrowseComp, single agent: 76.2% → 84.7%, +8.5 points.
BrowseComp, multi-agent: 76.2% → 86.6%, +10.4 points.
AutomationBench: 5.3 → 13.5, +8.2 points.
HealthBench Professional: 44.2 → 57.8, +13.6 points.
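
The deltas in this list are absolute point differences, which can hide how large a change is relative to the starting score. A minimal Python sketch, using only figures quoted in the list (the dictionary layout and the `deltas` helper are illustrative names, not anything from Anthropic), recomputes both the absolute and the relative change:

```python
# Scores copied from the first-party list: (Sonnet 4.6, Sonnet 5).
# The dict layout and helper name are illustrative assumptions.
FIRST_PARTY = {
    "SWE-bench Pro": (58.1, 63.2),
    "Terminal-Bench 2.1": (67.0, 80.4),
    "HLE, no tools": (34.6, 43.2),
    "HLE, with tools": (46.8, 57.4),
    "OSWorld-Verified": (78.5, 81.2),
    "AutomationBench": (5.3, 13.5),
}


def deltas(old: float, new: float) -> tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    absolute = round(new - old, 1)
    relative = round((new - old) / old * 100, 1)
    return absolute, relative


for name, (old, new) in FIRST_PARTY.items():
    abs_pts, rel_pct = deltas(old, new)
    print(f"{name}: {abs_pts:+.1f} points ({rel_pct:+.1f}% relative)")
```

Read this way, AutomationBench's +8.2 points is more than a 150% relative jump from a low base, while OSWorld-Verified's +2.7 points is a relative gain of roughly 3%.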

Third-party evaluators

Artificial Analysis Intelligence Index: 47 → 53, +6 points, per Artificial Analysis' index thread.
AA-Briefcase Elo: 1079 → 1393, +314 points.
GDPval-AA v2 leaderboard: 1373 → 1600, +227 points.
FrontierCode Extended: 33.6% → 53.8%, +20.2 points, per Cognition's FrontierCode post.
CursorBench default run: 49% → 57%, +8 points, per Cursor's launch post.

Customer-reported

Box AI Complex Work Eval, Energy domain: Sonnet 4.6 baseline → Sonnet 5, +4.7 points, per Aaron Levie's benchmark thread.
Box AI Complex Work Eval, Retail domain: Sonnet 4.6 baseline → Sonnet 5, +4.4 points, per the same thread.
Box AI Complex Work Eval, Professional Services: Sonnet 4.6 baseline → Sonnet 5, +2.6 points, per the same thread.
Ramp SWE-Bench harness costs: Sonnet 4.6 baseline → Sonnet 5, higher cost and more tokens, per RampLabs' hands-on run.

Where it regressed

Anthropic's own comparison table still leaves Opus 4.8 ahead on every official benchmark except GDPval-AA v2. ClaudeDevs' chart shows Sonnet 5 trailing Opus 4.8 by 6.0 points on SWE-bench Pro, 2.3 points on Terminal-Bench 2.1, 6.6 points on Humanity's Last Exam without tools, 0.5 points with tools, and 2.2 points on OSWorld-Verified.

The strangest drop is cyber. Sonnet 5 scores 52.7% on targeted vulnerability reproduction, down from 65.2% for Sonnet 4.6, and rohanpaul_ai's system-card summary notes that Anthropic explicitly said Sonnet 5 was not deliberately trained for cyber tasks.

The cost-performance story also bends the wrong way at higher effort. Artificial Analysis said Sonnet 5 used about 40% more output tokens per Intelligence Index task than Sonnet 4.6 and landed at $2.29 per task under standard pricing, versus $1.80 for Opus 4.8, while bridgemindai's CursorBench screenshot showed Sonnet 5 Max scoring 61.2% at $6.87 per task versus Opus 4.8 Max at 63.8% and $7.59.
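
The two cost comparisons in that paragraph can be put on a common footing with a little arithmetic. The sketch below uses only the quoted figures. The function names and the "dollars per benchmark point" metric are illustrative assumptions, a crude proxy rather than anything Artificial Analysis or Cursor publishes:

```python
# Figures quoted in the paragraph above (Artificial Analysis, bridgemindai).
# Function names and the $/point metric are illustrative assumptions.

def cost_ratio(cost_a: float, cost_b: float) -> float:
    """How many times more expensive A is than B."""
    return cost_a / cost_b


def dollars_per_point(cost_per_task: float, score_pct: float) -> float:
    """Cost per task divided by benchmark score: a crude efficiency proxy."""
    return cost_per_task / score_pct


# Intelligence Index: Sonnet 5 at $2.29 per task vs Opus 4.8 at $1.80.
index_ratio = cost_ratio(2.29, 1.80)

# CursorBench Max runs: (cost per task, score in percent).
sonnet_max = dollars_per_point(6.87, 61.2)
opus_max = dollars_per_point(7.59, 63.8)

print(f"Intelligence Index cost, Sonnet 5 vs Opus 4.8: {index_ratio:.2f}x")
print(f"CursorBench $/point, Sonnet 5 Max: {sonnet_max:.4f}")
print(f"CursorBench $/point, Opus 4.8 Max: {opus_max:.4f}")
```

On these figures Sonnet 5 costs about 1.27x as much per Intelligence Index task as Opus 4.8, while on the CursorBench Max runs its cost per point is slightly lower. The answer to "which is cheaper" therefore depends on the workload.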

Anthropic did answer part of that criticism in its launch footnote, which says the updated tokenizer can inflate token counts by up to 35% and that the intro price was set to make the move from Sonnet 4.6 "roughly cost-neutral."
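
The "roughly cost-neutral" claim comes down to simple arithmetic: if the same text becomes up to 1.35x as many tokens, the per-token price has to fall by the inverse factor for the bill to stay flat. A minimal sketch under stated assumptions follows. The intro prices are the ones quoted in this article, but the article does not state Sonnet 4.6's rate, so `HYPOTHETICAL_OLD_INPUT` is a placeholder to be replaced with the real price from your own invoices:

```python
def effective_price(new_price: float, inflation: float) -> float:
    """Price per million old-tokenizer tokens once token counts inflate."""
    return new_price * inflation


def break_even_inflation(old_price: float, new_price: float) -> float:
    """Inflation factor at which the new price costs the same as the old."""
    return old_price / new_price


INTRO_INPUT = 2.00    # $/M input tokens, intro price quoted in the article
INTRO_OUTPUT = 10.00  # $/M output tokens, intro price quoted in the article

# Sweep the inflation range the footnote describes (1.0x to 1.35x).
for inflation in (1.0, 1.15, 1.35):
    inp = effective_price(INTRO_INPUT, inflation)
    out = effective_price(INTRO_OUTPUT, inflation)
    print(f"{inflation:.2f}x tokens -> ${inp:.2f} in / ${out:.2f} out per M old tokens")

# Placeholder only: NOT a figure from the article. Substitute your real rate.
HYPOTHETICAL_OLD_INPUT = 3.00
print(f"Break-even inflation: {break_even_inflation(HYPOTHETICAL_OLD_INPUT, INTRO_INPUT):.2f}x")
```

At the worst-case 1.35x inflation, the intro rates behave like $2.70 and $13.50 per million old-tokenizer tokens. Whether that is cost-neutral for a given team depends on what it paid before and on how much its own inputs inflate.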

Under the hood

The release changed the control surface in a few concrete ways.

Thinking:
Tokenization and billing:
Vision and computer use:
Agent behavior:

Vibe Check

Hands-on reports clustered around the same behavioral shift: Sonnet 5 works harder, finishes longer runs more often, and is less obviously cheap once you push it.

Steve Yegge said Sonnet 5 dropped Opus 4.8's habit of inventing small complaints, a tiny detail that says more about review quality than another benchmark table.
RampLabs said the model ran more tests, spent more time in the harness, and emitted more tokens than Sonnet 4.6.
skeptrune said swapping a lighter support assistant from Sonnet 4.6 to Sonnet 5 made sense, but moving a heavier content-authoring agent from Opus to Sonnet would not.
skirano argued the model felt sharper on long-horizon work than the official benchmark gap suggested, while skirano's follow-up said the change showed up in the "working mind" more than in headline scores.
bridgemindai's car-wash test is silly evidence, but it does show the flavor of the release: people immediately started probing for plainspoken judgment, not just raw code output.

Where it shows up

The partner rollout was unusually broad for day one.

IDEs and coding agents:
Inference and gateways:
Consumer and agent products: