AI Primer
release

Anthropic launches Claude Sonnet 5 with 1M context and adaptive thinking

Anthropic launched Claude Sonnet 5 across Claude, the API, and Claude Code with 1M context, adaptive thinking, and $2/$10 intro pricing through Aug. 31. Independent evals place it near Opus 4.8 on coding and tool use, so teams should benchmark it against Opus before switching.

9 min read

TL;DR

  • Anthropic shipped Claude Sonnet 5 as the new default across Claude, the API, and Claude Code, with a 1M context window, adaptive thinking by default, and introductory $2 per million input tokens and $10 per million output tokens through Aug. 31, according to claudeai's rollout post, ClaudeDevs' product thread, and ClaudeCodeLog's changelog.
  • Anthropic's own chart puts Sonnet 5 well above Sonnet 4.6 on SWE-bench Pro, Terminal-Bench 2.1, Humanity's Last Exam, and OSWorld-Verified, while ClaudeDevs' benchmark table still shows Opus 4.8 ahead on every official benchmark except GDPval-AA v2, where Sonnet 5 is up by 3 points.
  • The fine print is the real story: footnote 2 in Anthropic's launch announcement says Sonnet 5 uses a new tokenizer that can map the same input to roughly 1.0x to 1.35x as many tokens, and Artificial Analysis found the model cost 15% more per task than Opus 4.8 at standard pricing.
  • The model also changes behavior, not just scores: ClaudeDevs' adaptive thinking note says extended thinking is gone, ClaudeDevs' agent note says unattended runs recover better and self-check more often, and ClaudeDevs' migration post points to a built-in /claude-api skill for prompt tuning and advisor mode.
  • Day-one rollout was wide. Cursor, Cognition, Vercel, Perplexity, and Warp all shipped Sonnet 5 support on launch day.

You can open Anthropic's announcement, skim the prompting guide, and compare both against Simon Willison's notes on the tokenizer. The ecosystem rollout was fast enough that Perplexity added it as a Computer orchestrator, Hyperbrowser plugged it into browser agents, and Conductor exposed a 1M-context Sonnet 5 selector the same day.

What shipped

  • Default surfaces:
  • API and model IDs:
  • Context and outputs:
  • Pricing:
  • New defaults:

Benchmarks that moved

First-party

  • SWE-bench Pro: 58.1% → 63.2%, +5.1 points, per ClaudeDevs' chart.
  • Terminal-Bench 2.1: 67.0% → 80.4%, +13.4 points, per the same chart.
  • Humanity's Last Exam, no tools: 34.6% → 43.2%, +8.6 points, per the same chart.
  • Humanity's Last Exam, with tools: 46.8% → 57.4%, +10.6 points, per the same chart.
  • OSWorld-Verified: 78.5% → 81.2%, +2.7 points, per the same chart.
  • GDPval-AA v2: 1395 → 1618, +223 points, per the same chart.
  • BrowseComp, single agent: 76.2% → 84.7%, +8.5 points.
  • BrowseComp, multi-agent: 76.2% → 86.6%, +10.4 points.
  • AutomationBench: 5.3 → 13.5, +8.2 points.
  • HealthBench Professional: 44.2 → 57.8, +13.6 points.

Third-party evaluators

Customer-reported

  • Box AI Complex Work Eval, Energy domain: Sonnet 4.6 baseline → Sonnet 5, +4.7 points, per Aaron Levie's benchmark thread.
  • Box AI Complex Work Eval, Retail domain: Sonnet 4.6 baseline → Sonnet 5, +4.4 points, per the same thread.
  • Box AI Complex Work Eval, Professional Services: Sonnet 4.6 baseline → Sonnet 5, +2.6 points, per the same thread.
  • Ramp SWE-Bench harness costs: Sonnet 4.6 baseline → Sonnet 5, higher cost and more tokens, per RampLabs' hands-on run.

Where it regressed

Anthropic's own comparison table still leaves Opus 4.8 ahead on every official benchmark except GDPval-AA v2. ClaudeDevs' chart shows Sonnet 5 trailing Opus 4.8 by 6.0 points on SWE-bench Pro, 2.3 points on Terminal-Bench 2.1, 6.6 points on Humanity's Last Exam without tools, 0.5 points with tools, and 2.2 points on OSWorld-Verified.
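
Taken together with the first-party Sonnet 5 scores listed earlier, those gaps imply a set of Opus 4.8 scores. The sketch below simply adds each quoted gap to the matching Sonnet 5 score; the results are derived figures, not numbers reported in the article, and the function name is made up for illustration.

```python
# Sonnet 5 scores from the first-party list (percent).
SONNET_5 = {
    "SWE-bench Pro": 63.2,
    "Terminal-Bench 2.1": 80.4,
    "Humanity's Last Exam, no tools": 43.2,
    "Humanity's Last Exam, with tools": 57.4,
    "OSWorld-Verified": 81.2,
}

# Points by which Sonnet 5 trails Opus 4.8, as quoted from ClaudeDevs' chart.
GAP_TO_OPUS = {
    "SWE-bench Pro": 6.0,
    "Terminal-Bench 2.1": 2.3,
    "Humanity's Last Exam, no tools": 6.6,
    "Humanity's Last Exam, with tools": 0.5,
    "OSWorld-Verified": 2.2,
}


def implied_opus_scores(scores, gaps):
    """Derive an Opus 4.8 score per benchmark as Sonnet 5 score + gap."""
    return {name: round(scores[name] + gaps[name], 1) for name in scores}


IMPLIED_OPUS = implied_opus_scores(SONNET_5, GAP_TO_OPUS)

for name, opus in IMPLIED_OPUS.items():
    print(f"{name}: Sonnet 5 {SONNET_5[name]:.1f}% vs implied Opus 4.8 {opus:.1f}%")
```

The narrowest gap is Humanity's Last Exam with tools, and the widest are SWE-bench Pro and Humanity's Last Exam without tools.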

The strangest drop is cyber. The reported figure puts Sonnet 5 at 52.7% on targeted vulnerability reproduction, down 12.5 points from 65.2% for Sonnet 4.6, and rohanpaul_ai's system-card summary notes that Anthropic explicitly said Sonnet 5 was not deliberately trained for cyber tasks.

The cost-performance story also bends the wrong way at higher effort. Artificial Analysis said Sonnet 5 used about 40% more output tokens per Intelligence Index task than Sonnet 4.6 and landed at $2.29 per task under standard pricing, versus $1.80 for Opus 4.8, while bridgemindai's CursorBench screenshot showed Sonnet 5 Max scoring 61.2% at $6.87 per task versus Opus 4.8 Max at 63.8% and $7.59.
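
One way to read the CursorBench pair is to ask how much cost the cheaper run saves and how many score points it gives up. The following is a rough sketch using only the two figures quoted above; `usd_per_point` is a made-up convenience metric for illustration, not something either source reports.

```python
# CursorBench figures quoted above (from bridgemindai's screenshot).
RUNS = {
    "Sonnet 5 Max": {"score_pct": 61.2, "usd_per_task": 6.87},
    "Opus 4.8 Max": {"score_pct": 63.8, "usd_per_task": 7.59},
}


def usd_per_point(run):
    """Hypothetical metric: dollars per task divided by benchmark score."""
    return run["usd_per_task"] / run["score_pct"]


def compare(cheaper, pricier):
    """Return (fractional cost saving, score points given up) for cheaper vs pricier."""
    saving = 1.0 - cheaper["usd_per_task"] / pricier["usd_per_task"]
    points_lost = pricier["score_pct"] - cheaper["score_pct"]
    return saving, points_lost


saving, points_lost = compare(RUNS["Sonnet 5 Max"], RUNS["Opus 4.8 Max"])
print(f"Sonnet 5 Max: {saving:.1%} cheaper per task, {points_lost:.1f} points lower")
for name, run in RUNS.items():
    print(f"{name}: ${usd_per_point(run):.3f} per score point")
```

On these numbers the saving at maximum effort is under 10% per task, which is the sense in which the discount narrows once the model is pushed.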

Anthropic did answer part of that criticism in its launch footnote, which says the updated tokenizer can inflate token counts by up to 35%, and that the intro price was set to make the move from Sonnet 4.6 "roughly cost-neutral."
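
The arithmetic behind that footnote can be sketched as follows. The example assumes the inflation factor applies uniformly to input and output tokens, which the article does not state, and expresses the $2/$10 intro prices per million tokens as counted by the old tokenizer. The names are illustrative.

```python
# Intro prices from the article, in USD per million tokens.
INTRO_PRICE_PER_MTOK = {"input": 2.00, "output": 10.00}

# Tokenizer inflation range from the footnote: same text, 1.0x to 1.35x as many tokens.
INFLATION_RANGE = (1.00, 1.35)


def effective_price(price_per_mtok, inflation):
    """Price per million old-tokenizer tokens after tokenizer inflation.

    Assumption: text that used to cost N tokens now costs N * inflation tokens.
    """
    return price_per_mtok * inflation


EFFECTIVE = {
    kind: tuple(round(effective_price(price, f), 2) for f in INFLATION_RANGE)
    for kind, price in INTRO_PRICE_PER_MTOK.items()
}

for kind, (low, high) in EFFECTIVE.items():
    print(f"{kind}: ${low:.2f} to ${high:.2f} per million old-tokenizer tokens")
```

Under this reading, a move is cost-neutral for a workload when the previous model's per-token price falls inside the printed range for that workload's actual inflation factor, so measuring the factor on your own prompts matters more than the headline price.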

Under the hood

The release changed the control surface in a few concrete ways.

  • Thinking:
  • Tokenization and billing:
  • Vision and computer use:
  • Agent behavior:

Vibe Check

Hands-on reports clustered around the same behavioral shift: Sonnet 5 works harder, finishes longer runs more often, and is less obviously cheap once you push it.

  • Steve Yegge said Sonnet 5 dropped Opus 4.8's habit of inventing small complaints, a tiny detail that says more about review quality than another benchmark table.
  • RampLabs said the model ran more tests, spent more time in the harness, and emitted more tokens than Sonnet 4.6.
  • skeptrune said swapping a lighter support assistant from Sonnet 4.6 to Sonnet 5 made sense, but moving a heavier content-authoring agent from Opus to Sonnet would not.
  • skirano argued the model felt sharper on long-horizon work than the official benchmark gap suggested, while skirano's follow-up said the change showed up in the "working mind" more than in headline scores.
  • bridgemindai's car-wash test is silly evidence, but it does show the flavor of the release: people immediately started probing for plainspoken judgment, not just raw code output.

Where it shows up

The partner rollout was unusually broad for day one.

  • IDEs and coding agents:
  • Inference and gateways:
  • Consumer and agent products:
