Update · July 1, 2026

Claude Sonnet 5 ranks #3 on Vals and hits 183 turns on AA-Briefcase

Vals and Artificial Analysis published independent Sonnet 5 results a day after launch, placing it just behind Opus 4.8 and Fable 5 while using far more turns than Sonnet 4.6. Lower token pricing did not make agentic tasks cheaper, and some finance benchmarks still triggered refusals.

TL;DR

ValsAI's benchmark post put Claude Sonnet 5 at No. 3 on the Vals Index, behind only Opus 4.8 and Fable 5, with most of the gain coming from coding rather than finance work.
According to Artificial Analysis' launch evaluation, Sonnet 5 reached 53 on the Artificial Analysis Intelligence Index, +6 points over Sonnet 4.6, but still behind Opus 4.8 at 56.
On agentic knowledge-work tasks, Artificial Analysis' AA-Briefcase thread said Sonnet 5 max averaged 183 turns per task, more than 4x Sonnet 4.6 max, which pushed cost per task above Opus 4.8 despite lower token prices.
ValsAI's CorpFin note also found a refusal edge case: Sonnet 5 refused 15 of 858 CorpFin v2 tasks, mostly for "bio," even as ValsAI's split scores showed small gains outside coding.

You can jump from ValsAI's pricing and context note to Artificial Analysis' AA-Briefcase thread, and then over to Anthropic's 145-page system card. The weirdest split is that ValsAI's coding breakdown looks strong while Rohan Paul's system-card summary flags a sharp CyberGym regression. There is also a second cost story hiding under the launch price, because kilocode's pricing post says the new tokenizer uses 1 to 1.35x more tokens on the same input.

Vals Index

Vals put Sonnet 5 at 69.4% overall, just ahead of GPT 5.5 and behind Opus 4.8 at 70.4% and Fable 5 at 75.1%, per ValsAI's ranking post. The same post sized the jump versus Sonnet 4.6 at +8.5 points.

The gain was concentrated in coding. ValsAI's split scores reported 75.5% on SWE-bench Verified, 86.9% on Vibe Code Bench, and 74.5% on Terminal-Bench 2.1, while ValsAI's non-coding note said Finance Agent improved only 0.9% and CorpFin v2 only 1.3%.

AA-Briefcase turns

Artificial Analysis' new AA-Briefcase benchmark placed Sonnet 5 max at 1391 Elo, +312 points over Sonnet 4.6 max, and second only to Fable 5, according to Artificial Analysis' main thread. The same thread said lower effort settings were not Pareto-efficient against models such as Opus 4.8, GLM-5.2, and MiniMax-M3.

The expensive part was turn count. Artificial Analysis' turns post said Sonnet 5 max averaged 183 turns per task, more than 4x Sonnet 4.6 max, while medium effort still averaged 55 turns, roughly in line with Opus 4.8 at max effort. Artificial Analysis' scoring-dimensions post adds that Sonnet 5 ranked second on rubric score and analytical quality, but still trailed Opus 4.8 on presentation quality.

Cost per solved task

Artificial Analysis said Sonnet 5 costs $2.29 per Intelligence Index task at standard pricing, about 2x Sonnet 4.6 and roughly 15% more than Opus 4.8, even though the list price per token is lower, per Artificial Analysis' launch evaluation. The same evaluation also said Sonnet 5 used about 40% more output tokens per task than Sonnet 4.6.

That gap narrows temporarily under the intro price. ValsAI's model-details post and kilocode's pricing post both note a $2/$10 promotional rate through August 31, after which Sonnet 5 returns to $3/$15. But kilocode's pricing post adds a migration gotcha: the new tokenizer can consume 1 to 1.35x more tokens than Sonnet 4.6 on the same input.
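To see how the promotional rate and the tokenizer change pull in opposite directions, here is a rough sketch of the arithmetic. The $2/$10 and $3/$15 rates and the 1 to 1.35x range come from the posts above; the workload size (200,000 input tokens and 20,000 output tokens per task) is purely hypothetical, and applying the tokenizer multiplier to output tokens as well as input tokens is an assumption, since the pricing post only describes the same input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


PROMO = Price(2.0, 10.0)     # intro rate cited through August 31
STANDARD = Price(3.0, 15.0)  # rate Sonnet 5 returns to afterwards


def task_cost(price, input_tokens, output_tokens, token_multiplier=1.0):
    """Return the USD cost of one task.

    token_multiplier models tokenizer inflation relative to a baseline
    token count. Assumption: it scales input and output tokens alike.
    """
    scaled_in = input_tokens * token_multiplier
    scaled_out = output_tokens * token_multiplier
    total = scaled_in * price.input_per_mtok + scaled_out * price.output_per_mtok
    return total / 1_000_000


# Hypothetical workload, counted with the old tokenizer.
IN_TOKENS, OUT_TOKENS = 200_000, 20_000

for label, price in (("promo", PROMO), ("standard", STANDARD)):
    for multiplier in (1.0, 1.35):
        cost = task_cost(price, IN_TOKENS, OUT_TOKENS, multiplier)
        print(f"{label:8s} x{multiplier:<4} -> ${cost:.3f} per task")
```

Under these assumptions, the worst-case 1.35x multiplier lifts the standard-rate cost from $0.90 to about $1.215 per task, while the promotional rate at the same multiplier comes to about $0.81. The intro discount more than covers the tokenizer overhead in this toy case, but only until the rate reverts, and the sketch ignores the higher turn counts and output-token usage reported above.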

Wes Roth's framing is blunt but useful here: his post argues the operative metric is no longer price per million tokens, and his follow-up says Opus 4.8 can still finish the same task with fewer tokens and less total compute.

Refusals and regressions

Vals spotted 15 refusals in 858 CorpFin v2 tasks (about 1.7%), mostly triggered by "bio," according to ValsAI's refusal note. That sat alongside only modest gains on finance benchmarks in ValsAI's non-coding split.

A separate line of evidence points to a broader safety-performance tradeoff. Rohan Paul's system-card summary says Anthropic's system card shows Sonnet 5 at 52.7% on CyberGym versus 65.2% for Sonnet 4.6, alongside weaker browser exploitation results and a lower MASK lying rate of 3.1%.

Deliverables are still rough

Vals' new Excel Modeling Benchmark is a useful reality check. ValsAI's EMB results said no model was close to client-ready deliverables, with Opus 4.8 leading at 69.4% accuracy and Sonnet 5 following at 66.3%.

The bottleneck was numerical correctness, not surface polish. In the same EMB thread, Vals said formulas and presentation checks can pass while the underlying numbers are still wrong.