AI Primer

Opus 4.8 users report token burn, failed tool calls, and DeepSWE gaps

Three days after Opus 4.8 launched, new tests and field reports added failed tool calls, Bash-specific breakdowns, and higher token burn to the complaint list. Users report materially worse cost and stability in long coding sessions, while DeepSWE and GBA Eval point in different directions.


TL;DR

Anthropic's launch post sold Opus 4.8 as a same-price upgrade with dynamic workflows in Claude Code, a faster mode, and better agentic reliability. You can check the live DeepSWE leaderboard, the DeepSWE methodology writeup, and the tiny but brutal GBA Eval benchmark. Anthropic also shipped an Opus 4.8-specific Claude Code fix in v2.1.156 hours after launch, and new GitHub issues kept piling up after that.

DeepSWE

DeepSWE is built for long-horizon repo work, not short patch tasks. Datacurve's methodology post says popular coding evals average about 120 lines of code per solution, while DeepSWE's reference solutions average 668 lines.

On the public board, GPT-5.5 xhigh led at 70% pass@1, $6.61 per task, 21 minutes, and 47k output tokens, while Opus 4.8 max landed at 58%, $12.58, 43 minutes, and 136k output tokens, according to the live leaderboard.

The awkward part is that Opus 4.8 still looks better than 4.7 inside the same benchmark. Datacurve's launch-day post said the default high setting scored 6 points above Opus 4.7 xhigh while lowering average cost per task, and koltregaskes' follow-up showed higher pass rates than 4.7 across max, xhigh, and high. The gap that kept dominating the conversation was not 4.8 versus 4.7. It was 4.8 versus GPT-5.5 on score, speed, and token efficiency at the top of the board.
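One rough way to combine the two top-of-board rows quoted above is cost per solved task: average cost per task divided by pass@1. The sketch below uses only the figures cited in this article. The metric and the function names are illustrative assumptions, not something the DeepSWE leaderboard publishes.

```python
# Leaderboard figures as quoted in this article: pass@1 as a fraction,
# average cost per task in USD, average output tokens per task.
RUNS = {
    "GPT-5.5 xhigh": {"pass_at_1": 0.70, "cost_usd": 6.61, "output_tokens": 47_000},
    "Opus 4.8 max": {"pass_at_1": 0.58, "cost_usd": 12.58, "output_tokens": 136_000},
}


def cost_per_solved_task(pass_at_1: float, cost_usd: float) -> float:
    """Average spend per successful task (an illustrative metric)."""
    if not 0 < pass_at_1 <= 1:
        raise ValueError("pass_at_1 must be in (0, 1]")
    return cost_usd / pass_at_1


def summarize(runs: dict) -> dict:
    """Return cost per solved task for each run, rounded to cents."""
    return {
        name: round(cost_per_solved_task(r["pass_at_1"], r["cost_usd"]), 2)
        for name, r in runs.items()
    }


def output_token_ratio(runs: dict, numerator: str, denominator: str) -> float:
    """How many times more output tokens one run emits than another."""
    return runs[numerator]["output_tokens"] / runs[denominator]["output_tokens"]


if __name__ == "__main__":
    for name, usd in summarize(RUNS).items():
        print(f"{name}: ${usd:.2f} per solved task")
    ratio = output_token_ratio(RUNS, "Opus 4.8 max", "GPT-5.5 xhigh")
    print(f"Output token ratio: {ratio:.2f}x")
```

On these numbers the Opus 4.8 max run costs roughly $21.69 per solved task against roughly $9.44 for GPT-5.5 xhigh, and emits about 2.9 times as many output tokens per task.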

Token burn

Multiple hands-on reports converged on the same complaint: Opus 4.8 burns through session limits fast.

The evidence splits into two buckets:

That only partially lines up with Anthropic's own launch framing. The company said in its announcement that pricing stayed flat and fast mode got cheaper, but a flat list price does not offset a model that emits far more tokens per task.
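The arithmetic behind that point is simple enough to sketch. Every number below is a hypothetical placeholder chosen for illustration (the per-token price, the token counts, and the session budget), not a published Anthropic price or a measured result.

```python
def task_cost_usd(output_tokens: int, usd_per_million_tokens: float) -> float:
    """Output-token cost of one task at a given list price."""
    return output_tokens / 1_000_000 * usd_per_million_tokens


def tasks_per_budget(session_token_budget: int, tokens_per_task: int) -> int:
    """How many whole tasks fit in a fixed session token budget."""
    return session_token_budget // tokens_per_task


# Hypothetical placeholders, not real prices or measurements.
PRICE_USD_PER_MILLION = 25.0   # same list price for both models
OLD_TOKENS_PER_TASK = 50_000   # older model, assumed
NEW_TOKENS_PER_TASK = 125_000  # newer model, assumed
SESSION_BUDGET = 1_000_000     # assumed session limit in tokens

if __name__ == "__main__":
    old_cost = task_cost_usd(OLD_TOKENS_PER_TASK, PRICE_USD_PER_MILLION)
    new_cost = task_cost_usd(NEW_TOKENS_PER_TASK, PRICE_USD_PER_MILLION)
    print(f"Cost per task: ${old_cost:.2f} vs ${new_cost:.2f}")
    print(
        "Tasks per session:",
        tasks_per_budget(SESSION_BUDGET, OLD_TOKENS_PER_TASK),
        "vs",
        tasks_per_budget(SESSION_BUDGET, NEW_TOKENS_PER_TASK),
    )
```

With the price held constant, cost per task scales directly with tokens emitted, and a fixed session budget covers proportionally fewer tasks.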

Tool failures

The roughest complaints were not about benchmark deltas. They were about Claude Code sessions breaking mid-run.

PerceptualPeak later narrowed one failure mode down to shell choice. In a follow-up, PerceptualPeak said Bash and Git Bash runs were the problem, while the same prompts behaved much better when forced to use PowerShell.

The GitHub issue queue shows the same pattern in more concrete form:

  • Issue #63604 said Opus 4.8 frequently emitted malformed tool_use JSON for MCP tools, causing the entire turn to be discarded even when 4.7 worked.
  • Issue #63792 described signed thinking blocks getting mutated during dynamic tool loading, which triggered 400 errors and strip-and-retry loops that consumed full extra turns.
  • Issue #64102 reported API disconnects, idle Bash calls like echo ping, and token-heavy wake-up attempts that still failed to complete tasks.
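For the malformed tool_use case in the first issue, a harness can in principle validate each tool call's arguments on its own and report a per-call error instead of dropping the whole turn. The sketch below is a hypothetical guard written for illustration. The names `ToolCallCheck` and `check_tool_call` are invented here, and nothing in the issues says Claude Code or any MCP client works this way.

```python
import json
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ToolCallCheck:
    """Result of validating one tool call's raw argument string."""
    ok: bool
    arguments: Optional[dict]
    error: Optional[str]


def check_tool_call(raw_arguments: str, required_keys: Set[str]) -> ToolCallCheck:
    """Parse a tool call's JSON arguments and check that required keys exist.

    Hypothetical helper: returns a structured error for a bad call so the
    caller can reject that call alone rather than discard the entire turn.
    """
    try:
        parsed = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return ToolCallCheck(False, None, f"invalid JSON: {exc.msg}")
    if not isinstance(parsed, dict):
        return ToolCallCheck(False, None, "arguments must be a JSON object")
    missing = sorted(required_keys - set(parsed))
    if missing:
        return ToolCallCheck(False, None, "missing keys: " + ", ".join(missing))
    return ToolCallCheck(True, parsed, None)


if __name__ == "__main__":
    # A well-formed call, then a truncated one of the kind the issue describes.
    print(check_tool_call('{"path": "README.md"}', {"path"}))
    print(check_tool_call('{"path": "README.md"', {"path"}))
```

The design choice is to keep validation local to each call, so one malformed payload produces an error the model can be shown and asked to correct.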

Anthropic's first post-launch Claude Code patch was narrowly scoped. The v2.1.156 release note, published early on May 29, says it fixed an Opus 4.8 issue where thinking blocks were modified and caused API errors. That did not stop the reports: theo's Ultracode complaint and theo's follow-up both described broken tool use later the same day.

Benchmark split

The public picture stayed messy because the non-DeepSWE tests did not point in one direction.

On GBA Eval, Opus 4.8 was reported as the new top model. The benchmark's own site says models have to build a Game Boy Advance emulator from scratch in WebAssembly and are graded against Mesen2, which is about as far from a toy repo patch as coding evals get.

DeepsecBench told a different story. cramforce's writeup said Opus 4.8 was cheaper than both 4.7 and Codex in that run, but exported fewer findings, hit fewer vulnerability clusters, and ranked issues more conservatively. That conservative streak also showed up in Anthropic-adjacent discussion around Vending-Bench and Blueprint-Bench 2, where WesRoth's summary of the system card said Anthropic removed 4.7 training focused on business skills after linking it to misaligned behavior.

So the short version is ugly but useful: Opus 4.8 looked stronger on some long-horizon coding and reliability-style evals, weaker on others, and unusually sensitive to the harness wrapped around it.

Launch-day knobs

The launch shipped more than a model swap.

According to Anthropic's launch post and the Claude Code 2.1.154 release notes, the main launch-day changes were:

  • Opus 4.8 replaced 4.7 at the same list price.
  • Claude.ai added user control over effort level.
  • Claude Code defaulted Opus 4.8 to high effort, with /effort xhigh for harder tasks.
  • Claude Code added Dynamic Workflows, which Anthropic described as orchestration across tens to hundreds of background agents.
  • Fast mode for Opus 4.8 was introduced as a research preview on the API, and Anthropic said it runs at 2.5 times the speed.
  • The API docs say Opus 4.8 supports a 1 million token context window by default on Anthropic's API, Bedrock, and Vertex AI, and the model update notes list a lower minimum cacheable prompt length of 1,024 tokens.

Those knobs help explain why field reports got noisy so quickly. Opus 4.8 was launched together with new effort controls, Dynamic Workflows, and fast mode, so users were not just comparing one model to another. They were comparing whole harnesses, with new defaults, new orchestration paths, and new ways to spend tokens fast.
