Anthropic updates Opus 4.7 adaptive thinking after SimpleBench and BullshitBench regressions
After launch-day complaints about refusals, token burn, and weaker follow-through, Anthropic says Opus 4.7's bugs are fixed and adaptive thinking now triggers more often. Arena rankings improved, but SimpleBench and BullshitBench scored below Opus 4.6, and some users reverted to 4.6 for certain tasks.

TL;DR
- Anthropic shipped Opus 4.7 with a new tokenizer, a new xhigh effort level, and adaptive thinking as the only supported thinking mode for the model; the adaptive thinking docs say API callers must explicitly set thinking: {type: "adaptive"} and can no longer use a fixed budget_tokens.
- Launch-day complaints about weak follow-through, refusals, and odd behavior got an explicit acknowledgement when alexalbert__'s bug-fix note said many of the first-day bugs were fixed, and emollick's follow-up reported adaptive thinking was already triggering much more often a few hours later.
- The early benchmark picture split hard: arena's Code Arena post put Opus 4.7 at #1 for agentic webdev, while AiBattle_'s Simple-Bench post and petergostev's BullshitBench post both showed regressions versus Opus 4.6.
- The tokenizer change became the immediate operator headache. natolambert's migration screenshot highlighted Anthropic's own 1.0 to 1.35x token warning, and badlogicgames' count_tokens test showed a real file jumping from 1,091 to 1,454 input tokens.
- Hands-on reports stayed mixed even after the fixes: jeremyphoward's early reaction said 4.7 finally "gets" his work, while GergelyOrosz's thread and nummanali's revert post both went back to 4.6 for some tasks.
You can read Anthropic's launch post, the separate Claude Code best-practices post, and the adaptive thinking docs. The most useful community thread is still the main HN discussion, and Anthropic also pushed a same-day Claude Code changelog fix after auto mode started telling people Opus 4.7 was temporarily unavailable.
Adaptive thinking
Anthropic framed Opus 4.7 as a better model for long-running work, with more precise instruction following and self-verification, but the biggest workflow change was hidden in how the model decides to think. In the API docs, Opus 4.7 no longer accepts manual extended-thinking budgets; adaptive thinking is the only supported thinking mode.
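In concrete terms, a request that used to pin a fixed thinking budget would instead opt into adaptive mode. A minimal sketch of the request-body change, assuming the parameter shapes the docs are reported to describe (the model IDs and the adaptive type are taken from this article, not from an official SDK reference):

```python
# Sketch of the request-body change described above: a fixed
# extended-thinking budget on older models vs. the adaptive-only
# shape reported for Opus 4.7. Field names follow the article's
# account of the docs; treat them as assumptions.

def thinking_config(model: str) -> dict:
    """Return the thinking block this article says each model expects."""
    if model.startswith("claude-opus-4-7"):
        # 4.7: adaptive is the only supported thinking mode, and
        # budget_tokens is reportedly no longer accepted.
        return {"type": "adaptive"}
    # Older models: manual extended-thinking budget still allowed.
    return {"type": "enabled", "budget_tokens": 8192}

old_request = {
    "model": "claude-opus-4-6",
    "max_tokens": 16000,
    "thinking": thinking_config("claude-opus-4-6"),
    "messages": [{"role": "user", "content": "Summarize this repo."}],
}

new_request = {
    "model": "claude-opus-4-7",
    "max_tokens": 16000,
    "thinking": thinking_config("claude-opus-4-7"),
    "messages": [{"role": "user", "content": "Summarize this repo."}],
}
```

The point of the shape change is that the budget decision moves from the caller to the model, which is exactly what the launch-day complaints were about.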
That landed badly for people who were used to forcing deep reasoning on every turn. emollick's launch-day complaint said 4.7 regularly chose low effort on analysis and research tasks, while Yuchenj_UW's web app screenshot showed the same frustration from Claude web users who could only pick adaptive or non-thinking.
Anthropic adjusted quickly. alexalbert__'s bug-fix note said many first-day issues were fixed, and emollick's follow-up said adaptive thinking was then firing much more often, including on tasks it had failed the day before.
Token burn
The launch post kept list pricing flat at $5 per million input tokens and $25 per million output tokens, but Anthropic's own migration note said the same input can map to roughly 1.0 to 1.35x as many tokens under the new tokenizer.
That showed up immediately in user measurements. In badlogicgames' terminal test, the same README counted as 1,091 input tokens on Opus 4.6 and 1,454 on Opus 4.7. The Claude Code best-practices post adds a second source of inflation: 4.7 tends to think more on later turns in long sessions, especially at higher effort levels.
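Those two measurements line up with Anthropic's warning band, and the effective cost impact follows directly from the ratio. A quick back-of-the-envelope check (pure arithmetic on the figures above, no API calls):

```python
# Check badlogicgames' count_tokens measurement against Anthropic's
# stated 1.0-1.35x tokenizer inflation range, and what it implies
# for input cost at unchanged list pricing.

OPUS_46_TOKENS = 1091        # same README under the old tokenizer
OPUS_47_TOKENS = 1454        # same README under the new tokenizer
INPUT_PRICE_PER_MTOK = 5.00  # list price, flat at $5/M input tokens

ratio = OPUS_47_TOKENS / OPUS_46_TOKENS
assert 1.0 <= ratio <= 1.35  # inside the migration note's warning band

old_cost = OPUS_46_TOKENS / 1_000_000 * INPUT_PRICE_PER_MTOK
new_cost = OPUS_47_TOKENS / 1_000_000 * INPUT_PRICE_PER_MTOK

print(f"inflation: {ratio:.2f}x")  # ~1.33x, near the top of the band
print(f"effective input-cost increase: {new_cost / old_cost - 1:.0%}")
```

Flat per-token pricing plus a ~1.33x token count is, in effect, a ~33% input-price increase on this particular file.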
Claude Code amplified that by raising the default effort level to xhigh, which mattpocockuk's screenshot surfaced on launch day and the best-practices post confirms. Anthropic answered the immediate quota pain with bcherny's rate-limit post, saying subscriber limits were increased to offset the extra thinking-token use.
The complaints kept coming anyway. bridgemindai's launch-day rate-limit screenshot hit 100 percent session usage in two hours, and bridgemindai's later usage post said three prompts burned 13 percent of a session.
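bridgemindai's second report implies a rough per-session budget, assuming (optimistically) that usage scales linearly per prompt, which adaptive thinking makes unlikely in practice:

```python
# Rough session-budget math from bridgemindai's usage post:
# 3 prompts consumed 13% of a session's allowance. This assumes
# every prompt burns the same amount, which thinking-heavy turns
# in long sessions will violate.

prompts = 3
usage_fraction = 0.13

per_prompt = usage_fraction / prompts      # ~4.3% of a session per prompt
prompts_per_session = int(1 / per_prompt)  # ~23 prompts before hitting 100%

print(f"{per_prompt:.1%} per prompt, ~{prompts_per_session} prompts per session")
```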
Benchmarks
The cleanest summary of Opus 4.7 is that it gained in agentic coding and lost in several "don't get fooled" style evals.
According to arena's Code Arena leaderboard, Opus 4.7 scored 1583 in agentic webdev, up 37 Elo over Opus 4.6 and about 130 points over GPT-5.4-high. arena's Text Arena thread also put Opus 4.7 Thinking at #1 overall in Text Arena and #1 in the Expert and Coding slices.
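To put those margins in perspective, Elo gaps map to head-to-head win probabilities via the standard logistic formula (this is generic Elo math, not arena's published methodology):

```python
# Convert the reported Elo margins into expected head-to-head win
# probabilities using the standard 400-point logistic curve.

def elo_win_prob(delta: float) -> float:
    """Expected win probability for the higher-rated model."""
    return 1.0 / (1.0 + 10 ** (-delta / 400))

# Margins reported in arena's Code Arena leaderboard post:
print(f"vs Opus 4.6 (+37 Elo):   {elo_win_prob(37):.1%}")   # ~55%
print(f"vs GPT-5.4-high (+130):  {elo_win_prob(130):.1%}")  # ~68%
```

A 37-point gain is real but modest: roughly a 55/45 edge over Opus 4.6 in pairwise comparisons, versus a much more decisive margin over GPT-5.4-high.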
The regressions were just as visible:
- Simple-Bench: 67.6% on Opus 4.6, 62.9% on Opus 4.7, per AiBattle_'s Simple-Bench post.
- BullshitBench clear pushback: 83% on Opus 4.7 non-thinking and 74% on Opus 4.7 Max, per petergostev's BullshitBench post.
- Extended NYT Connections: 41.0 for Opus 4.7 high reasoning and 15.3 for no reasoning, per LechMazur's Connections benchmark.
- Thematic Generalization: Lech Mazur's first run showed 80.6 to 72.8 from Opus 4.6 high reasoning to Opus 4.7 high reasoning, then his xhigh run still failed to recover the loss, per LechMazur's thematic benchmark thread.
Anthropic's own chart already hinted that the gains were uneven. In the launch graphic, claudeai's benchmark chart showed BrowseComp and CyberGym slightly below Opus 4.6 even while coding, OSWorld, finance, GPQA, and vision moved up.
Hands-on reports
The hands-on reactions were unusually polarized for a point release. jeremyphoward's early reaction called 4.7 the first model that "gets" what he is doing, and bridgemindai's two-Max-plans post said the coding output was unmatched despite the token burn.
A second cluster said the model had become too literal, too argumentative, or too eager to stop and ask permission. nummanali's revert post said 4.7 had turned into an instruction follower and lost some planning instinct, while pvncher's complaint described constant second-guessing instead of execution.
The HN thread converged on the same split. Fresh discussion in the HN delta collected reports of better performance on real Vue, Nuxt, and Supabase projects, but also complaints about two to three minute waits, forced adaptive thinking, and settings tweaks like CLAUDE_CODE_DISABLE_1M_CONTEXT or --thinking-display summarized to get the tool back into a workable shape.
One detail buried in third-party analysis is that some of the benchmark disagreement may come from what the model optimizes for. ArtificialAnlys' thread said Opus 4.7 reduced hallucination in AA-Omniscience by abstaining more often, cutting hallucination rate from 61% to 36% with accuracy largely unchanged. That helps explain why a model can look better in abstention-heavy evals and worse in benches that count refusals as outright failures.
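The scoring interaction is easy to see with a toy grader: a metric that only scores attempted answers rewards 4.7-style abstention, while one that counts refusals as misses penalizes it. The counts below are purely hypothetical illustrations, not AA-Omniscience or BullshitBench data:

```python
# Toy illustration of why more abstention can improve one eval and
# hurt another. Answers are labeled "correct", "wrong", or "abstain".
# All counts are invented for illustration.

def hallucination_rate(answers):
    """Wrong answers as a share of attempted (non-abstained) answers."""
    attempted = [a for a in answers if a != "abstain"]
    return sum(a == "wrong" for a in attempted) / len(attempted)

def strict_accuracy(answers):
    """Correct answers over ALL questions: abstentions count as failures."""
    return sum(a == "correct" for a in answers) / len(answers)

eager = ["correct"] * 40 + ["wrong"] * 50 + ["abstain"] * 10
cautious = ["correct"] * 38 + ["wrong"] * 22 + ["abstain"] * 40

# Abstaining more cuts the hallucination rate...
assert hallucination_rate(cautious) < hallucination_rate(eager)
# ...while strict accuracy, which scores refusals as misses, drops too.
assert strict_accuracy(cautious) < strict_accuracy(eager)
```

The same behavioral shift moves the two metrics in opposite directions, which is the pattern the benchmark split above shows.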
Claude Code fixes
The rollout mess was not just model behavior. The product surface around 4.7 was changing at the same time.
The Claude Code changelog shows a fast patch cadence. The 2.1.112 release note says the only change was fixing the message that claimed claude-opus-4-7 was temporarily unavailable in auto mode. One release later, the 2.1.113 changelog added another Opus 4.7-specific fix for Bedrock users hitting thinking.type.enabled is not supported errors.
The thinking-summary path also changed underneath developers. The adaptive thinking docs make clear that Opus 4.7 uses adaptive thinking blocks, and badlogicgames on summarized thinking flagged the new display requirement early. By the end of the day, nummanali's workaround post was passing around claude --thinking-display summarized, while the linked GitHub issue documented missing summaries in the VS Code extension for Opus 4.7.
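Taken together with the HN-thread tweaks, the circulating workarounds amount to a short invocation. This is a sketch of the flags and variable as users reported them, not documented Anthropic syntax; verify against your installed Claude Code version:

```shell
# Workarounds reported by users (HN thread, nummanali's post).
# Names are as reported, not verified against official docs.

# Opt out of the 1M-token context window to curb token burn:
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1

# Show summarized thinking blocks, which the VS Code extension
# reportedly stopped displaying for Opus 4.7:
claude --thinking-display summarized
```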
Anthropic was also shipping new Claude Code workflows around the model, not just patches. bcherny's dogfooding thread introduced auto mode for Max users, recaps for long-running sessions, /effort for the new xhigh setting, and a /fewer-permission-prompts skill. That made the Opus 4.7 rollout feel like two releases jammed together: a new model, and a new preferred harness for using it.