Claude Code reports Opus 4.6 quality drop as BridgeBench retest falls to 68.3%
Fresh retests and issue threads point to worse Claude Code behavior, with Opus 4.6 falling to 68.3% on BridgeBench and users surfacing buried reasoning-effort controls. Track quota burn, hidden effort settings, and rollback reports before assigning more coding-agent work.

TL;DR
- BridgeMind's retest says Claude Opus 4.6 fell from 83.3% to 68.3% accuracy on BridgeBench's hallucination benchmark in one week, dropping from #2 to #10 on the leaderboard.
- In issue #42796, one of the biggest Claude Code bug threads this month, a power user tied worse complex-engineering behavior to a sharp drop in visible thinking depth across 17,871 thinking blocks and 234,760 tool calls.
- According to pvncher's reply and a Hacker News comment from Anthropic's Boris, the effort control is now buried, global, and wrapped in Opus 4.6's adaptive thinking behavior, with medium effort becoming the default on March 3.
- A fresh quota bug report says a Pro Max 5x user exhausted quota in 1.5 hours under moderate usage, while issue #46829 claims cache TTL may have silently shrunk from one hour to five minutes.
You can compare the BridgeBench leaderboard, read the sprawling issue #42796, and check Anthropic's own common workflows doc, which still says extended thinking is enabled by default. There is also a March issue about Opus always reopening at medium effort, plus Anthropic's Pro and Max help article, which confirms Claude and Claude Code share the same subscription limits.
BridgeBench
The cleanest new datapoint is the retest. BridgeMind's post shows the same model name, Claude Opus 4.6, scoring 83.3% accuracy and 16.7% fabrication in the earlier run, then 68.3% accuracy and 33.0% fabrication in the April 12 rerun.
The benchmark is narrow, covering only code-analysis hallucination, but the direction lines up with the complaints. haider1 said Opus 4.5 was testing more and breaking less, while 0xblacklight argued the regression looks like a mix of lower default reasoning effort and weaker instruction adherence in the 1M-context model.
Issue 42796
[MODEL] Claude Code is unusable for complex engineering tasks with the Feb updates
1.4k upvotes · 753 comments
The deepest evidence is still issue #42796. The author mined 6,852 Claude Code session files and argued that estimated thinking depth fell about 67% by late February, before thinking redaction fully rolled out in March.
The same report measured behavior changes that are easier to reason about than hidden chain-of-thought. Its read-to-edit ratio fell from 6.6 to 2.0, edits without prior reads rose from 6.2% to 33.7%, and full-file writes roughly doubled. Those are ugly numbers for anyone using Claude Code as a long-running coding agent instead of a chat tab.
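The report's behavioral metrics can be recomputed from tool-call logs in a few lines. The sketch below shows one way to derive a read-to-edit ratio and a blind-edit percentage from a stream of tool-call events; the event shape (`tool` and `file` fields) is a hypothetical stand-in, since the actual structure of Claude Code session files is not described in the article.

```python
def behavior_metrics(events):
    """Compute coding-agent behavior metrics from tool-call events.

    events: iterable of dicts like {"tool": "Read"|"Edit"|"Write", "file": path}.
    Field names are assumptions, not Claude Code's real session schema.
    """
    reads = edits = blind_edits = 0
    seen_reads = set()  # files this session has read so far
    for ev in events:
        if ev["tool"] == "Read":
            reads += 1
            seen_reads.add(ev["file"])
        elif ev["tool"] in ("Edit", "Write"):
            edits += 1
            if ev["file"] not in seen_reads:
                blind_edits += 1  # edit with no prior read of that file
    return {
        "read_to_edit": reads / edits if edits else float("inf"),
        "blind_edit_pct": 100 * blind_edits / edits if edits else 0.0,
    }
```

On the issue's numbers, a read-to-edit ratio falling from 6.6 to 2.0 and blind edits rising from 6.2% to 33.7% would both show up directly in this kind of tally, which is what makes them easier to argue about than hidden chain-of-thought.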
Anthropic did not accept the redaction theory as the root cause. In Boris's HN comment, a Claude Code team member said the redact-thinking-2026-02-12 header is a UI change that hides thinking from users and should not affect reasoning itself.
Effort settings
The explanation Anthropic did offer centers on two February changes. In that same HN comment, Boris said Opus 4.6 switched to adaptive thinking on February 9, and medium effort, value 85, became the default on March 3.
Anthropic's own common workflows doc says extended thinking is enabled by default and that Opus 4.6 and Sonnet 4.6 dynamically allocate reasoning based on the effort level setting. The settings doc also makes clear that Claude Code has a global user scope, which fits pvncher's observation that the effort setting is global rather than model-specific.
A separate March issue about effort persistence complained that Opus always reopened at medium reasoning effort and did not remember a max-effort preference. That makes the buried-setting theory look less like a one-day pile-on and more like a preexisting product choice that only became visible once users started comparing Opus 4.6 against 4.5.
Quota burn
[BUG] Pro Max 5x Quota Exhausted in 1.5 Hours Despite Moderate Usage · Issue #45756 · anthropics/claude-code
577 upvotes · 524 comments
Quality complaints and cost complaints are now tangled together. In issue #45756, a Pro Max 5x subscriber said moderate Claude Code usage burned through quota in about 1.5 hours, with estimated effective usage near 8.7 million tokens per hour if cache_read is charged at a one-tenth rate.
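The "effective usage" figure rests on one assumption: that cache reads count against quota at a discounted rate. A minimal sketch of that arithmetic, with the issue's one-tenth weight as the default (Anthropic's actual quota accounting is not public):

```python
def effective_tokens(input_tokens, output_tokens, cache_read_tokens,
                     cache_read_weight=0.1):
    """Effective token usage if cache reads are charged at a discounted rate.

    The 0.1 default mirrors the one-tenth assumption in issue #45756;
    the real weighting is an unknown.
    """
    return input_tokens + output_tokens + cache_read_weight * cache_read_tokens
```

Because cache reads dominate long agent sessions, even a small change in that weight (or in how often the cache misses) swings the hourly total by millions of tokens, which is why the TTL report below kept coming up in the quota thread.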
Anthropic's help center article on Pro and Max plans confirms that Claude and Claude Code draw from the same subscription bucket. So when users say a bad coding session wastes limit fast, they are talking about the same capacity they would otherwise spend in the app.
The community discussion has become half bug triage, half market signal. One HN commenter said rolling back to Claude Code 2.1.34 eased quota burn, while a fresh delta on the older HN thread notes at least one power user had already moved to Codex.
Cache TTL
A separate report adds one more concrete mechanism for rising bills. Issue #46829 claims Claude Code's prompt cache TTL silently regressed from one hour to five minutes in early March, based on JSONL logs spanning January 11 through April 11.
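The TTL claim is the kind of thing anyone with their own logs can sanity-check: find the longest gap between a request and a later request that still hit the cache. A real TTL must be at least that long. A sketch of that bound, assuming time-ordered JSONL records with an ISO timestamp and a cache-read token count (hypothetical field names, not Claude Code's actual log schema):

```python
from datetime import datetime

def max_cache_hit_gap(records):
    """Largest gap (seconds) preceding a request that still hit the cache.

    records: time-ordered dicts like
        {"ts": "2026-03-10T00:04:00", "cache_read_tokens": 500}.
    Field names are assumed stand-ins for Claude Code's JSONL logs.
    The true cache TTL must be at least the returned value.
    """
    max_gap = 0.0
    prev_ts = None
    for rec in records:
        ts = datetime.fromisoformat(rec["ts"])
        if prev_ts is not None and rec["cache_read_tokens"] > 0:
            max_gap = max(max_gap, (ts - prev_ts).total_seconds())
        prev_ts = ts
    return max_gap
```

If this bound sat near an hour in January logs and collapsed to under five minutes in March logs, that would match the regression the issue describes; it only gives a lower bound, so it cannot prove the TTL was ever exactly one hour.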
That issue was closed as not planned, but its timeline overlaps with the broader regression window. It also matches the references in the quota bug thread to duplicate reports about cache accounting and TTL behavior, which is why the current Claude Code debate feels less like one benchmark fight and more like several small systems drifting at once.
🧾 More sources
BridgeBench · 2 tweets · Benchmark retest plus user reactions comparing Opus 4.6 against 4.5.
Fresh discussion on Issue: Claude Code is unusable for complex engineering tasks with Feb updates