AI Primer

Claude Code users report deleted tests, string-edit stalls, and higher spend

A day after Anthropic published its Claude Code postmortem, users kept reporting Opus 4.7 deleting tests, stalling on trivial edits, and burning more budget than expected. Claude Code 2.1.120 shipped more fixes, but teams are still rechecking prompts, settings, and model choice.


TL;DR

You can read Anthropic's postmortem, skim Simon Willison's short writeup, inspect the community thread on Hacker News, and diff the shipping client in the public Claude Code changelog mirror. The odd bit is timing: Anthropic said on April 23 that fixes for the earlier regressions were out, while bcherny separately said Opus 4.7 issues were still being worked on, and ClaudeCodeLog pushed another 22 CLI changes less than a day later.

Postmortem

Anthropic's public line is now clear: the harness changed, and users noticed before the company fully mapped the failure.

The three confirmed issues were:

  1. The default reasoning effort dropped from high to medium on March 4, then was reverted on April 7.
  2. From March 26 until April 10, sessions idle for more than an hour cleared older thinking on every turn rather than once.
  3. A prompt change on April 16, intended to reduce verbosity, hurt coding quality; it was reverted on April 20.

That timeline matches nummanali's summary and Simon Willison's summary of the postmortem. It also explains why complaints looked diffuse for weeks: users were hitting a reasoning downgrade, a session-memory bug, and a prompt regression at different times.

Opus 4.7 behavior

The postmortem did not close the book on current behavior. Anthropic explicitly said there were separate Opus 4.7 issues still under investigation.

The complaints clustered around a few very concrete failure modes:

  • mattpocockuk said /grill-me still led Claude Code to jump straight into implementation instead of waiting for alignment.
  • HilaShmuel said a simple string replacement took five minutes.
  • omarsar0's reply said outputs were strange, overly marketing-like, and sometimes told the user what to do instead of doing it.
  • Posts from yacineMTB and dexhorthy both surfaced the now-infamous test-deletion pattern.
  • mattshumer_ called Opus 4.7's mistakes worse than GPT-4o's on tasks he expected it to handle cleanly.

Not every hands-on report was negative. scaling01's long-context post said Opus 4.7 looked much better than 4.6 past 400K context, and dylan522p's spend chart claimed 4.7 cut Claude Code spend at SemiAnalysis. The split is part of the story now: better long-context behavior for some teams, worse harness behavior for others.

Spend

The other complaint that kept surfacing was not about raw model quality; it was budget burn.

Several threads line up on the same pattern:

  • zeeg said Opus 4.7 increased total spend, then separately said the usage charts themselves seemed inconsistent.
  • another zeeg post said a normal day of Claude use had already hit $50 in spend.
  • bridgemindai posted a screenshot of a Pro user who said one Opus 4.7 high-reasoning request exhausted the limit and triggered a 10-hour wait.
  • a Reddit Pro-vs-Max complaint said $100 in credits was disappearing in days, while the user's earlier Max plan had lasted the month.
  • a Reddit Pro-plan report described hitting the 5-hour limit on the first prompt of the day.

Community discussion had already been circling this before the postmortem. An HN thread on token comparisons collected reports that 4.7 felt only modestly better while burning limits much faster, and another HN thread on tokenizer costs focused on the token tax from coding-style edits that mostly echo code back.
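That token-tax complaint comes down to simple arithmetic. As a rough sketch (illustrative file contents and a crude chars-per-token estimate, not real tokenizer output or Claude Code internals), compare an edit style that echoes a whole file back against one that sends only the changed strings:

```python
# Illustrative sketch of the "token tax" from edits that echo code back.
# The chars/4 heuristic and the example file are assumptions for scale only.

def rough_tokens(text: str) -> int:
    # Common rule of thumb: roughly 4 characters per token for English/code.
    return max(1, len(text) // 4)

# A hypothetical 200-line source file.
source = "\n".join(f"def handler_{i}(): return {i}" for i in range(200))
old = "def handler_7(): return 7"
new = "def handler_7(): return 700"

# Style 1: rewrite the whole file, echoing every unchanged line back.
full_rewrite_cost = rough_tokens(source.replace(old, new))

# Style 2: send only the old/new strings for a targeted replacement.
targeted_edit_cost = rough_tokens(old) + rough_tokens(new)

print(full_rewrite_cost, targeted_edit_cost)
```

Under this back-of-the-envelope estimate the full-file rewrite costs two orders of magnitude more output tokens than the targeted edit, which is the shape of the complaint in those threads.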

2.1.120

Anthropic did ship another cleanup release immediately after the postmortem cycle.

The most concrete additions and fixes in 2.1.120 were:

  • claude ultrareview [target], a headless /ultrareview runner for CI and scripts that prints findings to stdout, with --json for raw output.
  • A fix for DISABLE_TELEMETRY and CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC, restoring telemetry opt-out behavior for API and enterprise users.
  • A fix for Bash find exhausting open file descriptors on large directory trees, which ClaudeCodeLog said could cause host-wide crashes on macOS and Linux native builds.
  • A fix for false-positive dangerous rm prompts in auto mode when multiline bash commands combined a pipe and a redirect.
  • A Windows fallback to PowerShell when Git Bash is absent.

That is a very practical list. It reads less like a flashy feature drop and more like a release aimed at steadying a shaky harness.
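The file-descriptor fix is the most classic bug in the list. A minimal Python sketch of the general failure class (not Anthropic's actual code, which per the notes was in the Bash find path): a directory walk that holds every directory handle open keeps one file descriptor per directory alive, which on a large tree can hit the process fd limit.

```python
# Sketch of fd exhaustion during a directory walk. Each os.scandir()
# iterator holds an open file descriptor until it is closed or collected.
import os
import tempfile

def walk_leaky(path, open_iters):
    it = os.scandir(path)            # opens one fd for this directory
    open_iters.append(it)            # keeping a reference keeps the fd open
    for entry in it:
        if entry.is_dir(follow_symlinks=False):
            walk_leaky(entry.path, open_iters)

def walk_safe(path):
    count = 0
    with os.scandir(path) as it:     # fd released when the block exits
        for entry in it:
            count += 1
            if entry.is_dir(follow_symlinks=False):
                count += walk_safe(entry.path)
    return count

# Build a small nested tree: root/sub0/sub1/sub2/sub3/sub4.
root = tempfile.mkdtemp()
d = root
for i in range(5):
    d = os.path.join(d, f"sub{i}")
    os.mkdir(d)

leaked = []
walk_leaky(root, leaked)
print(len(leaked))                   # one still-open iterator (and fd) per directory
for it in leaked:
    it.close()                       # the fix: close (or `with`) each handle

print(walk_safe(root))
```

On a tree with thousands of directories, the leaky variant runs into the per-process fd limit and starts failing mid-walk, which matches the host-level symptoms ClaudeCodeLog described.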

Prompt surface

The strangest signal in the release stream is how much prompt surface changed within 24 hours.

According to ClaudeCodeLog's 2.1.119 stats, version 2.1.119 added 216 prompt files and 155,048 prompt tokens. According to the next day's 2.1.120 stats, 2.1.120 then removed 148 prompt files and 91,956 prompt tokens.
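Netting those two releases out (a back-of-the-envelope check, assuming the two stat lines are directly comparable) shows the prompt surface is still larger than it was before 2.1.119:

```python
# Net prompt-surface change across 2.1.119 and 2.1.120,
# using the ClaudeCodeLog figures quoted above.
added_files, added_tokens = 216, 155_048      # added in 2.1.119
removed_files, removed_tokens = 148, 91_956   # removed in 2.1.120

net_files = added_files - removed_files
net_tokens = added_tokens - removed_tokens
print(net_files, net_tokens)   # 68 files and 63,092 prompt tokens still added
```

So even after the partial rollback, the harness carries roughly 68 more prompt files and 63K more prompt tokens than it did two versions earlier.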

The 2.1.120 metadata also pointed to four prompt-level changes:

  1. An auto-mode reviewer that evaluates classifier rules for clarity and completeness.
  2. A new conversation summarizer that emphasizes the user's requests and the assistant's prior actions.
  3. A prompt for concise session titles and git branch names.
  4. Checks that gate scheduling when prerequisites are not met.

Those prompt stats do not prove any single user complaint. They do show that Anthropic was still materially reshaping the harness immediately after publishing a postmortem about harness regressions.

Quotas

The quota story also stayed messy, and it ran on a separate track from the quality-regression timeline.

Anthropic reset subscriber usage limits on April 23, as koltregaskes and scaling01 both noted, and bcherny confirmed in-thread. But quota distrust had already spilled into pricing confusion a day earlier.

Simon Willison's pricing-page writeup documented Anthropic briefly showing Claude Code as a Max-tier feature, then backing away. In the evidence pool, koltregaskes described the change as a small test affecting about 2 percent of new prosumer signups, while the same thread contrasted Anthropic's ambiguity with OpenAI's clearer Codex availability messaging.

That means the product spent the same week dealing with three different trust problems at once: confirmed harness regressions, current Opus 4.7 complaints, and confusion over what plan even reliably gets you Claude Code access.
