Claude Opus 4.8 breaks modified-thinking flows with 400 errors
Fresh Hacker News comments and user posts report 400 errors tied to modified thinking blocks, along with a shorter prompt cache TTL and behavior some testers call a regression from 4.7. The issue matters because the same model is also powering one-prompt sites and games, so quota burn and client breakage are showing up alongside stronger creative output.

TL;DR
- Anthropic shipped Claude Opus 4.8 into Claude Code and the API at the same price as 4.7. Boris Cherny's launch post framed the headline gain as SWE-bench Pro moving from 64.3% to 69.2%, while ClaudeDevs' thread positioned 4.8 as the default Claude Code model.
- Fresh breakage reports in the main HN thread and the fresh discussion summary center on thinking and redacted_thinking blocks, with commenters quoting a 400 error that says those blocks in the latest assistant message "cannot be modified."
- Cost and latency tuning became part of the release story almost immediately, because ClaudeDevs' caching post advertised mid-conversation system instructions without breaking cache hits while om_patel5's token-burn thread claimed the prompt cache TTL had been quietly cut from 60 minutes to 5 minutes.
- The same rollout also added dynamic workflows, where ClaudeDevs' dynamic workflow post says Claude writes an orchestration script and fans work out to parallel subagents, a setup that bcherny's usage note says is token-intensive and meant for migrations, refactors, and batch fixes.
- Early hands-on reports split fast: petergyang's comparison found regular Claude better for writing than Claude Code, while danshipper's week of testing called 4.8 an unusually strong writer and knowledge-work model when run at higher reasoning levels.
You can read Anthropic's official Opus 4.8 announcement, the Claude Code workflows docs, the dynamic workflows launch post, and the mid-conversation system message docs. The weird part is how fast the story split in two: the HN thread filled up with 400 errors and regression complaints, while X filled with one-prompt websites, canvas tests, and people arguing that the same model feels different depending on which Anthropic surface wraps it.
What shipped
Anthropic's launch framing was compact. Opus 4.8 landed in Claude Code on day one, and ClaudeDevs' availability post says it was also available on Max, Team, Enterprise, and through the API, including Bedrock, Vertex AI, and Foundry.
The concrete changes attached to the launch were:
- SWE-bench Pro: 64.3% to 69.2%, according to bcherny's launch post
- same listed price as 4.7, per the same launch post
- more willingness to admit uncertainty and catch its own bugs, according to _catwu's recommendation
- updated Claude Code migration guidance, with /claude-api migrate now suggesting 4.8-tuned model strings and prompt changes, per ClaudeDevs' migration note
That honesty pitch was not a side detail. aakashgupta's read of Anthropic's chart argued the company left a losing benchmark row visible on purpose because the product claim was reliability about progress, not just a prettier scorecard.
Thinking blocks
The hardest concrete failure showing up in public discussion is the 400 error around modified reasoning traces. In the HN thread and the discussion extract, commenters quote the API error exactly: messages.3.content.56: thinking or redacted_thinking blocks in the latest assistant message cannot be modified.
That matters because 4.8 also shipped new prompt-state mechanics. ClaudeDevs' product note says you can insert a system-role message mid-conversation and have Claude treat it as authoritative from that point forward, while their paired caching post says those updates can keep hitting the prompt cache instead of invalidating it.
So the release combined two forces at once: more structured conversation state, and user reports that some existing client flows were now hitting strict validation around thinking blocks. The fresh HN summary says the new comments were heavier on breakage and regression reports than on new feature discovery.
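The quoted error suggests that a client which stores an assistant turn and later resends it must pass its thinking and redacted_thinking blocks back untouched. A minimal sketch of a pre-send check follows, assuming content blocks are plain dicts with a "type" field. The helper names and the sample field values ("signature", "sig-abc") are illustrative assumptions, not taken from the reports, and this is a client-side guard rather than a description of how the API validates requests.

```python
import copy

# Block types named in the quoted 400 error.
PROTECTED_TYPES = {"thinking", "redacted_thinking"}


def protected_blocks(content):
    """Return the thinking/redacted_thinking blocks of a message, in order."""
    return [block for block in content if block.get("type") in PROTECTED_TYPES]


def find_modified_thinking(original_content, outgoing_content):
    """Compare protected blocks in the stored response with the copy the
    client is about to resend. Returns a list of problems (empty if clean)."""
    before = protected_blocks(original_content)
    after = protected_blocks(outgoing_content)
    problems = []
    if len(before) != len(after):
        problems.append(f"block count changed: {len(before)} -> {len(after)}")
    for index, (old, new) in enumerate(zip(before, after)):
        if old != new:
            problems.append(f"protected block {index} ({old['type']}) was modified")
    return problems


# Hypothetical stored assistant turn (field values are made up).
original = [
    {"type": "thinking", "thinking": "Plan the edit.", "signature": "sig-abc"},
    {"type": "text", "text": "Here is the patch."},
]

# A seemingly harmless cleanup that still counts as a modification.
trimmed = copy.deepcopy(original)
trimmed[0]["thinking"] = trimmed[0]["thinking"].rstrip(".")

print(find_modified_thinking(original, original))  # no problems
print(find_modified_thinking(original, trimmed))   # one modified block
```

Running a check like this before each request would turn an opaque 400 into a local, debuggable failure, which is useful for clients that truncate, summarize, or re-serialize history between turns.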
Prompt cache
The cost story got more interesting than the launch copy. Anthropic's own developer posts pushed automatic caching docs and mid-conversation system-message docs, both aimed at keeping long sessions cheaper and faster.
Then om_patel5's thread about 1.15 billion input tokens surfaced a very different set of practitioner lessons:
- prompt cache TTL was "quietly changed" from 60 minutes to 5 minutes
- low effort on 4.8 could replace higher effort settings used on 4.7
- JSON burns more tokens than plain text or markdown tables
- long chats compound because every turn resends history
- codebase-wide context is a budget killer unless you aggressively trim files
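Two of the lessons above are simple arithmetic, and a rough sketch makes the scale visible. The model below is deliberately simplified and its numbers are invented for illustration: it assumes every turn resends the full history, that a cached prefix is reused whenever the next turn arrives within the TTL, and that each turn (hit or miss) refreshes the cache. Real billing and cache behavior may differ.

```python
def cumulative_input_tokens(turn_tokens):
    """Total input tokens sent when every turn resends the whole history.

    turn_tokens[i] is the number of new tokens (user plus assistant) that
    turn i adds. Turn i sends the sum of turns 0..i, so the total grows
    roughly quadratically with the number of turns.
    """
    total = 0
    history = 0
    for new_tokens in turn_tokens:
        history += new_tokens
        total += history
    return total


def cache_hits(gaps_minutes, ttl_minutes):
    """Count turns that arrive while the cached prefix is still live.

    gaps_minutes[i] is the idle time before the next turn. Assumes each
    turn refreshes the cache, so only the most recent gap matters.
    """
    return sum(1 for gap in gaps_minutes if gap <= ttl_minutes)


# 20 turns that each add 1,000 tokens of new material.
turns = [1000] * 20
print(sum(turns))                      # 20000 new tokens in total
print(cumulative_input_tokens(turns))  # 210000 tokens actually sent

# Idle time (minutes) between turns in a hypothetical working session.
gaps = [2, 4, 12, 3, 45, 6]
print(cache_hits(gaps, ttl_minutes=60))  # 6
print(cache_hits(gaps, ttl_minutes=5))   # 3
```

Under these assumptions a 20-turn chat sends about ten times more input than the new material it contains, and a move from a 60-minute to a 5-minute TTL halves the cache hits for a session with ordinary pauses. That is the mechanism behind the advice to trim context and keep chats short.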
The fastest public joke about 4.8 was quota pain. markproduct's quota gag put it as "said hello" and lost 3% of the month. It is a gag, but it fits the broader public theme: people were noticing token economics before they had a settled opinion on the model itself.
Dynamic workflows
The other half of the launch was a bigger harness around the model. Anthropic calls it dynamic workflows, and ClaudeDevs' description says Claude writes an orchestration script on the fly, then coordinates a large fleet of subagents in parallel.
Public descriptions of the feature converged on four mechanics:
- Mention "workflow" in the prompt, per ClaudeDevs' trigger description
- Claude writes a step-by-step orchestration plan, according to _catwu's explanation
- The system fans tasks out across many parallel agents, per _catwu's launch follow-up
- Claude verifies results before returning them, according to minchoi's demo summary
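The four mechanics above amount to a plan, fan-out, verify loop. The sketch below shows that generic pattern with asyncio and bounded concurrency. It is not Anthropic's implementation: run_subagent, verify, and run_workflow are hypothetical stand-ins, and the "work" is a trivial string transform so the example stays self-contained.

```python
import asyncio


async def run_subagent(task):
    """Hypothetical stand-in for one subagent; a real one would call a model."""
    await asyncio.sleep(0)  # yield control to simulate asynchronous work
    return {"task": task, "output": task.upper()}


def verify(result):
    """Hypothetical check: a result passes if its output is non-empty."""
    return bool(result["output"])


async def run_workflow(tasks, max_parallel=4):
    """Fan tasks out with bounded concurrency, then split results by verification."""
    limit = asyncio.Semaphore(max_parallel)

    async def bounded(task):
        # The semaphore caps how many subagents run at the same time.
        async with limit:
            return await run_subagent(task)

    # gather returns results in the same order as the input tasks.
    results = await asyncio.gather(*(bounded(task) for task in tasks))
    verified = [result for result in results if verify(result)]
    failed = [result["task"] for result in results if not verify(result)]
    return verified, failed


verified, failed = asyncio.run(run_workflow(["fix a.py", "fix b.py", ""]))
print([result["output"] for result in verified])  # outputs that passed the check
print(failed)                                     # tasks that need a retry
```

The max_parallel cap is the design choice worth noticing: every subagent carries its own context, so fan-out width multiplies token spend, which is consistent with the feature being described as token-intensive.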
Anthropic's own team was blunt about the tradeoff. bcherny's usage note says dynamic workflows are token-intensive and best saved for migrations, refactors, performance work, and batch bug fixes. The launch blog post makes the same pitch with bigger numbers, and _catwu's Bun example claims Jarred Sumner used the system on a roughly 750,000-line Zig-to-Rust port, with 99.8% of the test suite passing after 11 days.
That makes the 400-error complaints more consequential, not less. Opus 4.8 is not only a chat model refresh. It is also the planner inside a more stateful, more expensive, more orchestration-heavy coding surface.
Creative output
The creative angle on 4.8 is messier than the coding pitch, and more interesting. petergyang's comparison says the same model produced better writing in regular Claude than in Claude Code, likely because the surrounding default prompt is less coding-optimized.
At the same time, multiple public demos showed 4.8 being used for fast visual and frontend generation rather than debugging loops:
- viktoroddy's demo framed a one-prompt build as something anyone could reproduce with a shared prompt
- markproduct's pricing-section demo asked why teams were paying hundreds for work Claude could do in one prompt
- stevibe's side-by-side canvas test compared 4.7 and 4.8 directly on visual output
- minchoi's post said 4.8 passed the car-wash and strawberry tests
Some of the strongest praise also came from people testing outside the narrow coding benchmark frame. danshipper's thread says 4.8 scored well on Every's writing and deck-generation tasks, while trq212's reaction highlighted a "warm and collaborative" style and said workflows were the sticky part.
The counter-signal arrived just as fast. zoink's critique called 4.8 more judgmental, more hedged, and less curious than earlier Opus versions, and MishaTeplitskiy's repost pointed to odd output artifacts, including a random Russian word in one reply. For creative users, that split is probably the clearest early read on Opus 4.8: the model is producing stronger artifacts for some people, while the surrounding harness, token budget, and personality shifts are affecting the experience at least as much as the base capability bump.
Benchmarks and cutoffs
A last wrinkle came from people reading past the launch bullets. Everlier's model-card thread claimed the model card dropped MRCR, showed VendingBench numbers down substantially, reported GPQA Diamond lower, and said 4.8 needed about 30% more tool calls to reach only marginally better scores than 4.6 on one comparison.
Those claims sit next to a more ordinary but useful update: stevibe's cutoff-date test pegged Opus 4.8's knowledge cutoff around 2025-10-10 to 2025-10-14, later than the 2025-07-25 window they reported for 4.7.
That leaves the public picture of 4.8 in a very specific place. The official release sold a more honest, slightly stronger Opus with better Claude Code defaults. The community immediately stress-tested the token bill, the workflow harness, the writing surface, the benchmark omissions, and the new failure mode around modified thinking blocks.