updateJune 12, 2026

Users report GPT-5.5, Opus 4.7, and DeepSeek V4 limits on xhigh and 1M context

HN threads added rollout details: Opus 4.7 defaults to xhigh, DeepSeek V4-Pro hits rate limits and reasoning_content errors, and GPT-5.5 API access lagged launch. Use those details to tune token spend, harnesses, and model choice for 1M-context production work.

5 min read

Users report GPT-5.5, Opus 4.7, and DeepSeek V4 limits on xhigh and 1M context

TL;DR

the Opus 4.7 HN discussion says Claude Code now defaults to xhigh, while Anthropic's launch post says Opus 4.7 also ships a new tokenizer and more literal instruction following, so the model swap can change both token spend and harness behavior.
the GPT-5.5 HN discussion caught that API access lagged the product launch, and OpenAI's launch post was later updated on April 24 to say GPT-5.5 and GPT-5.5 Pro were live in the API.
the DeepSeek V4 discussion describes Flash as the model people could actually run, while the main HN thread on DeepSeek V4 says Pro was slower, more rate-limited, and harder to evaluate cleanly.
DeepSeek's launch summary makes 1M tokens the headline for both V4 models, Anthropic's Opus 4.7 summary keeps pricing flat while warning about higher token counts, and OpenAI's GPT-5.5 summary pairs its own 1M context push with a staged API rollout instead of day-one full exposure.

OpenAI slipped an April 24 update into the GPT-5.5 post to confirm API access after launch-day confusion. Anthropic's own release note says older prompts can misfire because Opus 4.7 follows instructions more literally. DeepSeek's preview note gives both V4 models 1M context, but the HN thread and a Hugging Face technical writeup put the real spotlight on serving efficiency and on Flash being the practical choice.

GPT-5.5 API timing

Introducing GPT-5.5: OpenAI's New Frontier Model for Professional Work

OpenAI released GPT-5.5 on April 23, 2026, as its most capable and intuitive model to date, designed for complex professional tasks like coding, research, and data analysis. The release includes GPT-5.5 and GPT-5.5 Pro, both backed by OpenAI's most rigorous safety evaluations and red-teaming protocols to date, particularly regarding cybersecurity and biology capabilities. The model features a 1 million token context window and was launched with immediate availability for ChatGPT Plus, Pro, Business, and Enterprise users. API availability was scheduled to follow shortly after the announcement.

OpenAI launched GPT-5.5 to paying ChatGPT users and Codex first, not to the API. the HN discussion called out that gap immediately, and OpenAI's launch post later added an April 24 note saying both GPT-5.5 and GPT-5.5 Pro were now available in the API.

That one-day correction mattered because the product surface and the developer surface were not the same launch. The same post that promised 1M context also said API deployments needed different safeguards, and OpenAI's system card was updated the next day with extra API-specific guardrail detail.

Discussion around GPT-5.5

Thread discussion highlights: - minimaxir on throughput and token efficiency: Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work... increasing token generation speeds by over 20%. - simonw on API access and Codex backdoor: This doesn't have API access yet... and that backdoor API has GPT-5.5. - 6thbit on benchmark comparison: Still far from Mythos on SWE-bench but quite comparable otherwise.

HN commenters also fixated on exposure paths. the main GPT-5.5 thread notes that Codex already had GPT-5.5 before the public API did, alongside reports of tighter usage limits and higher prices than earlier GPT-5.x releases.

Opus 4.7 xhigh

Anthropic Announces Claude Opus 4.7 with Enhanced Engineering and Agentic Capabilities

Anthropic announced the general availability of Claude Opus 4.7 on April 16, 2026. The model offers improvements in advanced software engineering, complex multi-step tasks, and instruction following compared to Opus 4.6. Key updates include an updated tokenizer, a new "xhigh" effort level for finer control over reasoning-latency tradeoffs, and enhanced self-verification capabilities. It is available across Claude products, the API, and major cloud platforms, maintaining previous pricing but noting potential token usage increases due to the new tokenizer and increased reasoning at high effort levels.

Anthropic positioned Opus 4.7 as a straightforward GA ship: same list price as 4.6, broader availability across Claude, API, Bedrock, Vertex AI, and Foundry, plus a new xhigh effort tier in the middle of the latency curve. The buried caveat is in the same launch post: the updated tokenizer can increase token usage, and older prompts may break because 4.7 follows instructions more literally.

Discussion around Claude Opus 4.7

Thread discussion highlights: - sallymander on stricter behavior on coding tasks: reports that 4.7 feels “a little more fussy” than 4.6 and refuses to modify code even when given Anthropic’s own Agentic SDK quick-start task - jimmypk on default effort and token spend: notes that Claude Code now defaults to `xhigh` and that the tokenizer overhead may push real token usage above 4.6-era estimates - simonw on adaptive thinking / summarized reasoning: says the new adaptive-thinking behavior is confusing and that human-readable reasoning summaries now require `display: "summarized"`

HN reports make that caveat less theoretical:

the discussion summary says Claude Code now defaults to xhigh, which changes the out-of-the-box cost and latency profile.
the main HN thread on Opus 4.7 surfaces complaints that 4.7 feels fussier on coding tasks and can refuse edits that 4.6 would make.
the same discussion says human-readable reasoning now needs display: "summarized", another small API surface change hidden behind a version bump.
the HN thread also quotes Anthropic's own note that prompt and harness retuning may be required.

The result is a stronger but less forgiving model. Opus 4.7 did not ship as a drop-in replacement, even though the price card barely moved.

DeepSeek V4 Flash

DeepSeek Announces V4 Preview Release with 1M-Token Context Efficiency

On April 24, 2026, DeepSeek announced the preview release and open-source availability of its DeepSeek-V4 model series. The release introduces two primary models: DeepSeek-V4-Pro (1.6T total parameters, 49B active) for flagship reasoning and agentic tasks, and DeepSeek-V4-Flash (284B total parameters, 13B active) for cost-effective performance. Both models feature a 1M-token context window as the new default across official services, enabled by a novel attention mechanism combining token-wise compression and DeepSeek Sparse Attention (DSA). The API is immediately available via existing base URLs by updating model parameters to deepseek-v4-pro or deepseek-v4-flash. Legacy models deepseek-chat and deepseek-reasoner will be retired after July 24, 2026.

DeepSeek shipped two very different V4 stories under one 1M-context banner. DeepSeek's preview note introduces V4-Pro at 1.6T total parameters with 49B active, and V4-Flash at 284B total with 13B active, with both exposed through new deepseek-v4-pro and deepseek-v4-flash model names. The same note also sets a retirement date, July 24, 2026, for deepseek-chat and deepseek-reasoner.

Discussion around DeepSeek v4

Thread discussion highlights: - rvz on paper / architecture: Highlights the technical report’s focus on manifold-constrained hyper-connections and the hybrid attention mechanism, and advises waiting for independent testing before trusting benchmarks. - simonw on user output comparison: Reports that the Flash model produced a better pelican image than Pro when tested through OpenRouter. - XCSme on third-party benchmarking and rate limits: Says official results look strong, but third-party benchmarks and personal tests do not place V4 at the top, and notes Pro is heavily rate-limited and prone to timeouts.

The community reaction was much less interested in the naming than in which model actually held up under load:

the main HN thread describes Flash as the practical option so far.
the discussion summary says Pro was heavily rate-limited and prone to timeouts.
the same thread includes a report that Pro needed reasoning_content to avoid API errors in at least one benchmark harness.
the discussion summary also notes that personal tests and third-party benchmarks did not clearly place V4 at the top, despite strong official numbers.

Flash getting the better early reputation is the weirdest part of the launch. Even HN commenters reported cases where Flash looked better than Pro on actual outputs, which is not how flagship-versus-cheap-tier rollouts usually read on day one.

Million-token serving

DeepSeek v4

Useful as a snapshot of an open-weight frontier-model release with unusually long context and a lot of attention on serving constraints. The most actionable signal is that Flash seems to be the practical option so far, while Pro is discussed as slower, rate-limited, and harder to evaluate reliably.

All three launches used 1M context as a headline spec, but the rollout details point to a more practical dividing line: serving and harness constraints. the DeepSeek V4 thread is unusually explicit that Flash felt usable while Pro hit rate limits; the Opus 4.7 thread warns that xhigh defaults and tokenizer changes can raise effective cost; and the GPT-5.5 thread centers on access paths, throughput tuning, and usage caps more than on raw context length.

The vendor materials back that up with different kinds of caveats:

Anthropic's launch post says the new tokenizer can map the same input to more tokens and that prompt behavior may change.
OpenAI's launch post says API serving required extra safeguards, while OpenAI's system card was updated a day later for the API rollout.
DeepSeek's preview note sells cost-effective 1M context, and Hugging Face's technical writeup argues the real novelty is the efficiency stack behind it, not the raw window size.

One concrete data point came from the GPT-5.5 thread: the HN summary quotes Codex as using traffic-shaped heuristic partitioning to lift token generation speed by more than 20%. That is a cleaner clue than any context-window slogan. In this batch of releases, the interesting story was not who printed 1M first, it was who could keep long runs affordable, available, and stable enough to matter.