
Claude Opus 4.6 ranks 78.3% on MRCR v2 at 1M tokens

Third-party MRCR v2 results put Claude Opus 4.6 at a 78.3% match ratio at 1M tokens, ahead of Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. If you are testing long-context agents, measure retrieval quality and task completion, not just advertised context window size.


TL;DR

  • Anthropic's 1M rollout post says Claude Opus 4.6 and Sonnet 4.6 now have a generally available 1M-token context window, and the same post says the models now accept up to 600 images in a prompt.
  • The MRCR chart puts Claude Opus 4.6 at 78.3% mean match ratio on MRCR v2 at 1M tokens, ahead of Sonnet 4.6 at 65.1%, GPT-5.4 at 36.6%, and Gemini 3.1 Pro at 25.9%.
  • Practitioner screenshots (a CLI comparison and a Claude Code session) suggest the rollout is already visible in Claude Code, where Opus 4.6 shows up as a 1M-context model on the higher paid tiers.
  • The early engineering takeaway, from the retrieval chart and a follow-up reply, is that long-context capacity only matters if retrieval still works at length; the benchmark is more informative than the headline token limit alone.

What shipped

Anthropic has moved 1M-token context from preview status into general availability for Claude Opus 4.6 and Claude Sonnet 4.6. In the same announcement thread, the rollout post says the models can ingest up to 600 images, which expands the practical input budget beyond text-heavy agent runs.
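
For a concrete sense of what a 1M-token request looks like, here is a minimal sketch against the Anthropic Python SDK. The model id and the beta header are assumptions for illustration: earlier 1M previews were gated behind an `anthropic-beta` flag (e.g. `context-1m-2025-08-07` for Sonnet 4), so check the current docs for the exact identifiers.

```python
# Sketch: one very large prompt to a 1M-context Claude model.
# ASSUMPTIONS: the model id and beta flag below are illustrative, not confirmed.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("repo_dump.txt") as f:
    big_context = f.read()  # hundreds of thousands of tokens of source + docs

response = client.messages.create(
    model="claude-opus-4-6",  # assumed id
    max_tokens=4096,
    extra_headers={"anthropic-beta": "context-1m-2025-08-07"},  # assumed flag
    messages=[{
        "role": "user",
        "content": big_context + "\n\nSummarize the authentication flow.",
    }],
)
print(response.content[0].text)
```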

The rollout is already showing up in developer tooling. In a Claude Code screenshot, Opus 4.6 appears as "Opus 4.6 (1M context) · Claude Max," while a supporting roundup also describes the 1M window as generally available for both 4.6 models. That matters operationally because it turns long-context testing into something engineers can actually run inside coding workflows rather than a limited-access benchmark claim.

How strong is the 1M context in practice?

The strongest signal in the evidence is not the 1M number but the retrieval curve. The MRCR v2 chart shows Opus 4.6 at 91.9% mean match ratio at 256K tokens and 78.3% at 1M, while Sonnet 4.6 drops to 65.1%, GPT-5.4 to 36.6%, and Gemini 3.1 Pro to 25.9%. The same chart notes Gemini's numbers were measured by Context Arena on the same benchmark rather than using vendor self-report, and OpenAI's score is a "bin average" across 128K-256K rather than a single fixed context length.
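
MRCR's "match ratio" is a string-similarity score between the model's reproduction of a target needle and the reference answer; the published MRCR grader is described as using Python's SequenceMatcher. The sketch below mirrors that idea but is a simplification, not the official harness:

```python
# Simplified MRCR-style grading: mean match ratio over retrieval trials.
# Not the official benchmark harness; a sketch of the scoring idea.
from difflib import SequenceMatcher

def match_ratio(response: str, answer: str) -> float:
    """Similarity in [0, 1] between model output and the reference needle."""
    return SequenceMatcher(None, response, answer).ratio()

def mean_match_ratio(trials: list[tuple[str, str]]) -> float:
    """Average match ratio over (response, answer) pairs at one context length."""
    return sum(match_ratio(r, a) for r, a in trials) / len(trials)

trials = [
    ("Tapirs wade through rivers slow...", "Tapirs wade through rivers slow..."),
    ("Rivers run where tapirs roam...",    "Tapirs wade through rivers slow..."),
]
print(f"mean match ratio: {mean_match_ratio(trials):.3f}")  # partial credit for near misses
```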

That distinction is why retrieval quality matters more than raw advertised window size. As one reply puts it, "If you can't get accurate retrieval," the benefit of a larger window is limited. The supporting comparison post also calls out GPT-5.4 as a "regression" on long-context behavior relative to an earlier OpenAI chart at 256K, reinforcing that context-window expansion does not automatically preserve recall.
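
You can run the same check on your own stack with a depth-swept probe: plant a known needle at several depths in filler text, grow the haystack toward your target length, and score what comes back with a match-ratio function like the one above. This is a simpler single-needle setup than MRCR's repeated-needle design, and `ask_model` is a hypothetical stand-in for your own client call:

```python
# Depth-swept long-context probe (sketch). Token sizes use a crude chars/4
# estimate; `ask_model` is a placeholder, not a real API.
import uuid

FILLER = "The quick brown fox jumps over the lazy dog. "

def build_haystack(target_tokens: int, depth: float) -> tuple[str, str]:
    """Return (body, needle) with the needle at fractional `depth` in [0, 1]."""
    needle = f"The secret code is {uuid.uuid4().hex}."
    n = (target_tokens * 4) // len(FILLER)  # number of filler sentences
    cut = int(n * depth)
    return FILLER * cut + needle + " " + FILLER * (n - cut), needle

for tokens in (256_000, 512_000, 1_000_000):
    for depth in (0.1, 0.5, 0.9):
        body, needle = build_haystack(tokens, depth)
        prompt = body + "\n\nRepeat the sentence containing the secret code."
        # response = ask_model(prompt)            # your client call
        # print(tokens, depth, match_ratio(response, needle))
```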

What engineers are seeing in coding workflows

The most concrete field report here is a side-by-side CLI comparison. In that screenshot, Gemini CLI warns that sending a message "might exceed the context window limit," while Claude Code continues processing with Opus 4.6 at 1M context. The author claims Claude handled a larger project prompt that Gemini would not submit, which is anecdotal, but the screenshot does show a real difference in tool behavior under large inputs.
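
If your own tooling rejects oversized prompts the way Gemini CLI does in that screenshot, a pre-flight token count beats discovering the limit from an error. The Anthropic SDK exposes a token-counting endpoint; the model id and limit below are assumptions for illustration:

```python
# Pre-flight check: count tokens before submitting a large prompt.
# ASSUMPTIONS: model id and the 1M limit are illustrative values.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"   # assumed id
CONTEXT_LIMIT = 1_000_000   # advertised window from the rollout post

def fits(prompt: str, reply_budget: int = 8_192) -> bool:
    """True if prompt plus a reply budget fits inside the context window."""
    count = client.messages.count_tokens(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return count.input_tokens + reply_budget <= CONTEXT_LIMIT
```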

For code agents, that changes the failure mode. Instead of deciding which repo slice, docs subset, or prior conversation chunk to drop, teams can keep more of the working set in one session. Another practitioner post frames it as "entire codebases" and "complete conversation history" staying in context for complex refactors. Independent eval work is still catching up, though: one evaluator said his team hit "a couple speed bumps" while trying to get Opus 1M working in their own test setup.
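
Even at 1M tokens, keeping "more of the working set" in context still means choosing an order and a stopping point. Below is a hedged sketch of greedy repo packing under a token budget; chars/4 is a rough heuristic, not a real tokenizer, so in practice prefer a token-counting endpoint like the one above:

```python
# Greedy repo packing: concatenate source files until the estimated token
# count would blow the budget. Illustrative sketch only.
from pathlib import Path

def pack_repo(root: str, budget_tokens: int = 900_000) -> str:
    """Concatenate text files under `root`, stopping before the budget."""
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*.py")):  # widen the glob as needed
        text = path.read_text(errors="ignore")
        est = len(text) // 4                       # crude chars-per-token guess
        if used + est > budget_tokens:
            break                                  # or rank files and skip instead
        parts.append(f"### {path}\n{text}")
        used += est
    return "\n\n".join(parts)

context = pack_repo("./my-project")  # then prepend to the refactor prompt
```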

