updateMarch 14, 2026

Claude Opus 4.6 ranks 78.3% on MRCR v2 at 1M tokens

Third-party MRCR v2 results put Claude Opus 4.6 at a 78.3% match ratio at 1M tokens, ahead of Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. If you are testing long-context agents, measure retrieval quality and task completion, not just advertised context window size.

Claude Code Claude Agent Readiness Benchmarks

3 min read

Claude Opus 4.6 ranks 78.3% on MRCR v2 at 1M tokens

TL;DR

Anthropic's 1M rollout post says Claude Opus 4.6 and Sonnet 4.6 now have a generally available 1M-token context window, and the same post adds support for up to 600 images in a prompt.
The MRCR chart puts Claude Opus 4.6 at 78.3% mean match ratio on MRCR v2 at 1M tokens, ahead of Sonnet 4.6 at 65.1%, GPT-5.4 at 36.6%, and Gemini 3.1 Pro at 25.9%.
Practitioner screenshots in CLI comparison and Claude Code screenshot suggest the rollout is already visible in Claude Code, where Opus 4.6 shows up as a 1M-context model on high paid tiers.
The early engineering takeaway from retrieval chart and follow-up reply is that long-context capacity only matters if retrieval still works at length; the benchmark is more informative than the headline token limit alone.

What shipped

Anthropic has moved 1M-token context from preview status into general availability for Claude Opus 4.6 and Claude Sonnet 4.6. In the same announcement thread, the rollout post says the models can ingest up to 600 images, which expands the practical input budget beyond text-heavy agent runs.

The rollout is already showing up in developer tooling. In a Claude Code screenshot, Opus 4.6 appears as "Opus 4.6 (1M context) · Claude Max," while a supporting roundup also describes the 1M window as generally available for both 4.6 models. That matters operationally because it turns long-context testing into something engineers can actually run inside coding workflows rather than a limited-access benchmark claim.

How strong is the 1M context in practice

The strongest signal in the evidence is not the 1M number but the retrieval curve. The MRCR v2 chart shows Opus 4.6 at 91.9% mean match ratio at 256K tokens and 78.3% at 1M, while Sonnet 4.6 drops to 65.1%, GPT-5.4 to 36.6%, and Gemini 3.1 Pro to 25.9%. The same chart notes Gemini's numbers were measured by Context Arena on the same benchmark rather than using vendor self-report, and OpenAI's score is a "bin average" across 128K-256K rather than a single fixed context length.

That distinction is why retrieval quality matters more than raw advertised window size. As one reply puts it, "If you can't get accurate retrieval," the benefit of a larger window is limited. The supporting comparison post also calls out GPT-5.4 as a "regression" on long-context behavior relative to an earlier OpenAI chart at 256K, reinforcing that context-window expansion does not automatically preserve recall.

What engineers are seeing in coding workflows

The most concrete field report here is a side-by-side CLI comparison. In that screenshot, Gemini CLI warns that sending a message "might exceed the context window limit," while Claude Code continues processing with Opus 4.6 at 1M context. The author claims Claude handled a larger project prompt that Gemini would not submit, which is anecdotal, but the screenshot does show a real difference in tool behavior under large inputs.

For code agents, that changes the failure mode. Instead of deciding which repo slice, docs subset, or prior conversation chunk to drop, teams can keep more of the working set in one session. Another practitioner post frames it as "entire codebases" and "complete conversation history" staying in context for complex refactors. Independent eval work is still catching up, though: one evaluator said his team hit "a couple speed bumps" while trying to get Opus 1M working in their own test setup.

🧾 More sources

TL;DR1 tweets

Top-line facts: GA rollout, benchmark ranking, tooling visibility, and the practical importance of retrieval quality.

What shipped1 tweets

Covers the product change itself and where engineers can see it in practice.

How strong is the 1M context in practice1 tweets

Groups the benchmark evidence and the caveat that token capacity without retrieval quality is less useful.

What engineers are seeing in coding workflows1 tweets

Field reports from coding tools and eval attempts that show how the larger window affects actual development workflows.

updateMarch 14, 2026

Claude Opus 4.6 ranks 78.3% on MRCR v2 at 1M tokens

Claude Code Claude Agent Readiness Benchmarks

3 min read

TL;DR

Anthropic's 1M rollout post says Claude Opus 4.6 and Sonnet 4.6 now have a generally available 1M-token context window, and the same post adds support for up to 600 images in a prompt.
The MRCR chart puts Claude Opus 4.6 at 78.3% mean match ratio on MRCR v2 at 1M tokens, ahead of Sonnet 4.6 at 65.1%, GPT-5.4 at 36.6%, and Gemini 3.1 Pro at 25.9%.
Practitioner screenshots in CLI comparison and Claude Code screenshot suggest the rollout is already visible in Claude Code, where Opus 4.6 shows up as a 1M-context model on high paid tiers.
The early engineering takeaway from retrieval chart and follow-up reply is that long-context capacity only matters if retrieval still works at length; the benchmark is more informative than the headline token limit alone.

What shipped

Kol Tregaskes

@koltregaskes

·Follow

This is huge, literally. 1M context is now generally available for Claude Sonnet/Opus 4.6 and it can ingest 600 images. So it can "watch" a video - take a video and split out the frames then upload. Not sure how big the images can be, probably small to leave room for a convo. Show more

Claude

@claudeai

1 million context window: Now generally available for Claude Opus 4.6 and Claude Sonnet 4.6.

12:30 PM · Mar 14, 2026

Read 4 replies

BridgeMind

@bridgemindai

·Follow

Claude Opus 4.6 with 1M context in Claude Code is a massive upgrade. 5x the context window we had before yesterday's update. Entire codebases. Full documentation. Complete conversation history. All in context. No more truncation. No more lost context mid-session. This changes Show more

12:25 PM · Mar 14, 2026

Read 6 replies

How strong is the 1M context in practice

Chubby♨️

@kimmonismus

·Follow

Anthropic consistently manages to impress, particularly with its ability to maintain attention and reasoning across extended contexts. Even outperforming ex-context-king Google, by a lot.

3:17 PM · Mar 14, 2026

352

Read 22 replies

Dan McAteer

@daniel_mac8

·Follow

Opus 4.6 is by far the best 1 million token context model out there at 78% accuracy at 1 mil toks. The only model that’s even close is Sonnet 4.6. GPT-5.4 is also a regression in long context compared to GPT-5.2 at 256k.

Claude

@claudeai

1 million context window: Now generally available for Claude Opus 4.6 and Claude Sonnet 4.6.

1:28 AM · Mar 15, 2026

137

Read 13 replies

What engineers are seeing in coding workflows

BridgeMind

@bridgemindai

·Follow

Claude Opus 4.6 with 1M context is a game changer. I dropped in 2.1 million tokens into Gemini CLI. It refused to submit.Same prompt into Claude. No problem. Already working. That's the difference. Claude just handles it.

8:23 PM · Mar 14, 2026

Read 4 replies

🧾 More sources

TL;DR1 tweets

Top-line facts: GA rollout, benchmark ranking, tooling visibility, and the practical importance of retrieval quality.

What shipped1 tweets

Covers the product change itself and where engineers can see it in practice.

How strong is the 1M context in practice1 tweets

Groups the benchmark evidence and the caveat that token capacity without retrieval quality is less useful.

What engineers are seeing in coding workflows1 tweets

Field reports from coding tools and eval attempts that show how the larger window affects actual development workflows.