Vals benchmarks Grok 4.20 Beta: ProofBench rises to 14% while legal tasks regress
Vals published a benchmark pass for Grok 4.20 Beta showing gains on coding, math, and multimodal evals, including Terminal Bench 2, alongside weaker legal-task results. Check task-level results before adopting it, especially if legal workflows matter more than headline benchmark gains.

TL;DR
- Vals says Grok 4.20 Beta (Reasoning) is an overall step up from Grok 4.1 Fast (Reasoning), with gains across coding, math, and multimodal evals in its latest benchmark pass Vals overview.
- The biggest task-level jump Vals called out was on ProofBench, where Grok 4.20 Beta reached 14% versus 4% for Grok 4.1 Fast ProofBench jump, while coding-oriented suites including LiveCodeBench, SWE-Bench, Terminal Bench 2, and Vibe Code Bench also improved coding gains.
- Multimodal results also moved up: Vals reports 83.47% and a #9 rank on MMMU, versus #31 for Grok 4.1 Fast, alongside improvement on SAGE for grading handwritten work multimodal gains.
- The same run showed weaker legal performance and beta-release caveats: Vals ranked the model #30 on CaseLaw and #62 on LegalBench legal regressions, and said the snapshot may change in later iterations beta caveat.
Where did Grok 4.20 actually improve?
Vals' benchmark pass says Grok 4.20 Beta improved over Grok 4.1 Fast on a broad set of engineering-relevant tests, including AIME, GPQA, IOI, LiveCodeBench, SWE-Bench, and Terminal Bench 2 coding gains. The same thread says Vibe Code Bench rose from "1% to ~4%," a small absolute number but still a measurable gain on a test for building web apps from scratch coding gains.
The clearest single delta was ProofBench. Vals says the model "jumps up to 14%" from 4% on a benchmark for formally verified graduate-level math proofs ProofBench jump. On the multimodal side, Vals reports 83.47% and a #9 MMMU ranking, up from #31 for Grok 4.1 Fast, plus improvement on SAGE for grading handwritten solutions multimodal gains.
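To make "formally verified" concrete: ProofBench-style tasks ask for proofs a proof assistant can check mechanically, so a wrong proof fails outright rather than earning partial credit. The Lean 4 snippet below is our own toy illustration, not an actual ProofBench item; the theorem name is made up, and `Nat.add_comm` is a standard library lemma.

```lean
-- Toy illustration (not a ProofBench task): a statement plus a proof term
-- that Lean's kernel checks mechanically. An incorrect proof simply fails
-- to compile, which is why scores on such benchmarks are hard to inflate.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```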
What are the deployment tradeoffs in this snapshot?
Vals says the model is "generally quite fast," with per-test latency on the Vals Index at roughly 20% to 50% of that of higher-performing models latency note. The benchmark card in Vals' post lists 85.58s latency, $0.28 cost per test, and a 2M-token context window and maximum output size for this snapshot benchmark card.
The same evaluation thread also flags two reasons not to over-read the headline gains. First, legal-task performance lagged: Vals reports rankings of #30 on CaseLaw and #62 on LegalBench, calling out "less improvement or even regressions" on legal benchmarks legal regressions. Second, these numbers come from a beta snapshot run at temperature 0.7 and top_p 0.95 through the xAI API endpoint grok-4.20-beta-0309-reasoning, and Vals says performance may change in later releases eval settings beta caveat.
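For teams that want to sanity-check the snapshot before committing to it, the sketch below replays Vals' reported sampling settings. It assumes the model is reachable through xAI's OpenAI-compatible chat completions API; the model id, temperature, and top_p come from Vals' writeup, while the `XAI_API_KEY` environment variable and the prompt are placeholders.

```python
# Minimal sketch: query the beta snapshot with the sampling settings Vals
# reports (temperature 0.7, top_p 0.95). Assumes xAI's OpenAI-compatible
# API; the key and prompt are placeholders, not part of Vals' setup.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],  # placeholder: supply your own key
    base_url="https://api.x.ai/v1",     # xAI's OpenAI-compatible base URL
)

response = client.chat.completions.create(
    model="grok-4.20-beta-0309-reasoning",  # beta snapshot named in Vals' run
    temperature=0.7,                        # sampling settings from Vals' eval
    top_p=0.95,
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response.choices[0].message.content)
```

Because this is a beta snapshot, results from such a check are a point-in-time reading, not a stable baseline.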