AI Primer

Chandra reports Mistral OCR 4 scores are not reproducible and publishes repro scripts

Chandra's developer said Mistral OCR 4 launch numbers for both Chandra and OCR 4 could not be reproduced with public code, and published scripts to show the gaps. The dispute matters because Mistral OCR 4 launched on leaderboard claims, and benchmark settings now directly affect model selection.

TL;DR

  • VikParuchuri's thread says Mistral OCR 4's launch comparison understated Chandra OCR 2 and left out Infinity Parser, which he says scores 87.6% on olmOCR.
  • According to VikParuchuri's reproduction post, public Chandra code lands at about 84.7% on olmOCR and rises to 85.8% with formatting cleanup and high DPI, while his API test put Mistral OCR 4 at 82.9% with stock settings versus Mistral's claimed 85.2%.
  • VikParuchuri responded by publishing a repro script, and a follow-up correction says Chandra also lowered its own published score from 85.9% to a reproducible 85.8% after finding a normalization mismatch.
  • Community replies from bclavie and VikParuchuri's later comment pushed on a second issue, a reportedly modified benchmark ground truth that has not been released publicly.
  • A separate benchmark from Jerry Liu's ParseBench post painted a different picture: Mistral OCR 4 looked strong on semantic formatting and cheap at $0.004 per page, but trailed several models overall until a later annotation-enabled rerun lifted its score closer to the frontier pack.

You can trace the dispute from VikParuchuri's link roundup, which points to the Chandra repo, the Hugging Face leaderboard, and the Mistral launch post. The argument is not just about one number: VikParuchuri's bar chart of scores with and without post-processing suggests that both Chandra and Mistral move materially with settings, while Jerry Liu's updated ParseBench chart shows Mistral's chart sub-score jumping once annotations are enabled.

The reproducibility gap

The core claim is simple. Vik Paruchuri, creator of Chandra OCR 2, said Mistral's OCR 4 launch numbers for both Chandra and Mistral OCR 4 could not be reproduced from public artifacts.

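His breakdown, as summarized in his reproduction post, separates stock behavior from extra processing:

  • Chandra OCR 2, public code: about 84.7% on olmOCR
  • Chandra OCR 2, with formatting cleanup and high DPI: 85.8%
  • Mistral OCR 4, API with stock settings: 82.9%, versus Mistral's claimed 85.2%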

VikParuchuri's repro-script post says the missing piece on his side was that Chandra had not shipped a turnkey reproduction script at launch. He published one after the dispute surfaced, and his later correction also trimmed Chandra's own previously stated score from 85.9% to 85.8% after finding a text normalization divergence.
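
To make the size of each gap concrete, here is a minimal sketch that tallies the olmOCR figures quoted in this article. The dictionary keys and the `gap` helper are illustrative names, and the numbers are simply the ones reported here, not independent measurements.

```python
# olmOCR scores (percent) as quoted in this article.
# Key names are illustrative; none come from the repro scripts.
reported = {
    "mistral_ocr4_claimed": 85.2,          # Mistral's launch figure
    "mistral_ocr4_stock_api": 82.9,        # VikParuchuri's API test, stock settings
    "chandra_public_code": 84.7,           # approximate, public Chandra code
    "chandra_cleanup_high_dpi": 85.8,      # with formatting cleanup and high DPI
    "chandra_previously_published": 85.9,  # before the normalization correction
}


def gap(higher: str, lower: str) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(reported[higher] - reported[lower], 1)


# Claimed versus reproduced score for Mistral OCR 4.
mistral_gap = gap("mistral_ocr4_claimed", "mistral_ocr4_stock_api")

# Effect of post-processing and DPI settings on Chandra.
chandra_settings_gap = gap("chandra_cleanup_high_dpi", "chandra_public_code")

# Size of Chandra's own score correction.
chandra_correction = gap("chandra_previously_published", "chandra_cleanup_high_dpi")

print(f"Mistral claimed vs. reproduced: {mistral_gap} points")
print(f"Chandra stock vs. tuned:        {chandra_settings_gap} points")
print(f"Chandra score correction:       {chandra_correction} points")
```

On these figures, the unexplained Mistral gap is roughly twice the swing that settings alone produce for Chandra, and both dwarf the correction Chandra made to its own number.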

The omissions and ground-truth questions

The launch comparison also drew criticism for what it did not show. VikParuchuri said Infinity Parser, which he reports at 87.6% on olmOCR, was omitted from Mistral's comparison table, and bclavie called out the exclusion of two public leaderboard entries that outperformed Mistral.

A second thread of criticism focused on benchmark labels rather than model settings. VikParuchuri's later reply says Mistral told commenters it had modified the benchmark ground truth, and bclavie argued that silently changing a benchmark is extremely uncommon, especially when the change appears to lower a competitor while boosting the reporting model.

No tweet in the evidence pool includes the modified ground truth itself. The public ask, in both bclavie's post and the follow-up reply, was for Mistral to release it.

ParseBench looks different

Jerry Liu, founder of LlamaIndex, posted a separate evaluation on ParseBench that makes the story less one-dimensional. His first run put Mistral OCR 4 at 60.7 overall, below Chandra OCR 2 at 70.1, Gemini 3.1 Pro at 69.1, and GPT-5.5 at 67.8, according to Liu's leaderboard post.

The more useful detail is in Mistral OCR 4's sub-scores from that initial run:

  • Semantic formatting: 66.4 for Mistral OCR 4, ahead of Gemini 3.1 Pro's 52.4 and GPT-5.5's 60.1
  • Content faithfulness: 89.6, above Chandra OCR 2's 83.7 and Azure DI's 84.9
  • Visual grounding: 71.2, close to Azure DI's 73.8 and above GPT-5.5's 36.3
  • Tables: 73.9, well behind Gemini 3.1 Pro's 91.0 and GPT-5.5's 90.1
  • Charts: 2.4, effectively absent in the initial run
  • Price: 0.4 cents per page, far below Gemini 3.1 Pro at 8.5 cents and GPT-5.5 at 13.1 cents

That split helps explain why Mistral could look competitive in one benchmark and underwhelming in another. ParseBench weights chart handling and table extraction explicitly, while Liu's summary in his post says Mistral wins on semantic formatting and stays competitive on faithfulness and grounding.
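
One way to see how much the chart sub-score drags the total: the five sub-scores listed above average to the reported overall of 60.7. The sketch below is arithmetic on the article's figures only. That ParseBench aggregates with an unweighted mean is an assumption inferred from the numbers matching, not something confirmed by Liu's post.

```python
# Mistral OCR 4 sub-scores from the initial ParseBench run, as quoted above.
initial = {
    "tables": 73.9,
    "charts": 2.4,
    "faithfulness": 89.6,
    "formatting": 66.4,
    "grounding": 71.2,
}

# Assumption: the overall score is the unweighted mean of the five dimensions.
mean_all = round(sum(initial.values()) / len(initial), 1)

# For comparison, the mean of the four dimensions other than charts.
non_chart = [score for name, score in initial.items() if name != "charts"]
mean_without_charts = round(sum(non_chart) / len(non_chart), 1)

print(f"Mean of five sub-scores: {mean_all}")
print(f"Mean without charts:     {mean_without_charts}")
```

Under that assumption, a single near-zero dimension accounts for most of the distance between Mistral OCR 4 and the models ranked above it.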

Repro scripts and score corrections

The most concrete thing that changed after the argument was process. VikParuchuri said future benchmarked model releases from his side will ship with reproduction scripts, and he specifically named Surya OCR 2 as another model that may get the same treatment.

The benchmark picture also moved again. Jerry Liu's updated ParseBench run says enabling Mistral's annotation feature raises its overall score from 60.7 to 68.2, just ahead of GPT-5.5's 67.8 and just behind Gemini 3.1 Pro's 69.1. The chart sub-score jumps from 2.4 to 40.1, while the table, faithfulness, formatting, and grounding numbers stay the same.
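
The rerun numbers are internally consistent if the overall score is an unweighted mean of five dimensions, which is an assumption here rather than documented ParseBench behavior. A short sketch of the check, using only figures quoted in this article:

```python
# Chart sub-score before and after enabling annotations, as quoted above.
chart_before, chart_after = 2.4, 40.1

# Assumption: five equally weighted dimensions feed the overall score.
n_dimensions = 5
expected_lift = round((chart_after - chart_before) / n_dimensions, 1)

# Reported overall scores for Mistral OCR 4.
overall_before, overall_after = 60.7, 68.2
reported_lift = round(overall_after - overall_before, 1)

# Overall scores for the other models named in this article.
others = {"Chandra OCR 2": 70.1, "Gemini 3.1 Pro": 69.1, "GPT-5.5": 67.8}


def rank(mistral_score: float) -> int:
    """1-based rank of Mistral OCR 4 among the models listed here."""
    scores = sorted(list(others.values()) + [mistral_score], reverse=True)
    return scores.index(mistral_score) + 1


print(f"Expected lift from charts alone: {expected_lift}")
print(f"Reported lift in overall score:  {reported_lift}")
print(f"Rank among the four models: {rank(overall_before)} -> {rank(overall_after)}")
```

The lift in the overall score matches what the chart change alone would produce, which fits the statement that the other four sub-scores did not move.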

That late update introduces a narrower claim than the launch-week dispute: even when a model is cheap and capable, benchmark standings can swing hard on whether the evaluator exposes optional features and publishes the exact harness used to score them.
