Release, March 18, 2026

Chandra OCR 2 opens with 85.9 on olmOCR Bench and 90+ language support

Datalab open-sourced Chandra OCR 2, a 4B-parameter document model released with a repo, weights, a demo, and a CLI quickstart, and claims a state-of-the-art 85.9 on olmOCR Bench. It gives document pipelines a practical multilingual OCR option that can run with local tooling instead of only hosted APIs.

TL;DR

Datalab open-sourced Chandra OCR 2 as a 4B document OCR model; the launch post says it reached “85.9% (sota) on olmocr bench” and added “90+ language support”.
The release includes a distribution post linking the GitHub repo, Hugging Face weights, and demo, plus a CLI quickstart that installs chandra-ocr, starts chandra_vllm, and runs chandra input.pdf ./output.
According to the linked GitHub repo and HF weights, Chandra OCR 2 supports local inference through Transformers or a vLLM server and returns layout-aware outputs in formats including Markdown, HTML, and JSON.
The model ships with known caveats: Datalab's limitations note says leading line numbers can be copied verbatim and that other extraction errors are still being worked through.

What shipped for document pipelines?

Chandra OCR 2 is available as open-source weights, code, and a hosted demo rather than only as a managed OCR API. The release thread points users to a GitHub repo and HF weights, while the quickstart shows a minimal local flow: install the package, launch chandra_vllm, then run OCR on a PDF from the command line.
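
For teams that script their pipelines, the three quickstart steps can be sketched in Python as plain argument lists. This is a minimal sketch, not Datalab's tooling: the command names and arguments are the ones quoted in the quickstart, while pip as the installer and the helper names (quickstart_commands, render, run_ocr_step) are assumptions made for illustration.

```python
import shlex
import subprocess


def quickstart_commands(pdf: str, out_dir: str) -> list[list[str]]:
    """Return the quickstart steps as argv lists, in the order the article gives."""
    return [
        # Assumption: pip is the installer; the article only names the package.
        ["pip", "install", "chandra-ocr"],
        # Launches the vLLM-backed server; expected to keep running.
        ["chandra_vllm"],
        # Runs OCR on the PDF and writes results to the output directory.
        ["chandra", pdf, out_dir],
    ]


def render(commands: list[list[str]]) -> str:
    """Render argv lists as shell-safe command lines for review or logging."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


def run_ocr_step(pdf: str, out_dir: str) -> subprocess.CompletedProcess:
    """Run only the final step.

    Assumes the package is installed and chandra_vllm is already running
    in another process; raises CalledProcessError on a non-zero exit.
    """
    return subprocess.run(["chandra", pdf, out_dir], check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    print(render(quickstart_commands("input.pdf", "./output")))
```

Keeping the steps as data makes it easy to log exactly what ran, and leaves server startup and readiness checks, which the article does not describe, to the surrounding pipeline.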

The repository description adds the implementation details engineers will care about: local inference via Hugging Face Transformers or a production-oriented vLLM server, structured outputs with layout coordinates, and export targets including Markdown, HTML, and JSON. The same repo also frames the target workload as harder document parsing, including handwriting, tables with merged cells, equations rendered as LaTeX, forms, invoices, and multi-column pages.

How strong is it, and where does it still break?

The headline metric is 85.9 on olmOCR Bench, which Datalab describes as state of the art, alongside a multilingual eval showing “major improvements across languages” in the launch thread. A benchmark screenshot from a separate post places datalab-to/chandra-ocr-2 at the top of the allenai/olmOCR-bench leaderboard, ahead of dots.ocr-1.5 at 83.9 and LightOnOCR-2-1B at 83.2.
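
To put the lead in concrete terms, the margins can be worked out from the three scores quoted above. The snippet below is only arithmetic on those figures; the variable names are illustrative, and no other leaderboard entries are assumed.

```python
# Scores as quoted in the article (olmOCR Bench).
scores = {
    "datalab-to/chandra-ocr-2": 85.9,
    "dots.ocr-1.5": 83.9,
    "LightOnOCR-2-1B": 83.2,
}

# The leader is the entry with the highest score.
leader = max(scores, key=scores.get)

# Margin of the leader over each other entry, in benchmark points.
# Rounded to one decimal to avoid floating-point noise.
margins = {
    name: round(scores[leader] - score, 1)
    for name, score in scores.items()
    if name != leader
}

for name, gap in margins.items():
    print(f"{leader} leads {name} by {gap} points")
```

On these numbers, the gap is 2.0 points over dots.ocr-1.5 and 2.7 points over LightOnOCR-2-1B.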

The sample outputs are aimed at the messy cases that usually force fallback logic in document pipelines. The image set shows extraction of Chinese academic text with formulas, handwritten math notes, and layout-heavy matrix notation, with the rendered side preserving structure instead of flattening everything into plain text.

The release is not pitched as flawless. Datalab says known limitations include cases where leading line numbers are reproduced verbatim, which matters for downstream parsing, chunking, and citation-sensitive workflows.
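
One way a downstream pipeline might guard against the line-number caveat is a post-processing pass over the extracted text. The sketch below is a heuristic of our own, not part of Chandra OCR 2: it strips a bare leading integer only when at least min_run consecutive lines carry consecutive numbers, so an isolated line that merely starts with a number is left alone. The function name and the min_run threshold are assumptions.

```python
import re

# A bare integer, then whitespace, then the rest of the line.
# "1. item" does not match, so ordinary ordered lists are left untouched.
_LEADING_NUM = re.compile(r"^\s*(\d+)\s+(.*)$")


def strip_leading_line_numbers(text: str, min_run: int = 3) -> str:
    """Remove leading numbers from runs of consecutively numbered lines."""
    lines = text.splitlines()
    parsed = [_LEADING_NUM.match(line) for line in lines]
    out = list(lines)
    i = 0
    while i < len(lines):
        if parsed[i] is None:
            i += 1
            continue
        # Extend the run while the next line's number is exactly one higher.
        j = i
        while (
            j + 1 < len(lines)
            and parsed[j + 1] is not None
            and int(parsed[j + 1].group(1)) == int(parsed[j].group(1)) + 1
        ):
            j += 1
        # Only strip when the run is long enough to look like line numbering.
        if j - i + 1 >= min_run:
            for k in range(i, j + 1):
                out[k] = parsed[k].group(2)
        i = j + 1
    return "\n".join(out)


if __name__ == "__main__":
    sample = "1 The quick\n2 brown fox\n3 jumps over\nIn 2024 revenue grew"
    print(strip_leading_line_numbers(sample))
```

The heuristic can still misfire, for example on numeric table rows that happen to count upward, so citation-sensitive workflows would likely want to log what was stripped rather than discard it silently.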
