Datalab-to open-sourced Chandra OCR 2, a 4B document model with repo, weights, demo, and CLI quickstart, and claims state-of-the-art 85.9 on olmOCR Bench. It gives document pipelines a practical multilingual OCR option that can run with local tooling instead of only hosted APIs.

chandra-ocr, starts chandra_vllm, and runs chandra input.pdf ./output quickstart.Chandra OCR 2 is available as open-source weights, code, and a hosted demo rather than only as a managed OCR API. The release thread points users to a GitHub repo and HF weights, while the quickstart shows a minimal local flow: install the package, launch chandra_vllm, then run OCR on a PDF from the command line quickstart.
The repository description adds the implementation details engineers will care about: local inference via Hugging Face Transformers or a production-oriented vLLM server, structured outputs with layout coordinates, and export targets including Markdown, HTML, and JSON GitHub repo. The same repo also frames the target workload as harder document parsing, including handwriting, tables with merged cells, equations rendered as LaTeX, forms, invoices, and multi-column pages repo docs.
The headline metric is 85.9 on olmOCR Bench, which Datalab describes as state of the art, alongside a multilingual eval showing “major improvements across languages” in the launch thread eval thread. A benchmark screenshot from a separate post places datalab-to/chandra-ocr-2 at the top of the allenai/olmOCR-bench leaderboard, ahead of dots.ocr-1.5 at 83.9 and LightOnOCR-2-1B at 83.2 leaderboard screenshot.
The sample outputs are aimed at the messy cases that usually force fallback logic in document pipelines. The image set shows extraction of Chinese academic text with formulas, handwritten math notes, and layout-heavy matrix notation, with the rendered side preserving structure instead of flattening everything into plain text
.
The release is not pitched as flawless. Datalab says known limitations include cases where leading line numbers are reproduced verbatim, which matters for downstream parsing, chunking, and citation-sensitive workflows limitations.
How to get it: - Huggingface - huggingface.co/datalab-to/cha… - Github - github.com/datalab-to/cha… - Demo - datalab.to/playground
I'm excited to open source Chandra OCR 2! - 85.9% (sota) on olmocr bench - 90+ language support w/benchmarks - 4B model (down from 9B) - Full layout information - Extracts + captions images and diagrams - Strong handwriting, math, form, table support
Here are a few more examples of math, handwriting, image captioning, and layout:
Chandra OCR 2 has some known limitations we're working on: - Leading line numbers will sometimes be included verbatim - Very complex newspaper layouts may skip some text