Chandra OCR 2 opens with 85.9 on olmOCR Bench and 90+ language support

Replying to @VikParuchuri

How to get it:
- Huggingface - huggingface.co/datalab-to/cha…
- Github - github.com/datalab-to/cha…
- Demo - datalab.to/playground

Chandra OCR 2 is available as open-source weights, code, and a hosted demo rather than only as a managed OCR API. The release thread points users to a GitHub repo and Hugging Face weights, while the quickstart shows a minimal local flow: install the package, launch chandra_vllm, then run OCR on a PDF from the command line.

The repository description adds the implementation details engineers will care about: local inference via Hugging Face Transformers or a production-oriented vLLM server, structured outputs with layout coordinates, and export targets including Markdown, HTML, and JSON. The same repo also frames the target workload as harder document parsing: handwriting, tables with merged cells, equations rendered as LaTeX, forms, invoices, and multi-column pages.
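To make "structured outputs with layout coordinates" concrete, here is a minimal Python sketch of how a downstream pipeline might turn layout blocks into Markdown or JSON. The `Block` class, its field names, and the block kinds are hypothetical and chosen for illustration only; they are not Chandra's actual output schema, which should be taken from the repo.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Block:
    # Hypothetical layout block; NOT Chandra's real schema.
    kind: str    # assumed labels: "heading", "equation", "text"
    bbox: list   # assumed [x0, y0, x1, y1] in page pixels
    text: str


def reading_order(blocks):
    # Naive ordering: top-to-bottom, then left-to-right.
    # This only holds for single-column pages.
    return sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))


def to_markdown(blocks):
    parts = []
    for b in reading_order(blocks):
        if b.kind == "heading":
            parts.append("# " + b.text)
        elif b.kind == "equation":
            # Wrap LaTeX in display-math delimiters.
            parts.append("$$" + b.text + "$$")
        else:
            parts.append(b.text)
    return "\n\n".join(parts)


def to_json(blocks):
    # Keep coordinates so consumers can map text back to the page.
    return json.dumps([asdict(b) for b in reading_order(blocks)],
                      ensure_ascii=False)


blocks = [
    Block("text", [10, 50, 200, 80], "Body paragraph."),
    Block("heading", [10, 10, 200, 40], "Title"),
]
print(to_markdown(blocks))
print(to_json(blocks))
```

The point of the sketch is that coordinates travel with the text: the Markdown export discards them after ordering, while the JSON export keeps them for highlighting or region-level review.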

How strong is it, and where does it still break?

I'm excited to open source Chandra OCR 2!
- 85.9% (sota) on olmocr bench
- 90+ language support w/benchmarks
- 4B model (down from 9B)
- Full layout information
- Extracts + captions images and diagrams
- Strong handwriting, math, form, table support

The headline metric is 85.9 on olmOCR Bench, which Datalab describes as state of the art, alongside a multilingual eval that the launch thread says shows “major improvements across languages”. A benchmark screenshot from a separate post places datalab-to/chandra-ocr-2 at the top of the allenai/olmOCR-bench leaderboard, ahead of dots.ocr-1.5 at 83.9 and LightOnOCR-2-1B at 83.2, margins of 2.0 and 2.7 points respectively.

Replying to @VikParuchuri

Here are a few more examples of math, handwriting, image captioning, and layout:

The sample outputs target the messy cases that usually force fallback logic in document pipelines. The image set shows extraction of Chinese academic text with formulas, handwritten math notes, and layout-heavy matrix notation, with the rendered side preserving structure instead of flattening everything into plain text.

Replying to @VikParuchuri

Chandra OCR 2 has some known limitations we're working on:
- Leading line numbers will sometimes be included verbatim
- Very complex newspaper layouts may skip some text
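Until the line-number issue is fixed upstream, a pipeline could guard against it with a post-processing pass. The sketch below is one possible heuristic, not part of Chandra: the function name `strip_leading_line_numbers` and its thresholds are assumptions for illustration. It strips leading integers only when most non-blank lines start with one and the numbers strictly increase, so ordinary prose that happens to begin with a number is left alone.

```python
import re

# Leading integer, optional punctuation, then exactly one space or tab.
# Consuming a single separator keeps any code indentation that follows.
_LEADING_NUM = re.compile(r"^[ \t]*(\d{1,4})[.:|]?[ \t]")


def strip_leading_line_numbers(text, min_fraction=0.6, min_lines=3):
    """Heuristically remove line numbers that OCR copied verbatim.

    Thresholds are illustrative assumptions, not tuned values.
    """
    lines = text.splitlines()
    nonblank = [ln for ln in lines if ln.strip()]
    nums = []
    for ln in nonblank:
        m = _LEADING_NUM.match(ln)
        if m:
            nums.append(int(m.group(1)))

    # Require enough numbered lines, both in count and in proportion.
    if len(nums) < min_lines or len(nums) / len(nonblank) < min_fraction:
        return text
    # Real line numbers increase; anything else is probably content.
    if any(b <= a for a, b in zip(nums, nums[1:])):
        return text

    cleaned = "\n".join(_LEADING_NUM.sub("", ln, count=1) for ln in lines)
    # splitlines() drops a trailing newline, so restore it if present.
    return cleaned + "\n" if text.endswith("\n") else cleaned


sample = "1 def f():\n2     return 1\n3 print(f())"
print(strip_leading_line_numbers(sample))
```

A heuristic like this can still misfire on numbered lists, so it is safer to apply it only to regions already identified as code or legal-style numbered text.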