AI Primer

LightOn releases LateOn and DenseOn at 149M params with BEIR 57.22

LightOn open-sourced DenseOn and LateOn plus the training pipeline behind them, including 1.4 billion query-document pairs and decontaminated BEIR results. Teams can use the small open retrieval models and reproducible data mixtures instead of opaque closed-data baselines.


TL;DR

You can read the official blog post, browse the full models and datasets collection, inspect the 1.4B-pair pretraining data, and jump straight to PyLate's docs. beirmug's reply immediately zeroed in on the interesting part: open data, methods, and models all landing together.

DenseOn and LateOn

The headline numbers are simple. antoine_chaffin's DenseOn result post says DenseOn reached 56.20 on BEIR, making it the first sub-150M dense retriever past 56, while antoine_chaffin's LateOn result post says LateOn reached 57.22, the first ColBERT and first sub-150M model past 57.

The comparison points matter because both models are small. According to the DenseOn result post, DenseOn beat GTE-ModernBERT at 55.19 and also cleared larger models including Arctic-Embed-L-v2 at 568M and Qwen3-Embedding-0.6B at 595M. According to the LateOn result post, LateOn beat ColBERT-Zero at 55.32 by nearly two points and did it without prompts or knowledge distillation.
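The difference between the two model families comes down to how they score a query against a document: a dense retriever like DenseOn compares one vector per text, while a ColBERT-style model like LateOn keeps one vector per token and scores with MaxSim. The toy sketch below illustrates that contrast only; the vectors are made up, and none of this reflects LateOn's actual trained encoder.

```python
# Toy contrast between single-vector (dense) and ColBERT-style
# late-interaction (MaxSim) scoring. Illustrative assumption: the
# embeddings here are hand-picked 2-d vectors, not real model output.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dense_score(query_vec, doc_vec):
    # Dense retrieval (DenseOn-style): one vector per text.
    return dot(query_vec, doc_vec)

def maxsim_score(query_tokens, doc_tokens):
    # Late interaction (LateOn/ColBERT-style): one vector per token.
    # Each query token is matched to its best-scoring document token,
    # and the per-token maxima are summed.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]              # two query token vectors
    doc = [[0.75, 0.25], [0.25, 0.75], [0.5, 0.5]]  # three doc token vectors
    print(maxsim_score(query, doc))  # 0.75 + 0.75 = 1.5
```

Because MaxSim keeps token-level granularity at query time, ColBERT-style models can match fine-grained terms that a single pooled vector averages away, which is one common explanation for their robustness.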

The official package is split cleanly into models, data, and tooling.

The 1.4B-pair dataset release

The more bookmarkable part of this launch is the data. antoine_chaffin's dataset post says LightOn released more than 1.4B query-document pairs and added a new web dataset built on FineWeb-Edu instead of older Common Crawl-heavy mixtures.

The release is structured in layers, not as one opaque blob:

  1. Raw pretraining sources in the embeddings-pre-training dataset
  2. A curated mixture with filter metadata in the embeddings-pre-training-curated dataset
  3. Fine-tuning data with mined hard negatives in the embeddings-fine-tuning dataset
  4. Model checkpoints and companion assets in the DenseOn and LateOn collection

antoine_chaffin's curation post says the curation steps were non-destructive, with per-source structural filters, deduplication, and cross-encoder relevance scores exposed as metadata. His fine-tuning data post says the team also released all 2048 nearest candidates and retriever scores for 1.88M queries, which makes the hard-negative setup reusable for other thresholds and even knowledge-distillation-style training.
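Releasing the full candidate lists with scores means anyone can re-mine hard negatives at their own thresholds instead of inheriting LightOn's choices. A minimal sketch of what that re-mining could look like, assuming a hypothetical record layout (the field names `doc_id`, `score`, and `is_positive` are illustrative, not the actual dataset schema):

```python
# Hedged sketch: re-mining hard negatives from released candidate
# lists. The record fields below are illustrative assumptions.

def mine_hard_negatives(candidates, lo=0.3, hi=0.8, k=4):
    """Keep non-positive candidates whose retriever score falls in
    [lo, hi): high enough to be hard, low enough to skip likely
    false negatives. Returns at most k doc ids, best first."""
    negs = [c for c in candidates
            if not c["is_positive"] and lo <= c["score"] < hi]
    negs.sort(key=lambda c: c["score"], reverse=True)
    return [c["doc_id"] for c in negs[:k]]

candidates = [
    {"doc_id": "d1", "score": 0.95, "is_positive": True},
    {"doc_id": "d2", "score": 0.85, "is_positive": False},  # likely false negative
    {"doc_id": "d3", "score": 0.70, "is_positive": False},
    {"doc_id": "d4", "score": 0.55, "is_positive": False},
    {"doc_id": "d5", "score": 0.10, "is_positive": False},  # too easy to be useful
]
print(mine_hard_negatives(candidates))  # ['d3', 'd4']
```

The same released scores could instead be kept as soft labels for knowledge-distillation-style training, which is exactly the reusability the post highlights.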

Decontaminated BEIR

LightOn leaned hard into contamination as the credibility test. antoine_chaffin's decontamination note says the team removed BEIR data overlapping with training data, then re-ran the evals.

According to the decontaminated BEIR post, LateOn kept the top spot and DenseOn stayed in the top four even though the decontamination set was derived from LightOn's own data. antoine_chaffin's generalization post adds a broader pattern: all three ColBERT models held or improved rank after decontamination, while some dense baselines dropped sharply, including GTE-ModernBERT and Qwen3-Embedding.
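The posts don't spell out LightOn's exact matching procedure, but the simplest form of decontamination is exact-match removal after normalization: fingerprint every training document and drop any eval item whose normalized text collides. A minimal sketch of that idea, not a description of LightOn's actual pipeline:

```python
import hashlib
import re

def norm(text):
    # Normalize before hashing: lowercase and collapse whitespace, so
    # trivial formatting differences still count as overlap.
    return re.sub(r"\s+", " ", text.lower()).strip()

def fingerprint(text):
    return hashlib.sha256(norm(text).encode()).hexdigest()

def decontaminate(eval_docs, train_docs):
    # Drop eval documents whose normalized text appears in training data.
    seen = {fingerprint(d) for d in train_docs}
    return [d for d in eval_docs if fingerprint(d) not in seen]

train = ["The quick brown fox.", "Retrieval models rock."]
evals = ["the  quick brown FOX.", "A genuinely unseen passage."]
print(decontaminate(evals, train))  # ['A genuinely unseen passage.']
```

Real decontamination efforts often go further with n-gram or fuzzy matching to catch near-duplicates, which exact hashing misses; the open training data is what makes running any of these checks possible at all.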

That makes this more useful than a leaderboard screenshot. beirmug's follow-up specifically called out the decontamination direction as interesting, and antoine_chaffin's reply argued that open data makes it easier to test whether new methods are actually learning or just overfitting.

The open retrieval pipeline

LateOn and DenseOn are being presented as the end of a pipeline, not a one-off checkpoint drop. In antoine_chaffin's pipeline post, Chaffin says LightOn has now released the code, data, and insights needed to go from random weights to production-ready retrieval models.

That pipeline now has four public pieces: the pretraining, curated, and fine-tuning datasets, plus the DenseOn and LateOn collection.

That full-stack framing is what got practitioners excited fastest. tomaarsen's reaction called it a "banger of a release for anyone into retrieval," while ManuelFaysse's quoted reaction said the missing piece was the data, not another isolated model release.


Discussion across the web

Where this story is being discussed, in original context.

On X · 5 threads
  TL;DR · 2 posts
  DenseOn and LateOn · 1 post
  The 1.4B-pair dataset release · 2 posts
  Decontaminated BEIR · 3 posts
  The open retrieval pipeline · 2 posts