LightOn releases LateOn and DenseOn at 149M params with BEIR 57.22
LightOn open-sourced DenseOn and LateOn plus the training pipeline behind them, including 1.4 billion query-document pairs and decontaminated BEIR results. Teams can use the small open retrieval models and reproducible data mixtures instead of opaque closed-data baselines.

TL;DR
- LightOn shipped two Apache-2.0 retrieval models at 149M parameters each, DenseOn for single-vector retrieval and LateOn for ColBERT-style multi-vector retrieval, according to antoine_chaffin's launch post and the linked Hugging Face collection.
- On BEIR, antoine_chaffin's DenseOn result post put DenseOn at 56.20, while antoine_chaffin's LateOn result post put LateOn at 57.22, both above several much larger open baselines.
- The bigger release is the training stack: antoine_chaffin's data release post says LightOn published more than 1.4B query-document pairs, plus curated pretraining and fine-tuning datasets on Hugging Face.
- Contamination was the obvious objection, and antoine_chaffin's decontamination note plus his follow-up on decontaminated BEIR say the models kept their rank after overlapping BEIR data was removed.
- antoine_chaffin's pipeline post framed the drop as a full open path from random weights to production-ready retrievers, with PyLate included for late-interaction serving and training.
You can read the official blog post, browse the full models and datasets collection, inspect the 1.4B-pair pretraining data, and jump straight to PyLate's docs. beirmug's reply immediately zeroed in on the interesting part: open data, methods, and models all landing together.
DenseOn and LateOn
The headline numbers are simple. antoine_chaffin's DenseOn result post says DenseOn reached 56.20 on BEIR, making it the first sub-150M dense retriever past 56, while antoine_chaffin's LateOn result post says LateOn reached 57.22, making it the first ColBERT-style model and the first sub-150M model to pass 57.
The comparison points matter because both models are small. According to the DenseOn result post, DenseOn beat GTE-ModernBERT at 55.19 and also cleared larger models including Arctic-Embed-L-v2 at 568M and Qwen3-Embedding-0.6B at 595M. According to the LateOn result post, LateOn beat ColBERT-Zero at 55.32 by nearly two points and did it without prompts or knowledge distillation.
The official package is split cleanly:
- DenseOn: single-vector dense retrieval, 149M params, via the model collection
- LateOn: ColBERT-style multi-vector retrieval, 149M params, via the same collection
- Serving and experimentation: PyLate for late-interaction workflows, linked by antoine_chaffin's resource post
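The split between the two models comes down to how relevance is scored. The sketch below contrasts the two scoring styles in plain NumPy: a single cosine similarity per document (DenseOn-style dense retrieval) versus ColBERT-style MaxSim (LateOn-style late interaction), where each query token finds its best-matching document token and the per-token maxima are summed. This is a generic illustration of the two paradigms, not LightOn's implementation, and the toy embeddings are random.

```python
import numpy as np

def dense_score(q_vec, d_vec):
    """Single-vector (dense) relevance: one cosine similarity per document."""
    q = q_vec / np.linalg.norm(q_vec)
    d = d_vec / np.linalg.norm(d_vec)
    return float(q @ d)

def maxsim_score(q_tokens, d_tokens):
    """Late-interaction (ColBERT-style) relevance: for each query token
    embedding, take its max similarity over all document token embeddings,
    then sum those per-token maxima."""
    q = q_tokens / np.linalg.norm(q_tokens, axis=1, keepdims=True)
    d = d_tokens / np.linalg.norm(d_tokens, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: 4 query tokens and 12 document tokens, 8-dim embeddings.
rng = np.random.default_rng(0)
q_tokens = rng.normal(size=(4, 8))
d_tokens = rng.normal(size=(12, 8))
print("MaxSim:", maxsim_score(q_tokens, d_tokens))
print("Dense :", dense_score(q_tokens.mean(axis=0), d_tokens.mean(axis=0)))
```

The trade-off this illustrates: the dense path stores one vector per document, while the late-interaction path stores one vector per token and pays that storage for finer-grained matching, which is the workload PyLate is built to serve.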
The 1.4B-pair dataset release
The more bookmarkable part of this launch is the data. antoine_chaffin's dataset post says LightOn released more than 1.4B query-document pairs and added a new web dataset built on FineWeb-Edu instead of older Common Crawl-heavy mixtures.
The release is structured in layers, not as one opaque blob:
- Raw pretraining sources in the embeddings-pre-training dataset
- A curated mixture with filter metadata in the embeddings-pre-training-curated dataset
- Fine-tuning data with mined hard negatives in the embeddings-fine-tuning dataset
- Model checkpoints and companion assets in the DenseOn and LateOn collection
antoine_chaffin's curation post says the curation steps were non-destructive, with per-source structural filters, deduplication, and cross-encoder relevance scores exposed as metadata. His fine-tuning data post says the team also released all 2048 nearest candidates and retriever scores for 1.88M queries, which makes the hard-negative setup reusable for other thresholds and even knowledge-distillation-style training.
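Releasing scored candidates rather than a fixed negative set means anyone can re-mine negatives under their own threshold. Below is a minimal sketch of one common heuristic for doing that: skip candidates scoring above a margin of the positive's score (likely false negatives), then keep the highest-scoring survivors as hard negatives. The `(doc_id, score)` schema and the margin value are illustrative assumptions, not LightOn's published recipe.

```python
def select_hard_negatives(candidates, positive_score, margin=0.95, k=8):
    """Pick hard negatives from pre-scored retrieval candidates.

    candidates: list of (doc_id, retriever_score) tuples (hypothetical schema).
    Candidates scoring above margin * positive_score are treated as likely
    false negatives and skipped; the top-k remaining become hard negatives.
    """
    threshold = margin * positive_score
    eligible = [c for c in candidates if c[1] <= threshold]
    eligible.sort(key=lambda c: c[1], reverse=True)  # hardest (highest score) first
    return [doc_id for doc_id, _ in eligible[:k]]

candidates = [("d1", 0.99), ("d2", 0.90), ("d3", 0.70), ("d4", 0.40)]
print(select_hard_negatives(candidates, positive_score=1.0, margin=0.95, k=2))
# d1 (0.99) exceeds the 0.95 threshold, so d2 and d3 are the two hardest negatives
```

Because the release exposes raw scores rather than pre-filtered triples, the same data also supports distillation-style objectives that train on the score distribution itself instead of binary labels.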
Decontaminated BEIR
LightOn leaned hard into contamination as the credibility test. antoine_chaffin's decontamination note says the team removed BEIR data overlapping with training data, then re-ran the evals.
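The note doesn't spell out the matching procedure, but the general shape of this kind of decontamination can be sketched as hash-based overlap removal: normalize text, hash it, and drop any training pair whose query or document collides with an eval-set entry. Exact-match-after-normalization is an assumption here; real pipelines often add fuzzier n-gram matching.

```python
import hashlib

def _normalize(text):
    """Casefold and collapse whitespace so trivial formatting
    differences don't hide an overlap."""
    return " ".join(text.casefold().split())

def _digest(text):
    return hashlib.sha1(_normalize(text).encode("utf-8")).hexdigest()

def decontaminate(train_pairs, eval_texts):
    """Drop (query, document) training pairs where either side also
    appears in the eval set, matching on normalized-text hashes."""
    eval_hashes = {_digest(t) for t in eval_texts}
    return [
        (q, d) for q, d in train_pairs
        if _digest(q) not in eval_hashes and _digest(d) not in eval_hashes
    ]

train = [
    ("what is beir", "BEIR is a retrieval benchmark."),
    ("unrelated query", "An unrelated document."),
]
evals = ["BEIR  is a retrieval benchmark."]  # extra whitespace, different case
print(decontaminate(train, evals))  # only the unrelated pair survives
```

The point of publishing the overlap set alongside the data is that this filter becomes reproducible: anyone can re-run the removal and the evals rather than trusting a screenshot.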
According to the decontaminated BEIR post, LateOn kept the top spot and DenseOn stayed in the top four even though the decontamination set was derived from LightOn's own data. antoine_chaffin's generalization post adds a broader pattern: all three ColBERT models held or improved rank after decontamination, while some dense baselines dropped sharply, including GTE-ModernBERT and Qwen3-Embedding.
That makes this more useful than a leaderboard screenshot. beirmug's follow-up specifically called out the decontamination direction as interesting, and antoine_chaffin's reply argued that open data makes it easier to test whether new methods are actually learning or just overfitting.
The open retrieval pipeline
LateOn and DenseOn are being presented as the end of a pipeline, not a one-off checkpoint drop. In antoine_chaffin's pipeline post, Chaffin says LightOn has now released the code, data, and insights needed to go from random weights to production-ready retrieval models.
That pipeline now has four public pieces:
- Backbone and training ingredients discussed in the official blog post
- Pretraining sources and curated mixtures in the three Hugging Face datasets
- Final model artifacts in the DenseOn and LateOn collection
- Late-interaction tooling in PyLate
That full-stack framing is what got practitioners excited fastest. tomaarsen's reaction called it a "banger of a release for anyone into retrieval," while ManuelFaysse's quoted reaction said the missing piece was the data, not another isolated model release.