Breaking · March 19, 2026

Reason-ModernColBERT scores 87.59 on BrowseComp-Plus

LightOn’s late-interaction retriever, paired with GPT-5, reached 87.59 accuracy on BrowseComp-Plus while using fewer search calls than larger baselines. The result suggests deep-research quality may now hinge more on retrieval architecture than on swapping in ever-larger LLMs.

Search Reranking Benchmarks Deep Research

4 min read

TL;DR

LightOn’s benchmark thread says its 150M-parameter Reason-ModernColBERT, paired with GPT-5, reached 87.59 accuracy on BrowseComp-Plus, a 7.59-point gain over the previous best.
The same benchmark thread reports wins on recall and calibration too: 83.52% recall versus 80.29%, and 7.46 calibration error versus 7.92, while using fewer search calls.
Practitioner reaction centered on model-size efficiency: a follow-up thread notes the strongest prior retriever baseline was Qwen3-8B, about 54 times larger than the 150M ColBERT model.
The result also points to a workflow change: according to the thread's scaffold details, adding a simple get_document(id) fetch tool improved performance over the official top-5-snippet-only scaffold, which suggests retrieval quality is driving deep-research performance more than bigger generators alone.

What actually beat the prior BrowseComp-Plus runs?

According to the primary results thread, Reason-ModernColBERT topped BrowseComp-Plus with GPT-5 at 87.59 accuracy, improving on the prior best by 7.59 points while also setting the best reported recall and calibration error in the same run. That matters because earlier bests on those metrics came from different runs, so this is not just a single-metric win.

The size gap is unusually large for a retriever result. In a companion reaction, the baseline being beaten is described as Qwen3-8B, which makes the winning 150M-parameter retriever roughly 54 times smaller. LightOn’s own thread adds, in a base-versus-reasoning comparison, that even its generic GTE-ModernColBERT base model outperformed Qwen3-8B, which suggests the gain comes not only from reasoning fine-tuning but from the retrieval architecture itself.

Why does the retrieval setup matter so much?

The most concrete implementation detail in the primary thread is the scaffold difference. The official BrowseComp-Plus setup exposes a search tool that returns the top-5 documents, each truncated to its first 512 tokens, while LightOn also tested a minimal variant that adds get_document(id) so the LLM can pull full documents on demand. The post says that simple change both boosted performance and reduced search calls, which implies the retriever is surfacing high-signal candidates early enough that the model needs fewer tool invocations.
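To make the scaffold difference concrete, here is a minimal sketch of the two tools in Python. It is not LightOn's or BrowseComp-Plus's actual code: the `ToolScaffold` class, the `overlap_score` placeholder ranker, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions made for illustration. Only the top-5 cutoff, the 512-token snippet limit, and the `get_document(id)` name come from the thread.

```python
from dataclasses import dataclass

TOP_K = 5             # the official scaffold returns the top-5 documents
SNIPPET_TOKENS = 512  # and only the first 512 tokens of each


@dataclass
class Document:
    doc_id: str
    text: str


def overlap_score(query: str, text: str) -> int:
    """Placeholder relevance score (hypothetical): count of shared lowercase words.
    A real scaffold would call the retriever here instead."""
    return len(set(query.lower().split()) & set(text.lower().split()))


class ToolScaffold:
    def __init__(self, documents):
        self.by_id = {d.doc_id: d for d in documents}

    def search(self, query: str):
        """Return the top-k hits, each cut down to a snippet.
        Whitespace-separated words stand in for model tokens."""
        ranked = sorted(
            self.by_id.values(),
            key=lambda d: overlap_score(query, d.text),
            reverse=True,
        )
        return [
            {"id": d.doc_id, "snippet": " ".join(d.text.split()[:SNIPPET_TOKENS])}
            for d in ranked[:TOP_K]
        ]

    def get_document(self, doc_id: str) -> str:
        """The extra tool in the minimal variant: fetch the full text by id."""
        return self.by_id[doc_id].text


# A long document whose key fact sits past the snippet cutoff.
long_text = "colbert late interaction " + "filler " * 600 + "answer-at-the-end"
scaffold = ToolScaffold([Document("d1", long_text), Document("d2", "unrelated note")])

hits = scaffold.search("late interaction colbert")
full = scaffold.get_document(hits[0]["id"])
```

In this toy setup the snippet for `d1` stops before the final word, so a snippet-only agent would have to keep searching, while one call to `get_document` recovers the full text. That is one plausible reading of why the added tool could reduce search calls.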

That interpretation is echoed in community discussion. In a direct reply, Jo Kristian Bergum argues the result shows “deep research is a retrieval problem, not a reasoning problem” and says the score is approaching oracle-level accuracy. A separate reaction from the late-interaction thread makes a sharper claim: the point is not that late interaction is incrementally better, but that dense single-vector retrievers are the real bottleneck on quality and generalization.
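For readers less familiar with the distinction, the sketch below contrasts ColBERT-style late-interaction scoring (MaxSim: for each query token, take the best-matching document token and sum those maxima) with a single-vector score. The toy 2-D embeddings are made up, and mean pooling is only an assumed stand-in for a dense retriever, which in practice uses a trained pooling scheme. This is an illustration of the scoring idea, not the code behind the reported numbers.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: sum over query tokens of the best document-token match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def mean_pool(vecs):
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def single_vector_score(query_vecs, doc_vecs):
    """Single-vector stand-in: cosine similarity of mean-pooled token vectors."""
    return dot(normalize(mean_pool(query_vecs)), normalize(mean_pool(doc_vecs)))


s = 1 / math.sqrt(2)
query = [[1.0, 0.0], [0.0, 1.0]]   # two distinct query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches each query token exactly
doc_b = [[s, s], [s, s]]           # only matches the "average" direction

late_a, late_b = maxsim_score(query, doc_a), maxsim_score(query, doc_b)
dense_a, dense_b = single_vector_score(query, doc_a), single_vector_score(query, doc_b)
```

Here the pooled vectors of `doc_a` and `doc_b` point in the same direction, so the single-vector score cannot separate them, while MaxSim prefers `doc_a` because it matches each query token individually. Real models are far less clear-cut, but this is the kind of information that pooling into one vector can discard.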

What can engineers actually use from this result?

This was not posted as a closed benchmark stunt. The main thread says the models, training code, and data are open, with links to the Hugging Face checkpoints for the Reason and GTE models, plus PyLate training examples for each. It also says the models were trained in about four hours and that PyLate keeps the workflow close to Sentence Transformers, so existing dense-retrieval pipelines can be adapted rather than rebuilt (see the PyLate docs).

There is also some immediate systems-level optimization work around this model family. In the Sentence Transformers note, Tom Aarsen says he is adding input flattening to remove padding tokens with Flash Attention 2 and has seen about “+50% training and inference speed” in tests. Combined with the benchmark thread showing fewer search calls at higher quality, the engineering takeaway is that late-interaction retrieval is getting both algorithmic and runtime wins at the same time.
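As a rough illustration of what input flattening means, the sketch below packs a padded batch into one flat token list plus cumulative sequence lengths, which is the general shape of input that variable-length attention kernels consume. The function name and toy batch are assumptions, and this is not the Sentence Transformers implementation. It only shows where the savings would come from: padding positions are never processed.

```python
def flatten_batch(padded_ids, attention_mask):
    """Drop padding and concatenate sequences into one flat list.

    Returns the flat token ids, the cumulative sequence boundaries,
    and the longest real sequence length in the batch.
    """
    flat_ids = []
    cu_seqlens = [0]
    for ids, mask in zip(padded_ids, attention_mask):
        real = [tok for tok, m in zip(ids, mask) if m == 1]
        flat_ids.extend(real)
        cu_seqlens.append(cu_seqlens[-1] + len(real))
    max_seqlen = max(end - start for start, end in zip(cu_seqlens, cu_seqlens[1:]))
    return flat_ids, cu_seqlens, max_seqlen


# Toy batch: two sequences padded to length 4 with token id 0.
padded_ids = [[5, 6, 7, 0], [8, 9, 0, 0]]
attention_mask = [[1, 1, 1, 0], [1, 1, 0, 0]]

flat_ids, cu_seqlens, max_seqlen = flatten_batch(padded_ids, attention_mask)
padded_positions = sum(len(row) for row in padded_ids)  # 8 positions with padding
real_positions = len(flat_ids)                          # 5 positions without
```

The share of positions saved depends entirely on how uneven the sequence lengths in a batch are, so the quoted speedup should not be expected to carry over to every workload.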

Antoine Chaffin also said in a follow-up reply that BrowseComp-Plus may be “almost solved” and that harder datasets may be needed next, which is a sign this result is pushing beyond leaderboard churn into benchmark-saturation territory.

🧾 More sources

TL;DR (1 tweet)

Top-line benchmark result, metric deltas, and the main implementation takeaway from the scaffold change.

What actually beat the prior BrowseComp-Plus runs? (1 tweet)

Core benchmark claims and the model-size comparison that makes the result notable for engineers evaluating retrieval stacks.

Why does the retrieval setup matter so much? (1 tweet)

Evidence on the scaffold change and practitioner interpretation that retrieval architecture, not just larger LLMs, is driving the gain.

What can engineers actually use from this result? (1 tweet)

Open artifacts and runtime optimization details that make the result actionable beyond the benchmark itself.

What actually beat the prior BrowseComp-Plus runs?

Antoine Chaffin (@antoine_chaffin), replying to @antoine_chaffin

Thanks to @raphaelsrty for being my awesome PyLate co-maintainer and making all of these results possible Thanks to @bclavie for preaching BrowseComp-Plus, which motivated me to try it out Thanks to @zijian42chen for building BrowseComp-Plus and being so helpful with merging the […]

3:14 PM · Mar 19, 2026

Why does the retrieval setup matter so much?

Jo Kristian Bergum (@jobergum), replying to @antoine_chaffin

Yes, the anserini bm25 default hyperparameters are very far from ideal on these long documents. Great work with the multi-vector representation demonstrates that deep research is a retrieval problem, not a reasoning problem. Congratulations on approaching oracle level accuracy!

4:20 PM · Mar 19, 2026

What can engineers actually use from this result?

tomaarsen (@tomaarsen)

This is very very nice to see, multi-vector models really seem like a class of their own here!

Quoting Antoine Chaffin (@antoine_chaffin):

BrowseComp-Plus, perhaps the hardest popular deep research task, is now solved at nearly 90%... ... and all it took was a 150M model ✨ Thrilled to announce that Reason-ModernColBERT did it again and outperform all models (including models 54× bigger) on all metrics

3:19 PM · Mar 19, 2026
