Reason-ModernColBERT claims nearly 90% on BrowseComp-Plus with a 150M retriever
LightOn says its 150M multi-vector retriever is pushing BrowseComp-Plus close to saturation, with results showing that search-call behavior and retriever choice matter nearly as much as model size. Retrieval engineers should check the multi-hop setup and tool-calling limits before copying the benchmark setup.

TL;DR
- LightOn's Antoine Chaffin says a 150M retriever, Reason-ModernColBERT, has pushed BrowseComp-Plus to "nearly 90%" and outperformed much larger retrievers, including models "54× bigger," though the post frames this as a benchmark claim rather than a full paper or model release.
- The strongest concrete evidence in the thread is that retriever choice sharply changes end-to-end agent accuracy: the results table shows GPT-5 rising from 55.90% with BM25 to 70.12% with Qwen3-Embed-8B, while o3 moves from 49.28% to 63.49%.
- BrowseComp-Plus appears heavily constrained by multi-hop search behavior, not just model size: Ben Clavié's task example says the benchmark is "near-impossible single hop," and Chaffin's tool-calling note ties poor scores to agents that make one search call or fewer.
- LightOn is already reframing the result as a cost-floor question: Chaffin's next-challenge post says the team has "almost saturated BrowseComp-Plus" with an older 150M model and now wants to see "the cheapest setup" that can solve it.
What changed on BrowseComp-Plus
Chaffin is positioning the result as a retrieval story, not a frontier-model story. His launch claim says Reason-ModernColBERT, a 150M "multi-vector model," now solves BrowseComp-Plus at nearly 90% and beats larger baselines across metrics, while a follow-up next-challenge post says the team has "almost saturated BrowseComp-Plus" despite using "an old model," with more ideas left to test.
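For readers unfamiliar with multi-vector retrieval: ColBERT-style models like Reason-ModernColBERT score a document by late interaction, comparing every query token embedding against every document token embedding rather than a single pooled vector. The sketch below shows the generic MaxSim scoring that this family of models uses; the shapes and variable names are illustrative and not taken from LightOn's code.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token vector, take its
    maximum similarity over all document token vectors, then sum over query tokens.

    query_vecs: (num_query_tokens, dim), assumed L2-normalized
    doc_vecs:   (num_doc_tokens, dim),   assumed L2-normalized
    """
    # (num_query_tokens, num_doc_tokens) cosine-similarity matrix
    sims = query_vecs @ doc_vecs.T
    # MaxSim: best-matching document token per query token, summed
    return float(sims.max(axis=1).sum())

# Toy usage with random, normalized token embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(300, 128))
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))
```

The per-token matching is what lets a small model punch above its weight on reasoning-heavy queries: no single pooled vector has to capture every constraint in the question.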
The oracle-retrieval screenshot explains why that claim matters. In that setup, GPT-4.1 reaches 93.49% when given all labeled positive documents, versus 14.58% with BM25, which suggests the benchmark ceiling is largely reachable if retrieval is strong enough and that the remaining gap is not mostly a corpus-quality problem.
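As a rough mental model of what the oracle condition measures, here is a minimal, hypothetical evaluation harness (not BrowseComp-Plus's actual code): the same reader LLM is scored once with a real retriever and once with the labeled positive documents handed to it directly, so the oracle number upper-bounds what perfect retrieval could add. The item schema and exact-match scoring are assumptions for illustration.

```python
from typing import Callable, Optional

Retriever = Callable[[str, int], list[str]]   # (query, k) -> documents
Reader = Callable[[str, list[str]], str]      # (question, context docs) -> answer

def evaluate(items: list[dict], reader: Reader,
             retriever: Optional[Retriever] = None, k: int = 5) -> float:
    """Score a reader LLM with a real retriever (BM25, dense, multi-vector, ...)
    or, when retriever is None, in the oracle condition where it is handed
    every labeled positive document for the question."""
    correct = 0
    for item in items:  # assumed schema: {"question", "answer", "gold_docs"}
        docs = item["gold_docs"] if retriever is None else retriever(item["question"], k)
        prediction = reader(item["question"], docs)
        correct += int(prediction.strip().lower() == item["answer"].strip().lower())
    return correct / len(items)
```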
Why the setup matters more than raw model size
The thread's most useful engineering detail is that retriever quality and tool-use behavior move scores almost as much as the LLM does. The results table shows GPT-4.1 at 14.58% with BM25 and 35.42% with Qwen3-Embed-8B; o3 at 49.28% versus 63.49%; and GPT-5 at 55.90% versus 70.12%. In the same table, Qwen3-32B stays near 10% regardless of retriever and averages under one search call, which Chaffin's tool-calling caveat attributes to weak tool calling rather than pure model size.
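A practical way to check that tool-calling failure mode in your own runs is to count search calls per episode from agent traces before comparing accuracy numbers. The step schema below ({"type": "tool_call", "tool": "search"}) is an assumption for illustration, not the benchmark's actual log format.

```python
from statistics import mean

def search_call_stats(traces: list[list[dict]]) -> dict:
    """Summarize tool-calling behavior across agent traces.

    Each trace is a list of steps; a step is assumed to look like
    {"type": "tool_call", "tool": "search", ...} or {"type": "answer", ...}.
    """
    calls_per_episode = [
        sum(1 for step in trace
            if step.get("type") == "tool_call" and step.get("tool") == "search")
        for trace in traces
    ]
    return {
        "mean_search_calls": mean(calls_per_episode),
        # Fraction of episodes with one search call or fewer, the failure
        # pattern Chaffin associates with very low scores
        "pct_single_or_zero": sum(c <= 1 for c in calls_per_episode) / len(calls_per_episode),
    }
```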
Clavié's task example gives the clearest reason single-hop shortcuts are unlikely to work here: BrowseComp-Plus queries are designed to require chained evidence, and he says "10-15% would be a hard limit" for single-hop approaches. Chaffin's tool-calling note makes the same point from the results side, arguing that models that "struggle to call the search tool" and stay at one or fewer calls post "very bad results."
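The multi-hop requirement implies an agent loop that keeps issuing search calls and folding results back into the next query; a single-hop baseline is just the one-iteration special case. The sketch below is a generic loop with hypothetical callables, not the harness used in the thread.

```python
from typing import Callable, Optional

def multi_hop_answer(question: str,
                     search: Callable[[str, int], list[str]],
                     next_query: Callable[[str, list[str]], Optional[str]],
                     answer: Callable[[str, list[str]], str],
                     max_hops: int = 4) -> str:
    """Generic iterative-search loop: keep searching while the model proposes
    follow-up queries, then answer from the accumulated evidence.
    A single-hop baseline is the max_hops=1 special case."""
    evidence: list[str] = []
    query: Optional[str] = question
    for _ in range(max_hops):
        evidence.extend(search(query, 5))
        query = next_query(question, evidence)  # None means "ready to answer"
        if query is None:
            break
    return answer(question, evidence)
```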