LightOn says its 150M multi-vector retriever is pushing BrowseComp-Plus close to saturation, with results showing search-call behavior and retriever choice matter nearly as much as model size. Retrieval engineers should watch multi-hop setup and tool-calling limits before copying the benchmark.

Chaffin is positioning the result as a retrieval story, not a frontier-model story. His launch post claims Reason-ModernColBERT, a 150M "multi-vector model," now solves BrowseComp-Plus at nearly 90% and beats larger baselines across metrics, while a follow-up says the team has "almost saturated BrowseComp-Plus" despite using "an old model," with more ideas left to test.
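"Multi-vector" here means ColBERT-style late interaction: the query and document each keep one embedding per token, and relevance is the sum, over query tokens, of each token's best match in the document. A minimal sketch of that MaxSim scoring (generic illustration, not Reason-ModernColBERT's actual code):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late-interaction score.

    query_vecs: (num_query_tokens, dim), L2-normalized per token
    doc_vecs:   (num_doc_tokens, dim),   L2-normalized per token
    """
    sim = query_vecs @ doc_vecs.T        # pairwise cosine similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed
```

Because documents are encoded independently of queries, the per-token document vectors can be precomputed and indexed, which is what lets a 150M-parameter encoder compete with much larger rerankers.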
The oracle comparison in the [img:5|oracle retrieval] screenshot explains why that claim matters. In that setup, GPT-4.1 reaches 93.49% when given all labeled positive documents, versus 14.58% with BM25, which suggests the benchmark ceiling is largely reachable if retrieval is strong enough and that the remaining gap is not mostly a corpus-quality problem.
The thread's most useful engineering detail is that retriever quality and tool-use behavior move scores almost as much as the LLM does. The results table shows GPT-4.1 at 14.58% with BM25 and 35.42% with Qwen3-Embed-8B; o3 at 49.28% versus 63.49%; and GPT-5 at 55.90% versus 70.12%. In the same table, Qwen3-32B stays near 10% regardless of retriever and averages under one search call, which Chaffin's tool-calling caveat attributes to weak tool calling rather than pure model size.
Clavié's task example gives the clearest reason single-turn shortcuts are unlikely to transfer: BrowseComp-Plus queries are designed to require chained evidence, and he says "10-15% would be a hard limit" for single-hop approaches. Chaffin's tool-calling note makes the same point from the results side, arguing that models that "struggle to call the search tool" and stay at one or fewer calls post "very bad results."
Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
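Cursor has not published the implementation, but the n-gram trick can be sketched: index which files contain each trigram, intersect posting lists for a literal fragment of the query, and run the full regex only on surviving candidates. A toy version (class and file names are illustrative; real systems also compress posting lists, e.g. with Bloom filters, and handle regexes without long literals):

```python
import re
from collections import defaultdict

def trigrams(text: str) -> set:
    """All 3-character substrings of the text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Toy candidate filter: trigram -> set of files containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.files = {}

    def add(self, name: str, content: str):
        self.files[name] = content
        for tri in trigrams(content):
            self.postings[tri].add(name)

    def search(self, pattern: str, literal: str):
        # Any match of the regex must contain `literal`, so a file lacking
        # one of the literal's trigrams can be skipped without running re.
        candidates = set(self.files)
        for tri in trigrams(literal):
            candidates &= self.postings.get(tri, set())
        rx = re.compile(pattern)
        return sorted(f for f in candidates if rx.search(self.files[f]))
```

The regex engine then touches only a handful of files instead of the whole repo, which is where the seconds-to-milliseconds gap comes from.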
ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
I really do not know, because I did not run those. However, given the number of search calls, I would think that only 1 would be limiting. Maybe @zijian42chen ran experiments in this direction? Also, maybe there is a bias due to the LLM, but you can see that the models …
We've almost saturated BrowseComp-Plus with a 150M model... but this was an old model and I had a lot of ideas to improve the results 🙁 So maybe it's time to kick off a new challenge and see what's the cheapest setup we can solve BrowseComp-Plus with?
It wouldn't be a very hard experiment to run, but I'm pretty sure that single-turn would not work by design on BC-Plus. The queries are specifically designed to need multi-hop, e.g. this is one query from it: "A Harvard award-winning author wrote an article less than 5 years …