OpenAI introduces GeneBench-Pro with GPT-5.6 Sol Pro at 31.5%
OpenAI introduced GeneBench-Pro to test whether agents can handle messy, judgment-heavy computational biology work instead of fixed bio QA. GPT-5.6 Sol Pro reached 31.5%, which shows progress on research workflows but also how far current systems remain from expert-level autonomy.

TL;DR
- OpenAI says its GeneBench-Pro announcement is meant to test agent performance on messy computational biology work, not biology trivia or fixed pipelines.
- According to Wes Roth's summary and reach_vb's breakdown, the benchmark scores whether a model can inspect noisy data, catch problems, revise assumptions, and defend a conclusion.
- deredleritt3r's score chart puts GPT-5.6 Sol Pro at 31.5%, up from 20.5% for GPT-5.5 Pro, a gain of 11 percentage points on OpenAI's hardest reported setting.
- The attached benchmark chart in gdb's post shows GPT-5.6 Sol Pro ahead of Opus 4.8 at 16.0%, Gemini 3.5 Flash at 8.1%, and DeepSeek V4 Pro at 2.4%.
- Even the top result is far from autonomy: Wes Roth's summary says experts estimated a typical GeneBench-Pro problem would take a human 20 to 40 hours.
You can jump to OpenAI's announcement and inspect the score chart. The scaling plot is the most interesting oddity, because the strongest line is not the one that spends the most tokens.
GeneBench-Pro
GeneBench-Pro is framed around judgment calls that show up in real computational research. OpenAI's launch post describes agents navigating messy biological data, while reach_vb's summary spells that out as inspecting data, catching bad samples, choosing an analysis, revising assumptions, and producing a defensible conclusion.
That makes it a workflow benchmark more than a recall benchmark. Wes Roth's summary says the goal is to see whether agents can handle noisy, judgment-heavy work instead of fixed biology QA.
GPT-5.6 Sol Pro
The headline number is 31.5% for GPT-5.6 Sol Pro. deredleritt3r's score post also lists the earlier Pro results in sequence:
- GPT-5.2 Pro: 8.5%
- GPT-5.4 Pro: 16.3%
- GPT-5.5 Pro: 20.5%
- GPT-5.6 Sol Pro: 31.5%
The same chart in gdb's post places OpenAI's top model well ahead of the non-OpenAI models shown there, with Opus 4.8 at 16.0% and Gemini 3.5 Flash at 8.1%.
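The step-to-step gains are easier to compare once the arithmetic is written out. The sketch below is only an illustration: it uses the four Pro scores quoted above, and the names `openai_pro_scores` and `successive_gains` are made up for this example, not taken from OpenAI's materials. It reports each jump both in percentage points and as a relative increase over the previous score.

```python
# Pass rates (percent) as quoted in this article; nothing here comes from
# OpenAI's own tooling. Names are hypothetical and for illustration only.
openai_pro_scores = [
    ("GPT-5.2 Pro", 8.5),
    ("GPT-5.4 Pro", 16.3),
    ("GPT-5.5 Pro", 20.5),
    ("GPT-5.6 Sol Pro", 31.5),
]


def successive_gains(scores):
    """Return (previous, current, point_gain, relative_gain_pct) per adjacent pair."""
    gains = []
    for (prev_name, prev), (cur_name, cur) in zip(scores, scores[1:]):
        point_gain = round(cur - prev, 1)                    # percentage points
        relative_gain = round((cur - prev) / prev * 100, 1)  # percent of prior score
        gains.append((prev_name, cur_name, point_gain, relative_gain))
    return gains


for prev_name, cur_name, points, relative in successive_gains(openai_pro_scores):
    print(f"{prev_name} -> {cur_name}: +{points} points ({relative}% relative)")
```

On these numbers, the latest step of 11.0 points is the largest absolute gain in the sequence, while the largest relative gain is the earlier jump from 8.5% to 16.3%.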
Test-time compute scaling
The scaling plot adds a more useful detail than the headline score. OCR of the plot shows GPT-5.6 Sol reaching the highest pass rate at roughly 35,000 tokens used, while GPT-5.6 Luna stretches much farther right, to about 120,000 tokens, and still tops out much lower.
That suggests extra test-time compute is not carrying weaker variants to the same ceiling. In this benchmark, model choice still matters more than simply letting an agent think longer.
Human-time gap
OpenAI's own framing keeps the claim modest. Wes Roth's summary and gdb's post both say experts estimated a typical GeneBench-Pro problem would take a human researcher 20 to 40 hours.
A 31.5% pass rate is still notable on tasks with that kind of labor behind them. It also means the top model fails more than two times in three, so the benchmark is measuring a part of agent progress where failure remains the most common outcome.