Breaking · June 30, 2026

OpenAI introduces GeneBench-Pro with GPT-5.6 Sol Pro at 31.5%

OpenAI introduced GeneBench-Pro to test whether agents can handle messy, judgment-heavy computational biology work instead of fixed bio QA. GPT-5.6 Sol Pro reached 31.5%, which shows progress on research workflows but also how far current systems remain from expert-level autonomy.



TL;DR

OpenAI's announcement says GeneBench-Pro is meant to test agent performance on messy computational biology work, not biology trivia or fixed pipelines.
According to Wes Roth's summary and reach_vb's breakdown, the benchmark scores whether a model can inspect noisy data, catch problems, revise assumptions, and defend a conclusion.
deredleritt3r's score chart puts GPT-5.6 Sol Pro at 31.5%, up from 20.5% for GPT-5.5 Pro, a gain of 11 percentage points on OpenAI's hardest reported setting.
The attached benchmark chart in gdb's post shows GPT-5.6 Sol Pro ahead of Opus 4.8 at 16.0%, Gemini 3.5 Flash at 8.1%, and DeepSeek V4 Pro at 2.4%.
Even the top result is far from autonomy: Wes Roth's summary says experts estimated a typical GeneBench-Pro problem would take a human 20 to 40 hours.

You can jump to OpenAI's announcement or inspect the charts attached to the posts above. The test-time compute scaling plot is the most interesting detail, because the strongest line is not the one that spends the most tokens.

GeneBench-Pro

GeneBench-Pro is framed around judgment calls that show up in real computational research. OpenAI's launch post describes agents navigating messy biological data, while reach_vb's summary spells that out as inspecting data, catching bad samples, choosing an analysis, revising assumptions, and producing a defensible conclusion.

That makes it a workflow benchmark more than a recall benchmark. Wes Roth's summary says the goal is to see whether agents can handle noisy, judgment-heavy work instead of fixed biology QA.

GPT-5.6 Sol Pro

The headline number is 31.5% for GPT-5.6 Sol Pro. deredleritt3r's score post also lists the earlier Pro results in sequence:

GPT-5.2 Pro: 8.5%
GPT-5.4 Pro: 16.3%
GPT-5.5 Pro: 20.5%
GPT-5.6 Sol Pro: 31.5%

The same chart in gdb's post places OpenAI's top model well ahead of the non-OpenAI models shown there, with Opus 4.8 at 16.0% and Gemini 3.5 Flash at 8.1%.
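The step sizes behind those numbers are easy to check by hand. The following is a minimal sketch that uses only the percentages quoted above; the variable and function names are illustrative and not taken from OpenAI's materials.

```python
# Pass rates (%) as quoted above from deredleritt3r's score post.
pro_scores = [
    ("GPT-5.2 Pro", 8.5),
    ("GPT-5.4 Pro", 16.3),
    ("GPT-5.5 Pro", 20.5),
    ("GPT-5.6 Sol Pro", 31.5),
]


def point_gains(scores):
    """Return the percentage-point gain of each entry over the one before it."""
    return [
        (later[0], round(later[1] - earlier[1], 1))
        for earlier, later in zip(scores, scores[1:])
    ]


gains = point_gains(pro_scores)
for name, gain in gains:
    print(f"{name}: +{gain} points over the previous Pro model")

# Lead over Opus 4.8 (16.0%) on the chart in gdb's post.
lead_over_opus = round(31.5 - 16.0, 1)
print(f"GPT-5.6 Sol Pro leads Opus 4.8 by {lead_over_opus} points")
```

On these figures, the 11-point step from GPT-5.5 Pro to GPT-5.6 Sol Pro is the largest single jump in the sequence.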

Test-time compute scaling

The scaling plot adds a more useful detail than the headline score. The OCR of that plot shows GPT-5.6 Sol reaching the highest pass rate at roughly 35,000 tokens used, while GPT-5.6 Luna stretches much farther right, to about 120,000 tokens, and still tops out much lower.

That suggests extra test-time compute is not carrying weaker variants to the same ceiling. In this benchmark, model choice still matters more than simply letting an agent think longer.
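To put the token gap in proportion, here is a small sketch using the two approximate token counts read off the plot. Both values are rough readings, and the variable names are illustrative.

```python
# Approximate token counts as described above (read from the scaling plot).
sol_tokens_at_peak = 35_000   # where GPT-5.6 Sol reaches its highest pass rate
luna_tokens_max = 120_000     # how far right GPT-5.6 Luna's curve extends

token_ratio = luna_tokens_max / sol_tokens_at_peak
print(f"Luna's curve extends to about {token_ratio:.1f}x the tokens at Sol's peak")
```

So, on these readings, the weaker variant spends more than three times the tokens and still finishes lower.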

Human-time gap

OpenAI's own framing keeps the claim modest. Wes Roth's summary and gdb's post both say experts estimated a typical GeneBench-Pro problem would take a human researcher 20 to 40 hours.

A 31.5% pass rate is still notable on tasks with that kind of labor behind them. It also means the benchmark is measuring a part of agent progress where failure is still the usual outcome: even the top model falls short more than two-thirds of the time.
