AI Primer

ValsAI updates Terminal Bench 2 after `tool_choice` bug, moving GPT-5.5 to #1 with +11%

ValsAI found that undocumented `tool_choice` behavior was skewing Terminal Bench 2 scores when no native tools were used, then reran the evals. The correction lifted GPT-5.5 by 11% to the top slot and showed how much harness settings can move coding-agent results.


TL;DR

  • ValsAI said a score-discrepancy review traced the Terminal Bench 2 deltas to undocumented `tool_choice` behavior when the harness was not using native tools (ValsAI's investigation thread).
  • In the setup Vals was testing, the benchmark expected command batches in JSON, so its harness sent `"tool_choice": "none"` under the assumption that this matched omitting tools entirely, an assumption Vals said the docs implied (ValsAI on tool_choice and docs).
  • After rerunning, Vals moved GPT-5.5 to the top spot on Terminal Bench 2 with an 11% improvement, while keeping it at No. 2 on the broader Vals Index (ValsAI's updated GPT-5.5 results).
  • The post-correction episode also turned into a live demo of harness sensitivity: Vtrivedy10's follow-up said a single steering instruction was worth roughly a 12% swing on the same benchmark.

OpenAI's function-calling guide lists auto, required, and forced-function modes, but Vals says its rerun was triggered by behavior around "tool_choice": "none" that the docs did not spell out. Harbor's Terminus-2 docs describe the benchmark's reference agent as a mono-tool tmux setup rather than a native tool-calling harness. Vals' refreshed GPT-5.5 scorecard now shows Terminal-Bench 2 at 73.20%, rank 1, while the tweet announcing the rerun says the model remains rank 2 on the Vals Index.

`tool_choice: "none"`

Vals framed the bug as a mismatch between documented expectations and actual benchmark behavior. According to ValsAI on tool_choice and docs, its harness passed `"tool_choice": "none"` because Terminal Bench 2 was not using native tools, and Vals believed that setting was equivalent to `tools: []` or omitting the parameter entirely.

OpenAI's function-calling guide documents auto, required, and forced-function control, but the public page Exa surfaced does not describe a `none` option. That gap is the heart of the story: the benchmark delta came from a harness flag that looked semantically harmless.
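To make the mismatch concrete, here is a minimal sketch of the two requests a harness could send. The payloads are illustrative dicts only (nothing is sent to any API, and the helper name is hypothetical); the point is that `"tool_choice": "none"` and omitting the parameter produce different requests on the wire, so a provider can legitimately treat them differently.

```python
# Sketch (assumption: illustrative chat-completions-style payloads,
# not a real API call). Vals assumed `tool_choice: "none"` matched
# omitting tools; the two requests are nonetheless distinct.

def harness_request(prompt: str, suppress_tools: bool) -> dict:
    """Build a request payload the way an eval harness might."""
    payload = {
        "model": "gpt-5.5",  # model name as reported by Vals
        "messages": [{"role": "user", "content": prompt}],
    }
    if suppress_tools:
        # Vals' original harness: explicitly disable native tool calling.
        payload["tool_choice"] = "none"
    # Otherwise: omit both `tools` and `tool_choice` entirely.
    return payload

explicit = harness_request("ls /tmp", suppress_tools=True)
omitted = harness_request("ls /tmp", suppress_tools=False)

# Both mean "no native tools" to a human, but the payloads differ.
assert explicit != omitted
assert "tool_choice" not in omitted
```

Because the benchmark never defined any tools, the flag looked like a no-op, which is exactly why the behavioral difference went unnoticed until the score review.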

Terminus-2

The rerun only makes sense in the context of how Terminal Bench 2 works. As ValsAI's Terminus-2 setup note puts it, the default Terminus-2 harness does not use native tool calling, and instead expects the model to emit terminal commands directly in the response body as JSON.

That lines up with Harbor's Terminus-2 documentation, which describes a single interactive tmux session as the core interface. Harbor's prompt template on GitHub also shows the model being asked to return structured JSON with analysis, plan, and command batches.

GPT-5.5 moves to #1

Once Vals reran the evals, GPT-5.5 jumped to No. 1 on Terminal Bench 2. ValsAI's updated GPT-5.5 results called the lift an 11% improvement and said the model still sat at No. 2 on the Vals Index.

The attached scorecard and the linked Vals model page put the updated Terminal-Bench 2 number at 73.20% ± 3.98, rank 1 of 59. The same page lists the Vals Index result at 70.77% ± 1.67, rank 2 of 48, along with a 1M-token context window and 128k max output tokens.
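As a quick sanity check on the arithmetic, the 73.20% figure and the reported "11% improvement" can be combined to back out the implied pre-correction score. The post does not say whether the 11% is relative or absolute, so this sketch computes both readings (an assumption, not a claim about which convention Vals used).

```python
# Back out the pre-correction score implied by an 11% lift to 73.20%.
# Assumption: the "11%" is either a relative improvement or an
# absolute percentage-point gain; the source does not specify which.

post_fix = 73.20

prior_if_relative = post_fix / 1.11   # 11% relative improvement
prior_if_absolute = post_fix - 11.0   # 11-point absolute improvement

assert round(prior_if_relative, 1) == 65.9
assert round(prior_if_absolute, 1) == 62.2
```

Either way, the implied pre-fix score sits within or near the reported ± 3.98 margin of the corrected number, which underlines how large the harness effect was relative to the benchmark's own error bars.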

A single steering instruction moved the score too

The follow-up reaction from Vals cofounder Varun Trivedi made the benchmark sensitivity even starker. In Vtrivedy10's follow-up, he wrote that a single steering instruction with GPT-5.5 produced about a 12% change in Terminal Bench score.

That does not change ValsAI's claim about the `tool_choice` bug. It adds a second finding: even after the rerun, coding-agent benchmarks remain highly sensitive to harness and prompt details, especially on tasks where the model is judged through a structured agent wrapper rather than on raw next-token capability.
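The kind of sensitivity Trivedi describes amounts to an A/B check over the same task set. Everything below (the toy scorer and the pass/fail outcomes) is hypothetical and only illustrates how a swing of roughly 12 points between two prompt variants would be measured; a real harness would run the model on each task.

```python
# Hypothetical sketch: measure the score swing between two runs of the
# same task set that differ only by one steering instruction. The
# outcomes below are invented to illustrate the measurement, not data.

def score(task_results: list[bool]) -> float:
    """Fraction of tasks passed, as a percentage."""
    return 100.0 * sum(task_results) / len(task_results)

# Toy outcomes for the same 25 tasks under two prompt variants.
baseline = [True] * 15 + [False] * 10  # 60% without the instruction
steered = [True] * 18 + [False] * 7    # 72% with one steering line

swing = score(steered) - score(baseline)
assert swing == 12.0  # a single-instruction, 12-point swing
```

On a benchmark whose error bars are already ± 3.98, a single-line prompt change of this size dwarfs the statistical noise, which is the point of Trivedi's follow-up.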
