ValsAI updates Terminal Bench 2 after `tool_choice` bug, moving GPT-5.5 to #1 with +11%
ValsAI found that undocumented `tool_choice` behavior was skewing Terminal Bench 2 scores when no native tools were used, then reran the evals. The correction lifted GPT-5.5 by 11% to the top slot and showed how much harness settings can move coding-agent results.

TL;DR
- ValsAI said a score discrepancy review traced Terminal Bench 2 deltas to undocumented `tool_choice` behavior when the harness was not using native tools (ValsAI's investigation thread).
- In the setup Vals was testing, the benchmark expected command batches in JSON, so its harness sent `"tool_choice": "none"` under the assumption that this matched omitting tools, an assumption Vals said the docs implied (ValsAI on tool_choice and docs).
- After rerunning, Vals moved GPT-5.5 to the top spot on Terminal Bench 2 with an 11% improvement, while keeping it at No. 2 on the broader Vals Index (ValsAI's updated GPT-5.5 results).
- The post-correction episode also turned into a live demo of harness sensitivity: Vtrivedy10's follow-up said a single steering instruction was worth roughly a 12% swing on the same benchmark.
OpenAI's function-calling guide lists auto, required, and forced-function modes, but Vals says its rerun was triggered by behavior around `"tool_choice": "none"` that the docs did not spell out. Harbor's Terminus-2 docs describe the benchmark's reference agent as a mono-tool tmux setup rather than a native tool-calling harness. Vals' refreshed GPT-5.5 scorecard now shows Terminal-Bench 2 at 73.20%, rank 1, while the tweet announcing the rerun says the model remains rank 2 on the Vals Index.
`tool_choice: "none"`
Vals framed the bug as a mismatch between documented expectations and actual benchmark behavior. According to ValsAI on tool_choice and docs, its harness passed `"tool_choice": "none"` because Terminal Bench 2 was not using native tools, and Vals believed that setting was equivalent to `tools: []` or omitting the parameter.
OpenAI's function-calling guide documents auto, required, and forced-function control, but the public page Exa surfaced does not describe a none option. That gap is the whole story here, because the benchmark delta came from a harness flag that looked semantically harmless.
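To make the ambiguity concrete, here is a minimal sketch of the two request shapes Vals describes as (intendedly) equivalent, assuming an OpenAI-style Chat Completions client; the model name and messages are placeholders, and whether the first form is rejected, ignored, or quietly changes behavior is exactly the undocumented part Vals is pointing at.

```python
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "Reply with a JSON batch of terminal commands."},
    {"role": "user", "content": "List the files in the current directory."},
]

# Shape A: what the harness reportedly sent. No tools are attached, but
# tool_choice is still pinned to "none". Depending on the endpoint, this may
# be rejected or may subtly alter the response; that ambiguity is the bug
# Vals describes.
resp_a = client.chat.completions.create(
    model="gpt-5.5",        # placeholder model name
    messages=messages,
    tool_choice="none",     # the flag at the center of the discrepancy
)

# Shape B: what Vals assumed Shape A was equivalent to, namely not mentioning
# tools or tool_choice at all.
resp_b = client.chat.completions.create(
    model="gpt-5.5",        # placeholder model name
    messages=messages,
)

print(resp_a.choices[0].message.content)
print(resp_b.choices[0].message.content)
```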
Terminus-2
The rerun only makes sense in the context of how Terminal Bench 2 works. As ValsAI's Terminus-2 setup note puts it, the default Terminus-2 harness does not use native tool calling, and instead expects the model to emit terminal commands directly in the response body as JSON.
That lines up with Harbor's Terminus-2 documentation, which describes a single interactive tmux session as the core interface. Harbor's prompt template on GitHub also shows the model being asked to return structured JSON with analysis, plan, and command batches.
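For illustration, here is a hypothetical example of the response shape that description implies, and how a harness might consume it without any native tool calls. The field names (`analysis`, `plan`, `commands`, `keystrokes`, `timeout_sec`) are assumptions loosely modeled on the "analysis, plan, and command batches" wording, not the actual Harbor schema.

```python
import json

# Hypothetical model reply: plain JSON in the response body, as the Terminus-2
# prompt template is described as requesting. Field names are assumptions.
raw_reply = """
{
  "analysis": "The repo has no build artifacts yet.",
  "plan": "Install dependencies, then run the test suite.",
  "commands": [
    {"keystrokes": "pip install -r requirements.txt\\n", "timeout_sec": 120},
    {"keystrokes": "pytest -q\\n", "timeout_sec": 300}
  ]
}
"""

batch = json.loads(raw_reply)
for cmd in batch["commands"]:
    # In the real harness, these keystrokes would be fed to the single
    # interactive tmux session Harbor describes as the core interface.
    print(f"would type: {cmd['keystrokes']!r} (timeout {cmd['timeout_sec']}s)")
```

The point of the sketch is that everything the model "does" flows through text it emits and the harness parses, which is why a request flag like `tool_choice` can move scores even when no native tools are ever declared.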
GPT-5.5 moves to #1
Once Vals reran the evals, GPT-5.5 jumped to No. 1 on Terminal Bench 2. ValsAI's updated GPT-5.5 results called the lift an 11% improvement and said the model still sat at No. 2 on the Vals Index.
The attached scorecard and the linked Vals model page put the updated Terminal-Bench 2 number at 73.20% plus or minus 3.98, rank 1 out of 59. The same page lists the Vals Index result at 70.77% plus or minus 1.67, rank 2 out of 48, plus a 1M context window and 128k max output.
A single steering instruction moved the score too
The follow-up reaction from Vals cofounder Varun Trivedi made the benchmark sensitivity even starker. In Vtrivedy10's follow-up, he wrote that a single steering instruction with GPT-5.5 produced about a 12% change in Terminal Bench score.
That does not change ValsAI's claim about the tool_choice bug. It adds a second finding: even after the rerun, coding-agent benchmarks are still extremely exposed to harness and prompt details, especially on tasks where the model is being judged through a structured agent wrapper rather than raw next-token capability.