AI Primer

ValsAI updates Terminal Bench 2 after `tool_choice` bug, moving GPT-5.5 to #1 with +11%

ValsAI found that undocumented `tool_choice` behavior was skewing Terminal Bench 2 scores when no native tools were used, then reran the evals. The correction lifted GPT-5.5 by 11% to the top slot and showed how much harness settings can move coding-agent results.


TL;DR

  • ValsAI said a score-discrepancy review traced the Terminal Bench 2 deltas to undocumented `tool_choice` behavior when the harness was not using native tools (ValsAI's investigation thread).
  • In the setup Vals was testing, the benchmark expected command batches in JSON, so its harness sent `"tool_choice": "none"` under the assumption that this matched omitting tools entirely, an assumption Vals said the docs implied (ValsAI on tool_choice and docs).
  • After rerunning, Vals moved GPT-5.5 to the top spot on Terminal Bench 2 with an 11% improvement, while keeping it at No. 2 on the broader Vals Index (ValsAI's updated GPT-5.5 results).
  • The post-correction episode also turned into a live demo of harness sensitivity: Vtrivedy10's follow-up said a single steering instruction was worth roughly a 12% swing on the same benchmark.

OpenAI's function-calling guide lists auto, required, and forced-function modes, but Vals says its rerun was triggered by behavior around "tool_choice": "none" that the docs did not spell out. Harbor's Terminus-2 docs describe the benchmark's reference agent as a mono-tool tmux setup rather than a native tool-calling harness. Vals' refreshed GPT-5.5 scorecard now shows Terminal-Bench 2 at 73.20%, rank 1, while the tweet announcing the rerun says the model remains rank 2 on the Vals Index.

`tool_choice: "none"`

Vals framed the bug as a mismatch between documented expectations and actual benchmark behavior. According to ValsAI on tool_choice and docs, its harness passed `"tool_choice": "none"` because Terminal Bench 2 was not using native tools, and Vals believed that setting was equivalent to `tools: []` or omitting the parameter entirely.

OpenAI's function-calling guide documents auto, required, and forced-function control, but the public page Exa surfaced does not describe a `none` option. That gap is the heart of the story: the benchmark delta came from a harness flag that looked semantically harmless.
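To make the mismatch concrete, here is a minimal sketch of the two requests a harness could send. The payloads are illustrative dicts only (nothing is sent to any API, and the helper name is hypothetical); the point is that `"tool_choice": "none"` and omitting the parameter produce different requests on the wire, so a provider can legitimately treat them differently.

```python
# Sketch (assumption: illustrative chat-completions-style payloads,
# not a real API call). Vals assumed `tool_choice: "none"` matched
# omitting tools; the two requests are nonetheless distinct.

def harness_request(prompt: str, suppress_tools: bool) -> dict:
    """Build a request payload the way an eval harness might."""
    payload = {
        "model": "gpt-5.5",  # model name as reported by Vals
        "messages": [{"role": "user", "content": prompt}],
    }
    if suppress_tools:
        # Vals' original harness: explicitly disable native tool calling.
        payload["tool_choice"] = "none"
    # Otherwise: omit both `tools` and `tool_choice` entirely.
    return payload

explicit = harness_request("ls /tmp", suppress_tools=True)
omitted = harness_request("ls /tmp", suppress_tools=False)

# Both mean "no native tools" to a human, but the payloads differ.
assert explicit != omitted
assert "tool_choice" not in omitted
```

Because the benchmark never defined any tools, the flag looked like a no-op, which is exactly why the behavioral difference went unnoticed until the score review.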

Terminus-2

The rerun only makes sense in the context of how Terminal Bench 2 works. As ValsAI's Terminus-2 setup note puts it, the default Terminus-2 harness does not use native tool calling, and instead expects the model to emit terminal commands directly in the response body as JSON.

That lines up with Harbor's Terminus-2 documentation, which describes a single interactive tmux session as the core interface. Harbor's prompt template on GitHub also shows the model being asked to return structured JSON with analysis, plan, and command batches.

GPT-5.5 moves to #1

Once Vals reran the evals, GPT-5.5 jumped to No. 1 on Terminal Bench 2. ValsAI's updated GPT-5.5 results called the lift an 11% improvement and said the model still sat at No. 2 on the Vals Index.

The attached scorecard and the linked Vals model page put the updated Terminal-Bench 2 number at 73.20% ± 3.98, rank 1 of 59. The same page lists the Vals Index result at 70.77% ± 1.67, rank 2 of 48, along with a 1M-token context window and 128k max output tokens.
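As a quick sanity check on the arithmetic, the 73.20% figure and the reported "11% improvement" can be combined to back out the implied pre-correction score. The post does not say whether the 11% is relative or absolute, so this sketch computes both readings (an assumption, not a claim about which convention Vals used).

```python
# Back out the pre-correction score implied by an 11% lift to 73.20%.
# Assumption: the "11%" is either a relative improvement or an
# absolute percentage-point gain; the source does not specify which.

post_fix = 73.20

prior_if_relative = post_fix / 1.11   # 11% relative improvement
prior_if_absolute = post_fix - 11.0   # 11-point absolute improvement

assert round(prior_if_relative, 1) == 65.9
assert round(prior_if_absolute, 1) == 62.2
```

Either way, the implied pre-fix score sits within or near the reported ± 3.98 margin of the corrected number, which underlines how large the harness effect was relative to the benchmark's own error bars.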

A single steering instruction moved the score too

The follow-up reaction from Vals cofounder Varun Trivedi made the benchmark sensitivity even starker. In Vtrivedy10's follow-up, he wrote that a single steering instruction with GPT-5.5 produced about a 12% change in Terminal Bench score.

That does not change ValsAI's claim about the `tool_choice` bug. It adds a second finding: even after the rerun, coding-agent benchmarks remain highly sensitive to harness and prompt details, especially on tasks where the model is judged through a structured agent wrapper rather than on raw next-token capability.
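The kind of sensitivity Trivedi describes amounts to an A/B check over the same task set. Everything below (the toy scorer and the pass/fail outcomes) is hypothetical and only illustrates how a swing of roughly 12 points between two prompt variants would be measured; a real harness would run the model on each task.

```python
# Hypothetical sketch: measure the score swing between two runs of the
# same task set that differ only by one steering instruction. The
# outcomes below are invented to illustrate the measurement, not data.

def score(task_results: list[bool]) -> float:
    """Fraction of tasks passed, as a percentage."""
    return 100.0 * sum(task_results) / len(task_results)

# Toy outcomes for the same 25 tasks under two prompt variants.
baseline = [True] * 15 + [False] * 10  # 60% without the instruction
steered = [True] * 18 + [False] * 7    # 72% with one steering line

swing = score(steered) - score(baseline)
assert swing == 12.0  # a single-instruction, 12-point swing
```

On a benchmark whose error bars are already ± 3.98, a single-line prompt change of this size dwarfs the statistical noise, which is the point of Trivedi's follow-up.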
