AI Primer

Kilo Code adds Terminal Bench scores and average attempt cost to model picker

Kilo Code now shows Terminal Bench completion rate and average attempt cost directly in model details inside its CLI and VS Code extension. It matters because the numbers come from Kilo's own harness and retry logic rather than public leaderboard scaffolds.


TL;DR

  • Kilo Code now surfaces two new model selection fields, Terminal Bench completion rate and average attempt cost, directly inside its CLI and VS Code extension, according to Kilo's feature post.
  • The benchmark number is not a pasted public leaderboard score. As Kilo's harness note puts it, the displayed result comes from Kilo's own tool pipeline, context management, and retry logic.
  • The example screenshot in Kilo's Terminal Bench comparison shows GPT-5.5 at 74.1% completion and Kimi K2.6 at 54.4% on terminal coding tasks run through that harness.
  • The broader bet, stated in Kilo's rollout thread, is that benchmark and cost data should live in the model picker instead of a separate tab, with deeper tables available through Kilo's leaderboard and a linked write-up.

Kilo also published a short demo, Kilo's 40-second walkthrough, and the linked materials go beyond the UI change. You can open the full table, read the write-up, and dig into a separate cost breakdown that argues most agent steps do not need frontier-priced models.

Model picker

The shipped change is small and useful. Model details now include two operational numbers in the same panel as price and context limits, per Kilo's feature post.

Those two numbers are:

  • Terminal Bench score, a completion rate
  • Average attempt cost, defined by Kilo's feature post as what each run actually cost Kilo

That makes the picker less of a memory test. Kilo's launch post framed the old workflow as guessing from half-remembered leaderboard scores and list prices.

Kilo harness

The more important detail is where the score comes from. Kilo's harness note says these are not public Terminal Bench 2.0 numbers from optimized scaffolds, but measurements from Kilo's own harness.

Kilo says that harness includes its tool pipeline, context management, and retry logic, per Kilo's harness note. That makes the score closer to a product-specific eval than a generic model card stat.

The example Kilo chose to show in Kilo's Terminal Bench comparison was:

  • GPT-5.5: 74.1%
  • Kimi K2.6: 54.4%

In Kilo's Terminal Bench comparison, Kilo explicitly described those as real terminal-based coding tasks rather than synthetic puzzles.

Average attempt cost

Kilo is pairing the benchmark number with a cost number for a reason. In Kilo's cost thread, the company contrasted a $15 run on Opus with a $1.70 run on a cheaper model for the same task, then linked to a fuller breakdown of where agent spend goes.

That same framing shows up in Kilo's rollout thread, which pitches the picker as the place where capability and cost should meet. The product change is not just more benchmark visibility; it is an attempt to collapse score, pricing, and real run cost into one selection surface.
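One way to read the two picker fields together is to fold them into a single figure. The sketch below divides average attempt cost by completion rate to estimate cost per completed task. That derived number is an assumption made here for illustration, not a metric Kilo publishes, and it assumes a failed attempt is simply retried at the same average cost. The function names are hypothetical. The only real figures used are the $15 and $1.70 runs from Kilo's cost thread.

```python
def cost_per_completed_task(avg_attempt_cost: float, completion_rate: float) -> float:
    """Estimate spend per successful task (illustrative, not a Kilo metric).

    Assumes each attempt succeeds independently with probability
    `completion_rate`, so the expected number of attempts is 1 / rate.
    """
    if avg_attempt_cost < 0:
        raise ValueError("avg_attempt_cost must be non-negative")
    if not 0.0 < completion_rate <= 1.0:
        raise ValueError("completion_rate must be in (0, 1]")
    return avg_attempt_cost / completion_rate


def price_ratio(expensive_run: float, cheap_run: float) -> float:
    """How many times more the expensive run cost than the cheap one."""
    if cheap_run <= 0:
        raise ValueError("cheap_run must be positive")
    return expensive_run / cheap_run


if __name__ == "__main__":
    # Figures from Kilo's cost thread: $15 on Opus vs $1.70 on a cheaper model.
    print(f"{price_ratio(15.00, 1.70):.1f}x")  # about 8.8x

    # Hypothetical inputs, chosen only to show the arithmetic:
    # a $1.00 average attempt at a 50% completion rate.
    print(cost_per_completed_task(1.00, 0.50))  # 2.0
```

A lower completion rate raises the effective cost of a cheap model under this assumption, which is why the two numbers are more useful side by side than either is alone.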

Leaderboard cadence

Kilo's public leaderboard is doing more than mirroring the in-product fields. In Kilo's efficiency-frontier reply, Kilo says its Kilo Bench view plots completion rate against cost per attempt to show an efficiency frontier, and that the data refreshes every five minutes.

Kilo also named a few models it saw as value plays at the time, including MiMo-V2.5-Pro, MiniMax M3, GLM 5.1, and Grok Build 0.1, per Kilo's efficiency-frontier reply. The in-editor numbers are the thin slice. The leaderboard behind them is the moving reference layer.
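For readers unfamiliar with the term, an efficiency frontier over completion rate and cost per attempt can be sketched as a Pareto filter: keep every model that no other model beats on both axes at once. The code below is a minimal illustration of that general idea, not Kilo's implementation. The `ModelPoint` type, the function name, and the sample models and numbers are all hypothetical placeholders, not leaderboard data.

```python
from typing import List, NamedTuple


class ModelPoint(NamedTuple):
    name: str
    completion_rate: float   # fraction of tasks completed, 0.0 to 1.0
    cost_per_attempt: float  # average cost of one attempt, in USD


def efficiency_frontier(points: List[ModelPoint]) -> List[ModelPoint]:
    """Return the models that are not dominated on (cost, completion rate).

    A model is dropped when another model is at least as cheap and
    completes at least as many tasks, with one of the two strictly better.
    """
    # Cheapest first; among equal costs, the higher completion rate first.
    ordered = sorted(points, key=lambda p: (p.cost_per_attempt, -p.completion_rate))
    frontier: List[ModelPoint] = []
    best_rate = float("-inf")
    for point in ordered:
        # A costlier model earns its place only by completing more tasks.
        if point.completion_rate > best_rate:
            frontier.append(point)
            best_rate = point.completion_rate
    return frontier


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = [
        ModelPoint("model-a", 0.50, 1.00),
        ModelPoint("model-b", 0.70, 5.00),
        ModelPoint("model-c", 0.40, 2.00),  # pricier and weaker than model-a
        ModelPoint("model-d", 0.70, 6.00),  # same rate as model-b, costs more
    ]
    for point in efficiency_frontier(sample):
        print(point.name)
```

With these placeholder inputs, only model-a and model-b survive. The "value plays" in a view like this are the models that sit on or near that surviving edge.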
