Breaking | March 23, 2026

Vals AI updates SWE-Bench Verified harness to mini-swe-agent and score slips to 78.8%

Vals AI switched SWE-Bench Verified from SWE-Agent to the bash-only mini-swe-agent harness, aligning results more closely with the official benchmark setup. Top score dipped slightly to 78.8%, but the change reduces harness-specific confounds when comparing models.

TL;DR

Vals AI switched SWE-Bench Verified from SWE-Agent to the bash-only mini-swe-agent harness and framed the change in its announcement as a cleaner way to measure model capability.
The reason for the swap is methodological: the thread says richer harnesses can boost scores but also "confound underlying model capabilities" with harness-specific adaptation.
In its note on mini-swe-agent, Vals says the switch aligns the eval with the official SWE-bench leaderboard's default harness and keeps agents to "standard command line tools."
The top of the scoreboard barely moved after the switch: according to Vals AI's results update, the top score slipped from 79.2% to 78.8%, while the average rose from 63.8% to 65.9%.

Why Vals AI changed the harness

Vals AI changed the harness to reduce benchmark-specific scaffolding in its SWE-Bench Verified runs. In the follow-up post, the team says more complex harnesses can improve results, but those gains can blur whether a model is genuinely better at repository repair or simply better tuned to a particular agent stack.

The replacement is deliberately narrower. Vals AI's explanation describes mini-swe-agent as a "neutral evaluation setup" that tests models using only standard command-line tools, and says that choice also brings Vals closer to the official SWE-bench leaderboard's default harness. For engineers comparing coding models across leaderboards, that makes Vals' numbers easier to map to the benchmark's baseline setup.
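To make "bash-only" concrete, here is a minimal sketch of what such a loop can look like. It is not mini-swe-agent's code or API: the function `run_bash_only_episode`, the `propose_command` callback standing in for the model, and the scripted stub are all hypothetical names chosen for illustration. The only tool the "agent" gets is a shell command whose output is fed back to it.

```python
import subprocess
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (command, combined output) pairs


def run_bash_only_episode(
    propose_command: Callable[[History], str],
    max_steps: int = 5,
    timeout_s: float = 10.0,
) -> History:
    """Hypothetical bash-only loop: the sole tool is a shell command.

    propose_command stands in for the model. It receives the history so far
    and returns the next command, or an empty string to stop.
    """
    history: History = []
    for _ in range(max_steps):
        command = propose_command(history)
        if not command:
            break
        # Run the command in a shell and capture everything it prints.
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
        history.append((command, result.stdout + result.stderr))
    return history


def scripted_model(history: History) -> str:
    """Stub 'model' that replays a fixed script instead of calling an LLM."""
    script = ["echo hello", "echo done"]
    return script[len(history)] if len(history) < len(script) else ""


if __name__ == "__main__":
    for command, output in run_bash_only_episode(scripted_model):
        print(f"$ {command}\n{output}", end="")
```

Because the loop exposes nothing beyond the shell, any difference between two models run through it comes from the commands they choose rather than from custom tools the harness supplies, which is the kind of confound the switch is meant to reduce.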

How much the scores moved

The harness change did not materially reshuffle results. Vals reports in its results update that performance changed by only "a few percentage points for most providers," with the best score edging down from 79.2% to 78.8%, a drop of 0.4 percentage points.

The more surprising change was in the average. The same results update says the average score increased from 63.8% to 65.9%, a gain of 2.1 percentage points, which suggests the simpler harness did not uniformly depress outcomes across vendors.
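The arithmetic behind those two movements is small enough to check by hand, but a short sketch makes the distinction between percentage points and relative change explicit. The four scores are the ones reported above; the dictionary and helper names are illustrative only.

```python
# Scores quoted in this article, in percent: (before, after) the harness switch.
SCORES = {
    "top": (79.2, 78.8),
    "average": (63.8, 65.9),
}


def point_delta(before: float, after: float) -> float:
    """Change in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


def relative_change(before: float, after: float) -> float:
    """Change relative to the starting score, in percent, one decimal place."""
    return round(100 * (after - before) / before, 1)


deltas = {name: point_delta(*pair) for name, pair in SCORES.items()}
relative = {name: relative_change(*pair) for name, pair in SCORES.items()}

if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: {deltas[name]:+.1f} points ({relative[name]:+.1f}% relative)")
```

The top score moves by -0.4 points (about -0.5% relative), while the average moves by +2.1 points (about +3.3% relative), so the two headline numbers shifted in opposite directions.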

Vals says in the closing post that the full results are available on its website, but the thread's headline takeaway is narrower: switching to a bash-only harness changed the absolute top line only slightly while reducing one source of harness-specific variance in SWE-Bench Verified comparisons.