
OpenHands launches EvoClaw, and continuous-evolution scores top out at 38.03%

OpenHands introduced EvoClaw, a benchmark that reconstructs milestone DAGs from repo history to test continuous software evolution instead of isolated tasks. The first results show agents can clear single tasks yet still collapse under regressions and technical debt over longer runs.


TL;DR

  • OpenHands introduced EvoClaw as a benchmark for “continuous software evolution,” with the launch thread framing the core problem as whether agents can keep a real codebase healthy over time rather than just finish isolated coding tasks.
  • According to the milestone post, EvoClaw rebuilds milestone DAGs from real repository history because single commits are “too noisy” and release-sized chunks are “too coarse,” making each task executable and dependent on prior work.
  • The first leaderboard numbers in the results post show a sharp drop from isolated-task performance above 80% to a best continuous-evolution score of 38.03%, with Claude Opus 4.6 plus OpenHands leading overall and Gemini 3 Pro plus Gemini CLI posting the highest resolve rate at 13.37%.
  • OpenHands says in the failure analysis and the error-chain post that long runs stall not so much because agents stop adding features as because precision flattens early, letting regressions, technical debt, and unresolved failures propagate through later milestones.

What is EvoClaw actually testing?

OpenHands' launch thread positions EvoClaw as a shift away from one-shot coding evals and toward the harder question of codebase maintenance across evolving requirements. The benchmark is built with OpenHands and evaluates agents on sequences of changes derived from real repositories rather than synthetic standalone tickets.

The design choice in the milestone post is to reconstruct milestone DAGs from repo history, so tasks are “meaningful, executable, and dependent on what came before.” That gives the benchmark a way to score not just whether an agent can land a patch, but whether it can extend a project without breaking earlier work. OpenHands' blog, paper, and leaderboard expand that setup beyond the thread.
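OpenHands hasn't published the reconstruction pipeline in the thread itself, but the basic shape is easy to picture. Here is a minimal sketch, with hypothetical milestone names and a made-up Milestone structure rather than EvoClaw's actual data model, of what "dependent on what came before" means in practice: milestones form a DAG, and an agent is only handed a milestone after everything it builds on has been attempted, against whatever codebase state those earlier attempts left behind.

    from dataclasses import dataclass, field
    from graphlib import TopologicalSorter  # Python 3.9+ standard library

    # Hypothetical sketch, not EvoClaw's internals: a milestone is an executable
    # task whose checks run on the codebase state left by its dependencies.
    @dataclass
    class Milestone:
        name: str
        depends_on: list[str] = field(default_factory=list)

    milestones = {
        "m1-cli-skeleton":    Milestone("m1-cli-skeleton"),
        "m2-config-loading":  Milestone("m2-config-loading", depends_on=["m1-cli-skeleton"]),
        "m3-plugin-system":   Milestone("m3-plugin-system", depends_on=["m2-config-loading"]),
        "m4-docs-site":       Milestone("m4-docs-site", depends_on=["m2-config-loading"]),
    }

    # Dependency-respecting order: the agent never sees m3 before m2 has been
    # attempted, so any damage done at m2 is still present when m3 starts.
    order = TopologicalSorter({n: m.depends_on for n, m in milestones.items()}).static_order()
    for name in order:
        print(f"run agent on {name}, then execute that milestone's checks")

The point of the DAG, as the milestone post frames it, is that the unit of evaluation is neither a single commit nor a whole release but something in between that can still be executed and still inherits earlier state.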

Why do the first results matter for coding agents?

The headline number from the results post is that continuous evolution is much harder than isolated tasks: scores that can exceed 80% on single tasks fall to a best overall score of 38.03% once agents have to operate across dependent milestones. OpenHands also separates “overall score” from “resolve rate,” with Claude Opus 4.6 plus OpenHands topping the former and Gemini 3 Pro plus Gemini CLI reaching the highest resolve rate at 13.37%.

The failure mode matters as much as the score. In the failure analysis, OpenHands says recall “keeps climbing,” meaning agents still add requested functionality, but precision “saturates much earlier,” so regressions accumulate faster than they can be repaired. The error-chain post adds that unresolved failures then propagate downstream through milestone dependencies, which makes long-horizon coding look less like a code-generation problem and more like a system-health problem. OpenHands' resource roundup calls EvoClaw “the benchmark to watch” for long-running coding agents, mainly because it exposes this maintenance gap directly.
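The threads don't include scoring code, so the sketch below is only an illustration under two assumptions that are mine, not OpenHands': that recall can be read as the share of a milestone's newly required checks that pass, precision as the share of previously passing checks that still pass, and that a failure recorded at one milestone counts against everything downstream of it in the DAG. Under those assumptions, the reported pattern of climbing recall, flat precision, and chained failures falls out directly.

    # Illustrative only; the metric definitions here are assumptions, not EvoClaw's.
    def milestone_metrics(new_checks: dict[str, bool], old_checks: dict[str, bool]) -> tuple[float, float]:
        # recall ~ how much newly requested functionality landed this milestone
        recall = sum(new_checks.values()) / max(len(new_checks), 1)
        # precision ~ how much previously working behavior survived the change
        precision = sum(old_checks.values()) / max(len(old_checks), 1)
        return recall, precision

    def propagate_failures(dag: dict[str, list[str]], failed: set[str]) -> set[str]:
        # Every milestone that transitively depends on a failed one inherits the damage.
        tainted = set(failed)
        changed = True
        while changed:
            changed = False
            for node, deps in dag.items():
                if node not in tainted and any(d in tainted for d in deps):
                    tainted.add(node)
                    changed = True
        return tainted

    dag = {"m1": [], "m2": ["m1"], "m3": ["m2"], "m4": ["m2"]}
    print(propagate_failures(dag, failed={"m2"}))  # {'m2', 'm3', 'm4'}: one regression taints two later milestones

That chaining is the crux of the system-health framing: a single unrepaired regression is scored again at every milestone that sits on top of it, which is how precision can flatten even while recall keeps improving.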
