OpenHands compares 3 skill tasks and finds some skills reduce agent pass rates
OpenHands published a skill-eval recipe with bounded tasks, deterministic verifiers, and no-skill baselines, then showed that some skills speed agents up while others make them brittle. Teams shipping skill libraries should measure them per task and per model before rollout.

TL;DR
- OpenHands' skills-eval thread argues that teams shipping agent skills need a real baseline: each evaluation should use a bounded task, a deterministic pass/fail verifier, and a no-skill comparison, because without that baseline, "success tells you almost nothing."
- In OpenHands' dependency audit result, one skill was clearly worth keeping: the dependency-audit task went from 0% pass rate without the skill to 100% with it, while runtime fell from 266 seconds to 109 seconds.
- The financial extraction result makes a weaker case, positioning skills as guardrails rather than breakthroughs: models were already passing 90% of the time, and the skill only pushed that to 100% by adding exact formulas and instructions to use Python for arithmetic.
- OpenHands' sales pivot result also surfaced failure modes: overall pass rate rose from 70% to 80%, but some models regressed because the skill pushed them into a "brittle path," reinforcing that skill quality is task-dependent and model-dependent.
What did OpenHands actually propose for skill evaluation?
OpenHands' core claim is simple: a skill is not useful just because an agent completes a task after you add it. Its evaluation recipe says a credible test needs three parts: "a bounded task," "a deterministic verifier," and "a no-skill baseline." That setup is meant to isolate whether the skill changed outcomes, rather than rewarding prompt stuffing or unconstrained demos.
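The shape of that setup fits in a few lines. Below is a minimal sketch, assuming a hypothetical `run_agent` callable and a toy string-match verifier; neither reflects OpenHands' actual API, only the structure of bounded task plus deterministic check plus no-skill baseline.

```python
def verify(output: str, expected_total: str = "1234.56") -> bool:
    # Deterministic pass/fail: the report must contain the known-correct figure.
    return expected_total in output

def pass_rate(run_agent, task_prompt: str, skill: str | None, trials: int = 10) -> float:
    # Run the same bounded task repeatedly and score it with the same verifier.
    passes = 0
    for _ in range(trials):
        prompt = f"{skill}\n\n{task_prompt}" if skill else task_prompt
        passes += verify(run_agent(prompt))
    return passes / trials

# The comparison that matters: identical task, identical verifier, skill on vs. off.
# baseline   = pass_rate(run_agent, TASK, skill=None)
# with_skill = pass_rate(run_agent, TASK, skill=SKILL_TEXT)
```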
The team packaged that into a public tutorial repo and a blog post that walk through running the same task with and without a skill. The examples span dependency audits, financial-report extraction, and sales analysis, which makes the project more of an evaluation recipe than a single benchmark.
Where did skills help, and where did they hurt?
The strongest result came from the dependency-audit task. In OpenHands' dependency audit result, the agent had to inspect a package-lock.json and produce a vulnerability report. Without the skill, pass rate was 0%. With it, pass rate hit 100%, and runtime dropped by more than half, from 266 seconds to 109 seconds. OpenHands' explanation is that some skills are "essential" because they encode the workflow the task actually requires.
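To make the "deterministic verifier" concrete for this task, here is a hedged sketch: it checks that the agent's report names a set of hypothetical known-vulnerable packages and that those packages really are pinned in the lockfile. The package names and the npm v2/v3 lockfile parsing are illustrative assumptions, not OpenHands' actual checker.

```python
import json

# Hypothetical expected findings for the fixture; the real lockfile and
# vulnerability list live in OpenHands' tutorial repo.
EXPECTED_VULNERABLE = {"lodash", "minimist"}

def locked_packages(lockfile_path: str) -> set[str]:
    # Package names actually pinned in an npm v2/v3 package-lock.json.
    with open(lockfile_path) as f:
        lock = json.load(f)
    return {
        path.split("node_modules/")[-1]
        for path in lock.get("packages", {})
        if path  # "" is the root project entry
    }

def verify_report(report_text: str, lockfile_path: str) -> bool:
    # Pass only if every expected package is pinned in the lockfile and named in the report.
    pinned = locked_packages(lockfile_path)
    return all(pkg in pinned and pkg in report_text for pkg in EXPECTED_VULNERABLE)
```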
A second task showed a much smaller gain. The financial extraction result says most models already passed financial-report extraction 90% of the time without help. Adding a skill that supplied exact formulas and told the agent to use Python for arithmetic pushed that to 100%. That looks less like new capability and more like a guardrail against occasional arithmetic or procedural errors.
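The kind of step the skill steers the agent toward is trivial to show. The figures below are illustrative, not taken from the eval fixture; the point is that the computation happens in code rather than in-context arithmetic.

```python
from decimal import Decimal

# Illustrative numbers, not figures from the eval fixture.
revenue = Decimal("4821.7")          # $M
cost_of_revenue = Decimal("2109.3")  # $M

gross_margin = (revenue - cost_of_revenue) / revenue
print(f"gross margin: {gross_margin:.1%}")  # 56.3%
```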
The most useful result for engineering teams may be the negative one. In the sales-pivot analysis task, overall pass rate improved from 70% to 80%, but OpenHands says the effect "varied by model" and that one skill made at least one model less reliable by nudging it into a brittle execution path. That is the practical warning behind the whole release: skill libraries need per-task, per-model measurement before rollout, not blanket assumptions that more scaffolding helps.
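A sketch of that per-task, per-model bookkeeping follows, assuming each trial is logged as a (model, task, skill_enabled, passed) record; the schema is an illustrative choice, not OpenHands' logging format.

```python
from collections import defaultdict

def pass_rates(results):
    # results: iterable of (model, task, skill_enabled, passed) trial records.
    buckets = defaultdict(list)
    for model, task, skill_enabled, passed in results:
        buckets[(model, task, skill_enabled)].append(passed)
    return {key: sum(v) / len(v) for key, v in buckets.items()}

def regressions(results):
    # (model, task) pairs where the skill lowered the pass rate vs. the no-skill baseline.
    rates = pass_rates(results)
    return [
        (model, task, rates[(model, task, False)], rate)
        for (model, task, skill_enabled), rate in rates.items()
        if skill_enabled and rates.get((model, task, False), 0.0) > rate
    ]
```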