OpenHands published a skill-eval recipe with bounded tasks, deterministic verifiers, and no-skill baselines, then ran three tasks across five models to show that some skills speed agents up while others make them brittle. Teams shipping skill libraries should measure them per task and per model before rollout.

OpenHands' core claim is simple: a skill is not useful just because an agent completes a task after you add it. Its evaluation recipe says a credible test needs three parts: "a bounded task," "a deterministic verifier," and "a no-skill baseline." That setup is meant to isolate whether the skill changed outcomes, rather than rewarding prompt stuffing or unconstrained demos.
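The three-part recipe can be sketched as a small harness. This is a minimal illustration, not OpenHands' actual API: `run_task` and `verify` are hypothetical callables standing in for one agent invocation and the task's deterministic checker.

```python
import time
from statistics import mean

def run_eval(run_task, verify, trials=5):
    """Score one (agent, task, skill) combination over repeated trials.

    run_task: callable executing the bounded task once, returning the agent's output
    verify:   deterministic verifier -- the same output always yields the same verdict
    """
    passes, runtimes = 0, []
    for _ in range(trials):
        start = time.monotonic()
        output = run_task()
        runtimes.append(time.monotonic() - start)
        if verify(output):
            passes += 1
    return {"pass_rate": passes / trials, "mean_runtime_s": mean(runtimes)}

def skill_effect(run_without, run_with, verify, trials=5):
    """Compare a skill against the no-skill baseline on the same bounded task."""
    baseline = run_eval(run_without, verify, trials)
    treated = run_eval(run_with, verify, trials)
    return {
        "baseline": baseline,
        "with_skill": treated,
        # The delta, not the with-skill score alone, is what isolates the skill's effect
        "pass_rate_delta": treated["pass_rate"] - baseline["pass_rate"],
    }
```

The no-skill baseline is the load-bearing piece: without it, a 100% with-skill pass rate is indistinguishable from a task the model could already do.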
The team packaged that into a public tutorial repo and a blog post that walk through running the same task with and without a skill. The examples span dependency audits, financial-report extraction, and sales analysis, which makes the project more of an evaluation recipe than a single benchmark.
The strongest result came from the dependency-audit task, in which the agent had to inspect a package-lock.json and produce a vulnerability report. Without the skill, the pass rate was 0%. With it, the pass rate hit 100%, and runtime dropped by more than half, from 266 seconds to 109 seconds. OpenHands' explanation is that some skills are "essential" because they encode the workflow the task actually requires.
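A deterministic verifier for a task like this can derive the ground truth directly from the lockfile and demand an exact match. The advisory set and report format below are invented for illustration; a real verifier would pin a fixed advisory snapshot so verdicts never drift.

```python
import json

# Hypothetical pinned advisory snapshot: (package name, exact version) pairs
KNOWN_VULNERABLE = {("lodash", "4.17.20"), ("minimist", "1.2.5")}

def expected_findings(lockfile_text):
    """Deterministically derive the ground-truth finding set from package-lock.json."""
    lock = json.loads(lockfile_text)
    findings = set()
    for path, info in lock.get("packages", {}).items():
        # npm lockfile v2/v3 keys entries by install path, e.g. "node_modules/lodash"
        name = path.split("node_modules/")[-1] if path else lock.get("name", "")
        if (name, info.get("version", "")) in KNOWN_VULNERABLE:
            findings.add(name)
    return findings

def verify_report(report_text, lockfile_text):
    """Pass iff the report names every vulnerable package and nothing extra
    (assumed report format: one 'package: description' line per finding)."""
    reported = {line.split(":")[0].strip() for line in report_text.splitlines() if line.strip()}
    return reported == expected_findings(lockfile_text)
```

Because the verdict is computed from the lockfile rather than judged by a model, reruns with the same inputs always agree, which is what lets the 0% vs. 100% comparison carry weight.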
A second task showed a much smaller gain. Most models already passed financial-report extraction 90% of the time without help. Adding a skill that supplied exact formulas and told the agent to use Python for arithmetic pushed that to 100%. That looks less like new capability than a guardrail against occasional arithmetic or procedure errors.
The most useful result for engineering teams may be the negative one. In the sales-pivot analysis task, overall pass rate improved from 70% to 80%, but OpenHands says the effect "varied by model" and that one skill made at least one model less reliable by nudging it into a brittle execution path. That is the practical warning behind the whole release: skill libraries need per-task, per-model measurement before rollout, not blanket assumptions that more scaffolding helps.
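That per-task, per-model bookkeeping is straightforward to automate. A minimal sketch (the trial-record schema here is assumed, not from OpenHands) that flags model-level regressions an improved overall pass rate would otherwise hide:

```python
from collections import defaultdict

def per_model_regressions(trials):
    """trials: iterable of dicts with keys task, model, skill (bool), passed (bool).
    Returns (task, model, pass_rate_delta) tuples where the skill hurt a model."""
    buckets = defaultdict(lambda: [0, 0])  # (task, model, skill) -> [passes, total]
    for t in trials:
        b = buckets[(t["task"], t["model"], t["skill"])]
        b[0] += t["passed"]
        b[1] += 1

    regressions = []
    for task, model in {(k[0], k[1]) for k in buckets}:
        base = buckets.get((task, model, False))
        treat = buckets.get((task, model, True))
        if base and treat:
            delta = treat[0] / treat[1] - base[0] / base[1]
            if delta < 0:
                regressions.append((task, model, delta))  # the skill made this model worse
    return regressions
```

An aggregate 70% to 80% improvement can coexist with a negative delta for one model, which is exactly the case this per-cell breakdown surfaces.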
Vals AI switched SWE-Bench Verified from SWE-Agent to the bash-only mini-swe-agent harness, aligning results more closely with the official benchmark setup. Top score dipped slightly to 78.8%, but the change reduces harness-specific confounds when comparing models.
OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
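Cursor has not published Instant Grep's internals, but the core idea behind n-gram indexes is easy to show. A toy trigram inverted index (literal queries only, without the Bloom-filter layer or regex-to-trigram extraction a production system would add) prunes candidate files before the exact scan:

```python
import re
from collections import defaultdict

def trigrams(text):
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of file ids
        self.files = {}

    def add(self, file_id, text):
        self.files[file_id] = text
        for g in trigrams(text):
            self.postings[g].add(file_id)

    def search(self, literal):
        """Intersect posting lists to prune candidates, then confirm with a real scan:
        a file can only contain the literal if it contains every one of its trigrams."""
        grams = trigrams(literal)
        if not grams:
            candidates = set(self.files)  # query too short to prune anything
        else:
            candidates = set.intersection(*(self.postings[g] for g in grams))
        return sorted(f for f in candidates if re.search(re.escape(literal), self.files[f]))
```

The speedup comes from the intersection step: most files drop out before any text is scanned, so the expensive regex pass runs over a handful of candidates instead of the whole repo.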
ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.