OpenHands introduced EvoClaw, a benchmark that reconstructs milestone DAGs from repo history to test continuous software evolution instead of isolated tasks. The first results show agents can clear single tasks yet still collapse under regressions and technical debt over longer runs.

OpenHands' launch thread positions EvoClaw as a shift away from one-shot coding evals and toward the harder question of codebase maintenance across evolving requirements. The benchmark is built with OpenHands and evaluates agents on sequences of changes derived from real repositories rather than synthetic standalone tickets.
The design choice in the milestone post is to reconstruct milestone DAGs from repo history, so tasks are “meaningful, executable, and dependent on what came before.” That gives the benchmark a way to score not just whether an agent can land a patch, but whether it can extend a project without breaking earlier work. OpenHands' blog, paper, and leaderboard expand that setup beyond the thread.
The headline number from the results post is that continuous evolution is much harder than isolated tasks: scores that can exceed 80% on single tasks fall to a best overall score of 38.03% once agents have to operate across dependent milestones. OpenHands also separates “overall score” from “resolve rate,” with Claude Opus 4.6 plus OpenHands topping the former and Gemini 3 Pro plus Gemini CLI reaching the highest resolve rate at 13.37%.
The failure mode matters as much as the score. In the failure analysis, OpenHands says recall “keeps climbing,” meaning agents still add requested functionality, but precision “saturates much earlier,” so regressions accumulate faster than they can be repaired. The error-chain post adds that unresolved failures then propagate downstream through milestone dependencies, which makes long-horizon coding look less like a code-generation problem and more like a system-health problem. OpenHands' resource roundup calls EvoClaw “the benchmark to watch” for long-running coding agents, mainly because it exposes this maintenance gap directly.
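The propagation mechanic can be sketched with a toy graph traversal. This is an illustrative assumption, not EvoClaw's actual scoring code: milestone names, the edge format, and the `downstream_impact` helper are all hypothetical, showing only how a single unresolved failure taints every dependent milestone.

```python
from collections import defaultdict, deque

def downstream_impact(edges, failed):
    """Return every milestone transitively affected by the failed set.

    `edges` are (upstream, downstream) dependency pairs in a milestone DAG;
    a milestone is affected if any of its ancestors failed.
    """
    children = defaultdict(list)
    for up, down in edges:
        children[up].append(down)
    affected = set(failed)
    queue = deque(failed)
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

# Toy DAG: m1 -> m2 -> m4, and m1 -> m3.
edges = [("m1", "m2"), ("m2", "m4"), ("m1", "m3")]
print(sorted(downstream_impact(edges, {"m2"})))  # ['m2', 'm4']
```

In a chain-shaped history, one early regression poisons most of the run, which is one way a per-task pass rate above 80% can still collapse to a low overall score.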
LLM Debate Benchmark ran 1,162 side-swapped debates across 21 models and ranked Sonnet 4.6 first, ahead of GPT-5.4 high. It adds a stronger adversarial eval pattern for judge or debate systems, but you should still inspect content-block rates and judge selection when reading the leaderboard.
release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
breaking: ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
breaking: Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
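The core trick behind this style of index can be shown in a few lines. The sketch below is a generic trigram inverted index, not Cursor's implementation; a production engine would add Bloom filters and smarter n-gram selection on top, and the `TrigramIndex` class and its methods are hypothetical names for illustration.

```python
from collections import defaultdict

def trigrams(s):
    """All 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TrigramIndex:
    """Toy inverted index mapping trigrams to file ids.

    A file can match a literal only if it contains every trigram of that
    literal, so intersecting posting sets yields a small candidate set and
    the expensive regex scan runs only on the survivors.
    """
    def __init__(self):
        self.index = defaultdict(set)

    def add(self, file_id, text):
        for g in trigrams(text):
            self.index[g].add(file_id)

    def candidates(self, literal):
        grams = trigrams(literal)
        if not grams:  # literal shorter than 3 chars: cannot narrow
            return set().union(*self.index.values()) if self.index else set()
        sets = [self.index.get(g, set()) for g in grams]
        return set.intersection(*sets) if all(sets) else set()

idx = TrigramIndex()
idx.add(1, "def parse_config(path):")
idx.add(2, "fn main() {}")
print(idx.candidates("parse"))  # {1}
```

For a regex query, the searcher would extract required literals from the pattern first, then run the full regex only over the candidate files, which is where the seconds-to-milliseconds speedup comes from.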
Result: isolated-task scores can exceed 80%, but performance drops hard in continuous evolution.
Best overall score: 38.03% (Claude Opus 4.6 + OpenHands)
Highest resolve rate: 13.37% (Gemini 3 Pro + Gemini CLI)
EvoClaw's error-chain analysis shows how unresolved failures propagate downstream through milestone dependencies. Long-horizon coding is less about generating more code and more about preserving system health over time.