Vals AI switched SWE-Bench Verified from SWE-Agent to the bash-only mini-swe-agent harness, aligning results more closely with the official benchmark setup. Top score dipped slightly to 78.8%, but the change reduces harness-specific confounds when comparing models.

Vals AI changed the harness to reduce agent-specific scaffolding in its SWE-Bench Verified runs. In a follow-up post, the team says more complex harnesses can improve results, but those gains can blur whether a model is genuinely better at repo repair or simply better tuned to a particular agent stack.
The replacement is deliberately narrower. Vals AI's explanation describes mini-swe-agent as a "neutral evaluation setup" that tests models using only standard command-line tools, and says that choice also brings Vals closer to the official SWE-bench leaderboard's default harness. For engineers comparing coding models across leaderboards, that makes Vals' numbers easier to map to the benchmark's baseline setup.
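To make "bash-only" concrete: a harness in this style gives the model no bespoke tools, just a loop that sends the task, executes whatever shell command the model emits, and feeds the output back. The sketch below is a minimal illustration of that loop, not mini-swe-agent's actual code; `query_model`, `run`, and the SUBMIT stop convention are hypothetical stand-ins.

```python
import subprocess

def query_model(history: list[dict]) -> str:
    """Hypothetical stand-in for an LLM API call; returns the model's next message."""
    raise NotImplementedError

def run(command: str, timeout: int = 60) -> str:
    """Execute one shell command and return its combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def agent_loop(task: str, max_steps: int = 50) -> list[dict]:
    """Bash-only loop: the model's entire action space is 'emit a shell command'."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_model(history)
        history.append({"role": "assistant", "content": reply})
        if "SUBMIT" in reply:  # assumed stop convention, not the real protocol
            break
        history.append({"role": "user", "content": run(reply)})
    return history
```

With no custom tools in the loop, score differences are easier to attribute to the model rather than the harness, which is the point of the switch.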
The harness change did not materially reshuffle results. Vals reports in its results update that performance changed by only "a few percentage points for most providers," with the best score edging down from 79.2% to 78.8%.
The more surprising change was in the middle of the pack. The same results update says the average score increased from 63.8% to 65.9%, which suggests the simpler harness did not uniformly depress outcomes across vendors.
Vals says in the closing post that the full results are available on its website, but the thread's headline takeaway is narrower: switching to a bash-only harness changed the absolute top line only slightly while reducing one source of harness-specific variance in SWE-Bench Verified comparisons.
LLM Debate Benchmark ran 1,162 side-swapped debates across 21 models and ranked Sonnet 4.6 first, ahead of GPT-5.4 high. It adds a stronger adversarial eval pattern for judge or debate systems, but you should still inspect content-block rates and judge selection when reading the leaderboard.
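For readers building similar evals, the side-swapped design is the load-bearing detail: every pairing is judged once in each orientation, so a judge's preference for the "pro" slot cancels out instead of inflating one model's score. Here is a minimal sketch of that pattern, assuming hypothetical `debate` and `judge` helpers rather than the benchmark's real harness.

```python
def debate(pro_model: str, con_model: str, topic: str) -> str:
    """Hypothetical: run one debate and return its transcript."""
    raise NotImplementedError

def judge(transcript: str) -> str:
    """Hypothetical: return 'pro' or 'con' for the side the judge found stronger."""
    raise NotImplementedError

def side_swapped_pairing(model_a: str, model_b: str, topic: str) -> dict[str, float]:
    """Judge each pairing twice with sides reversed so position bias cancels."""
    wins = {model_a: 0.0, model_b: 0.0}
    for pro, con in [(model_a, model_b), (model_b, model_a)]:
        verdict = judge(debate(pro, con, topic))
        winner = pro if verdict == "pro" else con
        wins[winner] += 0.5  # each orientation carries half the pairing's weight
    return wins
```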
OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
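Cursor's description matches the classic trigram-index approach popularized by Google Code Search: derive n-grams that any match must contain, discard files that lack them, and run the full regex only over the survivors. The sketch below shows the pruning idea only; it takes a literal substring of the pattern as an explicit hint, whereas a production system derives the required trigrams from the regex itself, and `TrigramIndex` is a hypothetical toy, not Cursor's implementation.

```python
import re

def trigrams(text: str) -> set[str]:
    """All 3-character substrings of a string; the toy per-file index."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Toy candidate-pruning index; real systems store inverted posting
    lists or per-file Bloom filters instead of raw trigram sets."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}
        self.grams: dict[str, set[str]] = {}

    def add(self, path: str, text: str) -> None:
        self.files[path] = text
        self.grams[path] = trigrams(text)

    def search(self, pattern: str, literal_hint: str) -> list[str]:
        """Prune: any match for `pattern` must contain `literal_hint`,
        so files missing one of its trigrams cannot match. The full
        regex runs only over the surviving candidates."""
        required = trigrams(literal_hint)
        candidates = (p for p, g in self.grams.items() if required <= g)
        return [p for p in candidates if re.search(pattern, self.files[p])]

# Usage:
# idx = TrigramIndex()
# idx.add("main.py", "def handle_request(req): ...")
# idx.search(r"handle_\w+", literal_hint="handle_")  # -> ["main.py"]
```

Bloom filters suit this design because false positives are harmless: a file that slips through pruning is simply rejected by the final regex pass, while the filter keeps the index's memory footprint small.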
ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.