Vals AI switched SWE-Bench Verified from SWE-Agent to the bash-only mini-swe-agent harness, aligning results more closely with the official benchmark setup. Top score dipped slightly to 78.8%, but the change reduces harness-specific confounds when comparing models.

Vals AI changed the harness to reduce agent-specific scaffolding in its SWE-Bench Verified runs. In a follow-up post, the team says more complex harnesses can improve results, but those gains can blur whether a model is genuinely better at repo repair or simply better tuned to a particular agent stack.
The replacement is deliberately narrower. Vals AI's explanation describes mini-swe-agent as a "neutral evaluation setup" that tests models using only standard command-line tools, and says that choice also brings Vals closer to the official SWE-bench leaderboard's default harness. For engineers comparing coding models across leaderboards, that makes Vals' numbers easier to map to the benchmark's baseline setup.
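To make "bash-only" concrete: a harness in this style gives the model no bespoke tools, just a loop that sends the task, executes whatever shell command the model emits, and feeds the output back. The sketch below is a minimal illustration of that loop, not mini-swe-agent's actual code; `query_model`, `run`, and the SUBMIT stop convention are hypothetical stand-ins.

```python
import subprocess

def query_model(history: list[dict]) -> str:
    """Hypothetical stand-in for an LLM API call; returns the model's next message."""
    raise NotImplementedError

def run(command: str, timeout: int = 60) -> str:
    """Execute one shell command and return its combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def agent_loop(task: str, max_steps: int = 50) -> list[dict]:
    """Bash-only loop: the model's entire action space is 'emit a shell command'."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_model(history)
        history.append({"role": "assistant", "content": reply})
        if "SUBMIT" in reply:  # assumed stop convention, not the real protocol
            break
        history.append({"role": "user", "content": run(reply)})
    return history
```

With no custom tools in the loop, score differences are easier to attribute to the model rather than the harness, which is the point of the switch.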
The harness change did not materially reshuffle results. Vals reports in its results update that performance changed by only "a few percentage points for most providers," with the best score edging down from 79.2% to 78.8%.
The more surprising change was in the middle of the pack. The same results update says the average score increased from 63.8% to 65.9%, which suggests the simpler harness did not uniformly depress outcomes across vendors.
Vals says in the closing post that the full results are available on its website, but the thread's headline takeaway is narrower: switching to a bash-only harness changed the absolute top line only slightly while reducing one source of harness-specific variance in SWE-Bench Verified comparisons.
LLM Debate Benchmark ran 1,162 side-swapped debates across 21 models and ranked Sonnet 4.6 first, ahead of GPT-5.4 high. It adds a stronger adversarial eval pattern for judge or debate systems, but you should still inspect content-block rates and judge selection when reading the leaderboard.
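For readers building similar evals, the side-swapped design is the load-bearing detail: every pairing is judged once in each orientation, so a judge's preference for the "pro" slot cancels out instead of inflating one model's score. Here is a minimal sketch of that pattern, assuming hypothetical `debate` and `judge` helpers rather than the benchmark's real harness.

```python
def debate(pro_model: str, con_model: str, topic: str) -> str:
    """Hypothetical: run one debate and return its transcript."""
    raise NotImplementedError

def judge(transcript: str) -> str:
    """Hypothetical: return 'pro' or 'con' for the side the judge found stronger."""
    raise NotImplementedError

def side_swapped_pairing(model_a: str, model_b: str, topic: str) -> dict[str, float]:
    """Judge each pairing twice with sides reversed so position bias cancels."""
    wins = {model_a: 0.0, model_b: 0.0}
    for pro, con in [(model_a, model_b), (model_b, model_a)]:
        verdict = judge(debate(pro, con, topic))
        winner = pro if verdict == "pro" else con
        wins[winner] += 0.5  # each orientation carries half the pairing's weight
    return wins
```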
OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
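Cursor's description matches the classic trigram-index approach popularized by Google Code Search: derive n-grams that any match must contain, discard files that lack them, and run the full regex only over the survivors. The sketch below shows the pruning idea only; it takes a literal substring of the pattern as an explicit hint, whereas a production system derives the required trigrams from the regex itself, and `TrigramIndex` is a hypothetical toy, not Cursor's implementation.

```python
import re

def trigrams(text: str) -> set[str]:
    """All 3-character substrings of a string; the toy per-file index."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Toy candidate-pruning index; real systems store inverted posting
    lists or per-file Bloom filters instead of raw trigram sets."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}
        self.grams: dict[str, set[str]] = {}

    def add(self, path: str, text: str) -> None:
        self.files[path] = text
        self.grams[path] = trigrams(text)

    def search(self, pattern: str, literal_hint: str) -> list[str]:
        """Prune: any match for `pattern` must contain `literal_hint`,
        so files missing one of its trigrams cannot match. The full
        regex runs only over the surviving candidates."""
        required = trigrams(literal_hint)
        candidates = (p for p, g in self.grams.items() if required <= g)
        return [p for p in candidates if re.search(pattern, self.files[p])]

# Usage:
# idx = TrigramIndex()
# idx.add("main.py", "def handle_request(req): ...")
# idx.search(r"handle_\w+", literal_hint="handle_")  # -> ["main.py"]
```

Bloom filters suit this design because false positives are harmless: a file that slips through pruning is simply rejected by the final regex pass, while the filter keeps the index's memory footprint small.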
ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.