Breaking · March 23, 2026

LLM Debate Benchmark ranks Sonnet 4.6 first across 1,162 side-swapped debates

LLM Debate Benchmark ran 1,162 side-swapped debates across 21 models and ranked Sonnet 4.6 first, ahead of GPT-5.4 high. It adds a stronger adversarial eval pattern for judge or debate systems, but you should still inspect content-block rates and judge selection when reading the leaderboard.

TL;DR

Lech Mazur’s new LLM Debate Benchmark covers 21 models and 1,162 debates run on side-swapped versions of the same motion. Sonnet 4.6 high ranks first, ahead of GPT-5.4 high, and GLM-5 leads the open-weights models.
The benchmark is aimed at adversarial, multi-turn performance rather than one-shot answer quality: as the benchmark thread puts it, models need “strong rebuttal” and the ability to stay “coherent, responsive, and defensible over several rounds.”
Methodologically, the format details matter: each debate has 10 turns, rankings use Bradley-Terry over side-swapped matchups, and the judging setup uses three LLM judges sampled from a six-model pool while avoiding same-family judging.
Engineers reading the leaderboard should also inspect refusal behavior and artifacts, because