
LLM Debate Benchmark ranks Sonnet 4.6 first across 1,162 side-swapped debates

LLM Debate Benchmark ran 1,162 side-swapped debates across 21 models and ranked Sonnet 4.6 first, ahead of GPT-5.4 high. It adds a stronger adversarial eval pattern for judge or debate systems, but you should still inspect content-block rates and judge selection when reading the leaderboard.


TL;DR

  • Lech Mazur’s benchmark launch introduces a new LLM Debate Benchmark: 21 models and 1,162 debates, with each motion run in side-swapped versions. Sonnet 4.6 high ranks first, ahead of GPT-5.4 high, and GLM-5 leads the open-weights pack.
  • The benchmark is aimed at adversarial, multi-turn performance rather than one-shot answer quality: as the benchmark thread puts it, models need “strong rebuttal” and the ability to stay “coherent, responsive, and defensible over several rounds.”
  • Methodologically, the format details matter: each debate has 10 turns, rankings use Bradley-Terry over side-swapped matchups (a minimal sketch of that fit follows this list), and the judging setup uses three LLM judges sampled from a six-model pool while avoiding same-family judging.
  • Engineers reading the leaderboard should also inspect refusal behavior and artifacts, because the coverage note reports content-block rates ranging from 3.8% for Grok 4.20 Beta 0309 to 10.4% for Xiaomi MiMo V2 Pro, and the project exposes charts, transcripts, and code through the release thread and the repo.
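
To make the ranking step concrete, here is a minimal Bradley-Terry fit over pairwise results, reconstructed from the description above rather than taken from the benchmark’s code; the input format, function names, and toy data are illustrative assumptions.

```python
from collections import defaultdict

def bradley_terry(results, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    With side-swapped matchups, each motion contributes one result per
    side assignment, which is what controls for topic asymmetry.
    """
    wins = defaultdict(float)        # total wins per model
    pair_games = defaultdict(float)  # games played per unordered pair
    models = set()
    for winner, loser in results:
        wins[winner] += 1
        pair_games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    # Hunter-style MM updates for the Bradley-Terry maximum likelihood.
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            denom = sum(
                pair_games[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models
                if o != m
            )
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: v * len(models) / total for m, v in updated.items()}
    return strength  # higher strength implies higher rank

# Toy example with made-up results, not benchmark data:
ratings = bradley_terry([("model-a", "model-b"),
                         ("model-b", "model-c"),
                         ("model-a", "model-c")])
print(sorted(ratings, key=ratings.get, reverse=True))
```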

What does this benchmark add beyond normal evals?

This eval tests debate, not just answer generation. According to the benchmark thread, it targets “adversarial, multi-turn debates,” where models must combine factual recall, rebuttal quality, and consistency across several rounds. That is closer to the ways judge models, debate agents, and structured planning systems fail in production than a static single-turn benchmark can capture.

The task mix is broad enough to stress generality rather than a narrow policy niche. Mazur’s coverage note says the set spans 683 curated motions, from shrinkflation labeling to eurozone fallout and dating-app market structure. The released examples in quotable lines show why this is interesting for practitioners: models are being judged on whether they can produce concise, defensible claims such as “You cannot reject a trap you cannot see” or “It is exclusion protected by aesthetics,” not just retrieve facts.

How was it built, and what should engineers check before trusting the ranking?

The design has a few strong controls. Mazur’s format details say each debate runs 10 turns, with openings, two rebuttals, a pressure-question exchange, and closings. Rankings are computed with Bradley-Terry over side-swapped matchups, which helps control for topic asymmetry. Completed debates are judged by three LLM judges drawn from a six-model pool, and the judging setup excludes judges from the same model family as either debater.
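
As an illustration of that judging constraint, a selection routine like the following would sample three judges from a six-model pool while excluding the debaters’ families; the pool contents and family labels are placeholders, not the benchmark’s actual configuration.

```python
import random

# Hypothetical six-model judge pool; family labels are placeholders.
JUDGE_POOL = {
    "judge-a": "family-1",
    "judge-b": "family-2",
    "judge-c": "family-3",
    "judge-d": "family-4",
    "judge-e": "family-5",
    "judge-f": "family-6",
}

def pick_judges(debater_families, k=3):
    """Sample k judges whose model family matches neither debater's."""
    eligible = [judge for judge, family in JUDGE_POOL.items()
                if family not in debater_families]
    if len(eligible) < k:
        raise ValueError("not enough out-of-family judges in the pool")
    return random.sample(eligible, k)

# A debate between models from family-1 and family-2 leaves four
# eligible judges, from which three are drawn at random.
print(pick_judges({"family-1", "family-2"}))
```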

The leaderboard still needs to be read with caveats. The coverage note includes content-block rates, which can change effective participation and comparability across motions. The sample transcript in sample debate also shows how much the results depend on prompt format and interaction structure: a Sonnet 4.6 adaptive vs. GPT-5.4 high match turns on whether consent creates a meaningful “paper trail” or just “consent theater,” and even the judges’ notes split over the winner. That makes judge composition and transcript inspection part of the eval, not an afterthought.
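
One practical check is to recompute content-block rates per model from the published transcripts before leaning on the rankings; the record layout below is an assumed schema for illustration, not the repo’s actual format.

```python
from collections import Counter

def content_block_rates(records):
    """Compute per-model content-block rates.

    `records` is an iterable of dicts with "model" and "blocked" keys,
    an assumed layout to adapt to the transcripts actually published.
    """
    turns = Counter()
    blocked = Counter()
    for record in records:
        turns[record["model"]] += 1
        blocked[record["model"]] += bool(record["blocked"])
    return {model: blocked[model] / turns[model] for model in turns}

# Toy records, not real benchmark data:
sample = [
    {"model": "model-x", "blocked": True},
    {"model": "model-x", "blocked": False},
    {"model": "model-y", "blocked": False},
]
print(content_block_rates(sample))  # {'model-x': 0.5, 'model-y': 0.0}
```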

The project is unusually inspectable. The long-form posts in the full thread mirror and release thread point to charts, transcripts, model profiles, reports, judgments, and the GitHub repo, and a separate repo link post surfaces the code directly. For teams building debate, arbitration, or self-critique systems, that makes this more useful as a reproducible eval pattern than as a simple winner board.
