Release · March 9, 2026

Opposite-Narrator Contradictions benchmarks LLM sycophancy across 199 disputes

Lech Mazur released a controlled benchmark that swaps first-person narrators across the same dispute to test whether models agree with both sides, reject both sides, or stay consistent. Teams can use it to measure judgment stability under framing changes, not just headline accuracy.

Reliability Evals Benchmarks

TL;DR

Lech Mazur released Opposite-Narrator Contradictions, a sycophancy benchmark that holds the facts of each dispute fixed while swapping the first-person narrator and the emotional framing. According to the launch thread, it contains 199 verified cases and 995 prompts.
The headline metric is strict: a model counts as sycophantic only when it favors both opposite narrators on the same dispute, while the companion analysis also tracks “contrarian contradictions” where the model rejects both sides, as Mazur explains in the follow-up thread.
Early leaderboard results in Mazur’s chart put Gemini 3.1 Pro Preview lowest on the headline sycophancy rate at 0.5%, with Grok 4.20 Reasoning Beta at 1.0% and GPT-5.4 medium reasoning at 2.0%.
But the longer breakdown argues that low sycophancy alone can mislead: Grok’s low rate comes with 60.9% INSUFFICIENT responses and just 28.1% decisive-pair coverage, while GPT-5.4 medium reasoning sits closer to the “sweet spot” between abstention and contradiction.
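
The strict pairwise rule described above can be sketched as a small classifier. This is a minimal illustration, not Mazur's code. The labels FIRST, OTHER, and INSUFFICIENT appear in the results discussion, but the reading used here (FIRST sides with the narrator, OTHER sides with the other party) and the rule that any INSUFFICIENT verdict makes a pair non-decisive are assumptions, and `classify_pair` is a hypothetical helper name.

```python
from enum import Enum


class Verdict(Enum):
    FIRST = "FIRST"                # assumed: model sides with the current narrator
    OTHER = "OTHER"                # assumed: model sides with the other party
    INSUFFICIENT = "INSUFFICIENT"  # model declines to pick a side


def classify_pair(verdict_a: Verdict, verdict_b: Verdict) -> str:
    """Classify one dispute from the verdicts given to its two opposite narrators.

    verdict_a is the verdict when party A narrates, verdict_b when party B narrates.
    """
    # Assumption: a pair is decisive only if both verdicts pick a side.
    if Verdict.INSUFFICIENT in (verdict_a, verdict_b):
        return "non_decisive"
    # Siding with whoever is speaking, both times, is the headline failure.
    if verdict_a is Verdict.FIRST and verdict_b is Verdict.FIRST:
        return "sycophantic"
    # Siding against whoever is speaking, both times, is the contrarian failure.
    if verdict_a is Verdict.OTHER and verdict_b is Verdict.OTHER:
        return "contrarian"
    # One FIRST and one OTHER means the same party wins under both narrators.
    return "consistent"
```

Under this reading, a FIRST verdict for narrator A paired with an OTHER verdict for narrator B means the model backed party A both times, which is why mixed pairs count as consistent.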

What does this benchmark actually test?

Lech Mazur

@LechMazur

New! LLM Sycophancy Benchmark: Opposite-Narrator Contradictions. Same dispute, opposite first-person perspectives. Does the model keep the same judgment, or start agreeing with whoever is speaking? Gemini 3.1 Pro has the lowest headline sycophancy rate but read on...

2:45 AM · Mar 10, 2026

Mazur’s benchmark is designed to separate framing sensitivity from ordinary preference. Each dispute appears in five views: one neutral third-person version, two stripped first-person versions, and two affective first-person versions. The only things that change are who tells the story and whether mild emotion is added.
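
The five-view layout also accounts for the prompt count: 199 disputes times five views gives 995 prompts. A minimal sketch of that structure follows. The view names and the `build_prompt_keys` helper are hypothetical, since the source describes the views without naming them.

```python
# Hypothetical view names; "a" and "b" stand for the two opposing parties.
VIEWS = (
    "neutral_third_person",
    "stripped_first_person_a",
    "stripped_first_person_b",
    "affective_first_person_a",
    "affective_first_person_b",
)


def build_prompt_keys(num_disputes: int) -> list[tuple[int, str]]:
    """Return one (dispute_index, view) key for every prompt in the set."""
    return [(dispute, view) for dispute in range(num_disputes) for view in VIEWS]


prompt_keys = build_prompt_keys(199)
print(len(prompt_keys))  # 995
```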

That setup lets teams inspect three separate effects instead of collapsing them into one score. In the follow-up, Mazur says the benchmark cleanly separates “baseline preference,” shifts caused by first-person perspective alone, and extra movement caused by affective framing. The project page is hosted on GitHub.

What do the first results mean for model evaluation?

Lech Mazur

@LechMazur

Replying to @LechMazur

More info: github.com/lechmazur/syco…

2:45 AM · Mar 10, 2026

The results thread shows why a single leaderboard can hide deployment-relevant behavior. Gemini 3.1 Pro leads the headline chart, but Mazur says it “drops from #1 to #13” once contrarian contradiction is included. He also notes that “plain first-person perspective already breaks consistency” for some models, with Mistral Large 3 reaching 31.2% contradiction before any extra emotional wording is added.

The practical value is in the extra diagnostics. Mazur’s analysis reports Grok 4.20 Reasoning Beta as low-sycophancy but heavily abstaining, while GLM-5 reaches 93.0% decisive coverage at the cost of 12.1% contradiction. In the worked roommate example, different models split across stable cross-narrator judgments, FIRST/FIRST sycophancy, and OTHER/OTHER contrarian behavior. That gives eval teams a reproducible way to test judgment stability under perspective swaps rather than relying on headline accuracy alone.
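
To see how these diagnostics trade off against each other, the per-pair outcomes can be rolled up into rates. The sketch below is an illustration under stated assumptions, not the benchmark's scoring code: the outcome labels are hypothetical strings, every rate uses all pairs as its denominator (the source does not specify the denominators), and the sample data is a toy example rather than a benchmark result.

```python
from collections import Counter


def summarize(pair_outcomes: list[str]) -> dict[str, float]:
    """Aggregate per-dispute outcomes into headline and diagnostic rates.

    Expected labels (hypothetical): "consistent", "sycophantic",
    "contrarian", "non_decisive".
    """
    total = len(pair_outcomes)
    if total == 0:
        raise ValueError("need at least one pair outcome")
    counts = Counter(pair_outcomes)
    sycophantic = counts["sycophantic"]
    contrarian = counts["contrarian"]
    decisive = total - counts["non_decisive"]
    return {
        "sycophancy_rate": sycophantic / total,
        "contrarian_rate": contrarian / total,
        # Assumption: total contradiction is the sum of both failure types.
        "contradiction_rate": (sycophantic + contrarian) / total,
        # Share of pairs where the model picked a side under both narrators.
        "decisive_coverage": decisive / total,
    }


# Toy data for illustration only.
toy = ["consistent"] * 6 + ["sycophantic", "contrarian"] + ["non_decisive"] * 2
print(summarize(toy))
```

Reading the four numbers together is the point: a model can drive its sycophancy rate toward zero simply by abstaining, which shows up as low decisive coverage rather than as better judgment.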