Lech Mazur released a controlled benchmark that swaps first-person narrators across the same dispute to test whether models agree with both sides, reject both sides, or stay consistent. Teams can use it to measure judgment stability under framing changes, not just headline accuracy.

Mazur’s benchmark is designed to separate framing sensitivity from ordinary preference. Each dispute appears in five views: one neutral third-person version, two stripped first-person versions (one per party), and two affective first-person versions. The only things that change are who is telling the story and whether mild emotion is added.
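As a rough illustration of the five-view layout, the sketch below models one dispute as a small Python record. The class names, field names, and the "A"/"B" party labels are hypothetical and chosen for this example; they are not taken from Mazur's repository.

```python
from dataclasses import dataclass, field
from enum import Enum


class View(Enum):
    """The five views described above (names are illustrative assumptions)."""
    NEUTRAL = "neutral_third_person"
    STRIPPED_A = "stripped_first_person_a"    # party A narrates, no emotion
    STRIPPED_B = "stripped_first_person_b"    # party B narrates, no emotion
    AFFECTIVE_A = "affective_first_person_a"  # party A narrates, mild emotion
    AFFECTIVE_B = "affective_first_person_b"  # party B narrates, mild emotion


@dataclass
class Dispute:
    """One dispute with the prompt text for each view (hypothetical schema)."""
    dispute_id: str
    prompts: dict = field(default_factory=dict)  # maps View -> prompt text

    def missing_views(self) -> list:
        """Return the views that still lack a prompt, in declaration order."""
        return [view for view in View if view not in self.prompts]


# Usage: a dispute is only comparable across framings once all five exist.
roommates = Dispute("roommate-example", {View.NEUTRAL: "Two roommates disagree about ..."})
print(len(roommates.missing_views()))  # 4
```

Keeping all five prompts under one dispute identifier is what allows later comparisons to hold the underlying facts fixed while only the narrator and tone vary.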
That setup lets teams inspect three distinct effects instead of collapsing them into one score. In the follow-up, Mazur says the benchmark cleanly separates “baseline preference,” changes caused by first-person perspective alone, and extra movement caused by affective framing. The project is published on GitHub.
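One way to picture that three-way separation is to compare verdicts across the views of a single dispute. The sketch below is a simplification under stated assumptions: verdicts are recorded as the party the model sides with ("A" or "B"), the dictionary keys are hypothetical, and counting flips is a stand-in for whatever scoring the benchmark actually uses.

```python
def decompose(verdicts: dict) -> dict:
    """Split one dispute's verdicts into three components.

    `verdicts` maps hypothetical view names to the party the model sided with.
    """
    neutral = verdicts["neutral"]
    stripped = [verdicts["stripped_a"], verdicts["stripped_b"]]
    affective = [verdicts["affective_a"], verdicts["affective_b"]]
    return {
        # What the model prefers with no narrator at all.
        "baseline_preference": neutral,
        # Changes caused by first-person perspective alone.
        "perspective_flips": sum(v != neutral for v in stripped),
        # Extra movement once mild emotion is added to the same narrator.
        "affective_extra_flips": sum(a != s for a, s in zip(affective, stripped)),
    }


example = {
    "neutral": "A",
    "stripped_a": "A",
    "stripped_b": "B",   # siding with B only when B narrates
    "affective_a": "A",
    "affective_b": "B",
}
print(decompose(example))
```

In this example the model already shifts when party B narrates in plain language, so the movement is attributed to perspective rather than to emotional wording.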
The results thread shows why a single leaderboard can hide deployment-relevant behavior. Gemini 3.1 Pro leads the headline chart, but Mazur says it “drops from #1 to #13” once contrarian contradiction is included. He also calls out that “plain first-person perspective already breaks consistency” for some models, with Mistral Large 3 reaching 31.2% contradiction before extra emotional wording is added.
The practical value is in the extra diagnostics. Mazur’s analysis reports Grok 4.20 Reasoning Beta as low-sycophancy but heavily abstaining, while GLM-5 reaches 93.0% decisive coverage at the cost of 12.1% contradiction. In the worked roommate example, different models split across stable cross-narrator judgments, FIRST/FIRST sycophancy, and OTHER/OTHER contrarian behavior. That gives eval teams a reproducible way to test judgment stability under perspective swaps rather than relying on headline accuracy alone.
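The pair categories above can be sketched as a small classifier. This is a minimal sketch, not the benchmark's scoring code: it assumes each first-person verdict is recorded relative to the narrator as "FIRST" (sides with the speaker) or "OTHER" (sides with the other party), uses a hypothetical "ABSTAIN" label for non-decisive answers, and computes every rate over all pairs, which may differ from Mazur's definitions.

```python
from collections import Counter


def classify_pair(when_a_narrates: str, when_b_narrates: str) -> str:
    """Classify one opposite-narrator pair of verdicts."""
    if "ABSTAIN" in (when_a_narrates, when_b_narrates):
        return "abstain"
    if when_a_narrates == when_b_narrates == "FIRST":
        return "sycophantic"  # agrees with whoever is speaking
    if when_a_narrates == when_b_narrates == "OTHER":
        return "contrarian"   # rejects whoever is speaking
    return "stable"           # same party favored under both narrators


def summarize(pairs: list) -> dict:
    """Aggregate pair labels into rates over all pairs (assumed denominators)."""
    counts = Counter(classify_pair(a, b) for a, b in pairs)
    total = len(pairs)
    contradictions = counts["sycophantic"] + counts["contrarian"]
    return {
        "decisive_coverage": (total - counts["abstain"]) / total,
        "sycophancy_rate": counts["sycophantic"] / total,
        "contrarian_rate": counts["contrarian"] / total,
        "contradiction_rate": contradictions / total,
    }


pairs = [("FIRST", "FIRST"), ("OTHER", "OTHER"), ("FIRST", "OTHER"), ("ABSTAIN", "FIRST")]
print(summarize(pairs))
```

Reporting coverage next to contradiction makes the trade-off visible: a model can lower its contradiction rate simply by abstaining more often, which a single headline number would hide.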
From the announcement: “New! LLM Sycophancy Benchmark: Opposite-Narrator Contradictions. Same dispute, opposite first-person perspectives. Does the model keep the same judgment, or start agreeing with whoever is speaking? Gemini 3.1 Pro has the lowest headline sycophancy rate but read on...”