Using models to score or review model outputs.
HN follow-up on Stanford's sycophancy study focused on mitigations like confidence scores, compare-and-contrast prompting, and separate evaluator agents. Commenters argued the same failure mode can distort coding and architecture decisions, not just personal advice, so teams should watch for overconfident agent output.
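One of the mitigations mentioned, compare-and-contrast prompting for a separate evaluator agent, can be sketched as a prompt-construction helper. The function name and prompt wording below are illustrative assumptions, not taken from the study:

```python
def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Build a compare-and-contrast prompt for a separate evaluator agent.

    Hypothetical sketch: forcing the judge to argue both sides before
    choosing, and to emit an explicit confidence score, are two of the
    mitigations discussed for sycophantic or overconfident output.
    """
    return (
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "First list the strongest points of Answer A, then of Answer B. "
        "Next list the weaknesses of each. Only after that, state which "
        "answer is better and give a confidence score from 0 to 1."
    )
```

The ordering matters: asking for strengths and weaknesses of both candidates before a verdict makes it harder for the judge to simply ratify whichever answer sounds more agreeable.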
LLM Debate Benchmark ran 1,162 side-swapped debates (each pairing argued from both sides, so position does not confound the outcome) across 21 models and ranked Sonnet 4.6 first, ahead of GPT-5.4 high. It adds a stronger adversarial eval pattern for judge or debate systems, but you should still inspect content-block rates and judge selection before trusting the leaderboard.
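The side-swapping idea can be sketched as a small aggregation routine. This is a minimal illustration of the pattern, not the benchmark's actual scoring code; the tuple layout and function names are assumptions:

```python
from collections import defaultdict

def side_swapped_scores(results):
    """Compute per-model win rates from side-swapped debate results.

    `results` is a list of (pro_model, con_model, winner) tuples in which
    each unordered pair appears twice, once with each model arguing the
    "pro" side, so positional advantage averages out.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for pro, con, winner in results:
        games[pro] += 1
        games[con] += 1
        wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}

def pro_side_win_rate(results):
    """Diagnostic: fraction of debates won by whichever side argued 'pro'.

    A value far from 0.5 suggests the judge is position-biased, which is
    exactly the kind of thing to check before trusting a leaderboard.
    """
    return sum(1 for pro, _, winner in results if winner == pro) / len(results)
```

Running the bias diagnostic alongside the ranking is the point of the side-swap design: if `pro_side_win_rate` deviates strongly from 0.5, the rankings reflect judge bias as much as model skill.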