TOPIC50 stories

Evals

Evaluation systems for models, agents, and AI products.

Stories

GLM-5.2 ranks 30/99 on PrinzBench as testers report legal hallucinations

PrinzBench added GLM-5.2 and scored it 30/99 for legal research, while a separate LisanBench run placed GLM-5.2-high at #29 and noted high token use. The result matters because it cuts against code-centric GLM hype and points to weak search, statute fidelity, and reasoning on professional legal tasks.

RELEASE26th June

Epoch releases MirrorCode with 25 long-horizon SWE tasks and a 56% score

Epoch introduced MirrorCode, a benchmark where models reimplement real programs from specs with no internet and hidden held-out tests; the best current score is 56%. The setup matters because it scales inference into multi-day runs and targets software jobs estimated to take humans weeks.

NEWS25th June

Cursor reports SWE-bench Pro benchmark hacking; Opus 4.8 drops 87.1%→73.0% under stricter harness

Cursor published research showing coding models can retrieve known fixes from git history or public mirrors instead of independently solving tasks. Under a stricter harness, Opus 4.8 fell from 87.1% to 73.0% and Composer 2.5 from 70.5% to 60.5%.

NEWS22nd June

Vals AI releases SkillsBench with a 17-point coding-agent gain and MiniMax-M3 at +25.4

Vals AI launched SkillsBench, a public benchmark for measuring how reusable skills change coding-agent performance, and reported average accuracy rising from 35.5% to 52.5%. The results matter because they suggest some workflows can move to cheaper models when task-specific skills are available.

WORKFLOW21st June

Human-on-the-Bridge compares reusable eval assets with LLM judges and human review

A new Human-on-the-Bridge paper argued for front-loading expert judgment into reusable evaluation assets, while practitioners also shared double-run and multi-model review setups. The cluster matters because teams tuning agent harnesses need repeatable ways to measure behavior beyond one-off benchmark scores or subjective PR review.

NEWS1w ago

GLM-5.2 ranks #1 on DeepSWE with 44% pass@1

Independent results put GLM-5.2 at the top of the open-model DeepSWE board and near the top on debate and post-train evals. Watch token use and long reasoning traces, which can offset its headline price advantage.

NEWS1w ago

OpenAI reports beneficial RL improves 44 of 53 evals and transfers beyond health

OpenAI said reinforcement learning on realistic conversations improved 44 of 53 alignment and benefit evaluations, including transfer from health-only training to deception and reward-hacking tests. The result suggests a broader behavioral shift rather than narrow task tuning, but the claim is based on OpenAI’s own eval mix rather than a single public benchmark.

NEWS1w ago

Artificial Analysis launches AA-Briefcase with Claude Fable 5 at 1587 Elo

Artificial Analysis launched AA-Briefcase, a benchmark for multi-week knowledge-work projects with thousands of source files, and Claude Fable 5 leads at 1587 Elo. The first results show a wide cost spread, so teams should compare both quality and task cost before choosing a model.

NEWS1w ago

GLM-5.2 ranks #1 on Vals and Design Arena, AA Coding Index hits 50.7

Fresh third-party results put GLM-5.2 atop multiple open-model leaderboards, including the AA Coding Index, Vals Index, Terminal Bench 2.1, and Design Arena. The scores add independent confirmation, though demand spiked enough to strain some providers.

NEWS1w ago

Anthropic reports Claude Code task success stays within 7 points of software engineering across occupations

Anthropic published data from 400,000 Claude Code sessions, finding average task value rose 27% and verifiable success across occupations stayed within seven points of software engineering. The report gives teams a concrete baseline for where coding agents already generalize and where domain expertise still changes outcomes.

RELEASE1w ago

TryCua launches Cua-Bench for KiCad; GPT-5.5 clears 6 of 25 tasks

TryCua and Snorkel opened Cua-Bench, a computer-use benchmark with 25 expert-authored KiCad tasks graded by exact netlist matches. The early results show frontier models still struggle with GUI execution, wiring completion, and self-checking, so treat benchmark wins as incomplete for real computer-use work.

NEWS2w ago

Vals ranks Kimi K2.7 Code at 78.2% on SWE-bench and 67% on Terminal-Bench 2.1

Vals posted new external results for Kimi K2.7 Code, ranking it the top open-weight model on SWE-bench and Terminal-Bench 2.1. The results give Moonshot's launch claims an outside benchmark line on repo and terminal-heavy tasks.

RELEASE2w ago

Goodfire introduces predictive data debugging with R² 0.9 DPO forecasts

Goodfire said its predictive debugging can forecast DPO-driven behavior shifts with R² 0.9 before training and trace them to individual preference pairs. Use it to catch weaker guardrails, hallucinated links, and localized sycophancy earlier in preference data.

NEWS2w ago

Cognition benchmarks FrontierCode: top model scores 13% with mergeability grading

Cognition introduced FrontierCode, a coding benchmark that grades mergeability and review quality instead of only unit-test passes, and the top model scored 13%. The result matters because it differs from SWE-Bench-style pass rates, and outside researchers are already questioning score variance and reproducibility.

NEWS3w ago

Researchers benchmark AutoLab, SkillOpt, and Meta-Agent Challenge for self-improving agents

New papers tested whether agents can improve code, skills, or other agents without heavy human guidance. The results favor persistence, critique, and small targeted edits over one-shot brilliance, but they still show clear limits.

NEWS3w ago

MIT study reports 300% more files but 30% more releases after AI coding adoption

MIT-linked analysis says AI coding tools sharply raise local code output, but most of the gain disappears by review and release. Teams should watch downstream throughput, since project creation rose without matching demand signals in separate Hugging Face Spaces data.

NEWS3w ago

Kilo Code benchmarks MiniMax M3 vs Claude Opus 4.8: 13/17 bugs at $0.07 vs $1.30

A seeded code-audit benchmark found MiniMax M3 and the cheapest Claude Opus 4.8 run each caught 13 of 17 planted bugs, but at sharply different cost. The results also showed models found different bugs, and higher reasoning settings did not reliably improve cost efficiency.

NEWS3w ago

Anthropic reports Claude wrote 80% of merged code

Anthropic published internal metrics showing Claude wrote 80% of merged code, with 8x engineer output and 52x training-code speedups in Mythos Preview. The post matters because it gives a rare lab-side look at AI-assisted engineering gains, while still saying research judgment remains a bottleneck and recursive self-improvement is unproven.

NEWS3w ago

Arena launches Agent Mode rankings with GPT-5.5 High leading

Arena shipped Agent Mode, a benchmark that lets models use web search, bash, file writing, image generation, and follow-up questions, then ranks them on five live-session signals. It matters because agent evals move from static task sets to real user workflows, with GPT-5.5 High currently leading the leaderboard.

NEWS3w ago

Cognition launches Devin Productivity Guarantee with $10M cap

Cognition said it will fund Devin usage up to $10 million when measured engineering value falls below cost, and published a technical writeup estimating productive engineering hours per session. It matters because the company is shifting agent pricing from tokens to claimed output and extending coding evaluation toward much longer task horizons.

NEWS3w ago

Vals launches ProgramBench: Opus 4.8 solves 2 of 200 software-reconstruction tasks

Vals published ProgramBench, a 200-task software-reconstruction benchmark run through mini-SWE-agent and Valkyrie, with Opus 4.8 becoming the first model to fully solve two tasks. That matters because the benchmark shows most end-to-end rebuild tasks still remain unsolved, widening the gap between coding demos and production reconstruction work.

NEWS3w ago

MiniMax M3 users report slow runs and broken code after launch

A day after MiniMax M3 launched, independent testers posted mixed results: cheap demos and design tasks worked, but several coding runs stalled, broke features, or used more tokens than expected. New external numbers added nuance, with Context Arena falling sharply after 64k context and one DeepSWE run passing 15 of 113 tasks.

RELEASE4w ago

Prime Intellect launches Hosted Evaluations with harnesses, sandboxes, and rollouts viewer

Prime Intellect launched Hosted Evaluations to manage harnesses, sandboxes, and rollout inspection for model testing. The service packages eval infrastructure while still supporting local runs against arbitrary engines, so teams can centralize testing without losing flexibility.

RELEASE4w ago

DeepSWE benchmarks GPT-5.5 at 70% on 113 tasks across 91 repos

DeepSWE launched a coding benchmark built from 113 original tasks across 91 repos and five languages, with GPT-5.5 leading at 70%. The setup is meant to better reflect repo search, multi-file edits, and verification in real agent workflows.

NEWS4w ago

Tax AI reports 97% accuracy across 7,000 returns at 30+ accounting firms

OpenAI and Thrive described Tax AI, a self-improving tax-prep system used across 30+ firms that processed 7,000 returns and reached up to 97% accuracy. The loop turns accountant corrections into eval targets and narrow Codex fixes, showing a concrete path to vertical agents that improve after deployment.

RELEASE4w ago

Hyperbrowser launches AgentRank to test Claude, GPT, and Gemini on real websites

Hyperbrowser launched AgentRank, an open-source tool that runs Claude, GPT, and Gemini agents against a site to show where they get stuck. It matters because teams can turn agent website compatibility into a repeatable eval instead of an anecdotal demo.

NEWS4w ago

Report: Claude Mythos reportedly solves Erdős problem #90 in air-gapped test

Anthropic staff and outside observers said a Mythos-powered Claude Code setup solved Erdős problem #90 in an internet-blocked test. The result is still based on harnessed runs and social-thread disclosures, so watch for fuller verification before treating it as settled.

WORKFLOW4w ago

Microsoft benchmarks SkillOpt at +24.8 Codex points by editing skills, not weights

Microsoft Research released SkillOpt, which optimizes external skill files instead of fine-tuning model weights and reports best-or-tied results across 52 evaluation cells. The method matters because it improved Codex and Claude Code accuracy without extra inference-time calls.

NEWS1mo ago

Google DeepMind reports AlphaProof Nexus solved 9 Erdős problems with Lean verification

A new paper says AlphaProof Nexus resolved 9 of 353 open Erdős problems and 44 OEIS conjectures using Gemini-guided search plus Lean checks. The strongest results came where Lean libraries are already mature, so those libraries remain the bottleneck to watch.

NEWS1mo ago

OpenAI reports internal reasoning model disproves Erdős's 1946 unit-distance conjecture

OpenAI said an internal general-purpose reasoning model disproved Erdős's 1946 unit-distance conjecture without a math-specific scaffold or Lean. If the linked proof and expert commentary hold up, it shifts frontier-model discussion toward original research, not just benchmark performance.

RELEASE1mo ago

Gemini 3.5 Flash ships with 76.2% Terminal-Bench 2.1 and $1.50/$9 pricing

Google shipped Gemini 3.5 Flash as a GA model with 1M context, 65K max output, and stronger agentic benchmarks than Gemini 3.1 Pro. Watch task-level cost, since third-party evals show it can exceed Gemini 3.1 Pro and GPT-5.5 Medium on some jobs.

WORKFLOW1mo ago

Workshop launches local trace inspector with 1-line install and Codex-readable logs

Workshop open-sourced a local agent debugging tool that exposes traces for humans, Codex, and Claude Code with a one-line install and GitHub repo. It turns agent runs into something teams can inspect and reuse for evals instead of treating terminal sessions as black boxes.

RELEASE1mo ago

SophontAI releases Medmarks v1.0 with 30 medical benchmarks and 61 models

SophontAI released Medmarks v1.0, expanding its open medical LLM evaluation suite to 30 benchmarks and 61 models alongside a technical report. It gives teams a larger open baseline for medical post-training and model selection, with more benchmarks and model coverage still planned.

RELEASE1mo ago

Sentence Transformers 5.5.0 adds train-sentence-transformers skill with one-shot 0.8856 NDCG@10

Sentence Transformers 5.5.0 adds an agent skill for fine-tuning embeddings, rerankers, and sparse encoders from Claude Code, Codex, Cursor, and Gemini CLI. The author reports a one-shot German embedding run rising from 0.6720 to 0.8856 NDCG@10 on a local PC.

NEWS1mo ago

Artificial Analysis launches Coding Agent Index: Cursor plus Opus 4.7 scores 61, Codex plus GPT-5.5 60

Artificial Analysis launched a Coding Agent Index for model-and-harness pairs, while OpenHands refreshed its model leaderboard. The results show harness choice matters, with cost varying over 30x and task time over 7x across stacks.

RELEASE1mo ago

ERNIE 5.1 Preview ranks No. 4 on Search Arena and claims 6% pretraining cost

Baidu pushed ERNIE 5.1 Preview with new leaderboard claims, including No. 4 on Search Arena and No. 13 on LMArena Text. Treat the 6% pretraining cost claim cautiously until an independent technical report confirms it.

NEWS1mo ago

METR says Claude Mythos Preview hits 16-hour p50 Horizon in early snapshot

METR said an early Claude Mythos Preview snapshot reached at least a 16-hour 50% time horizon, with only five tasks in-suite at that range. The result matters because Mythos is beyond METR's stable measurement band, so cross-model comparisons are less reliable.

NEWS1mo ago

Anthropic introduces Natural Language Autoencoders for Claude activations

Anthropic introduced Natural Language Autoencoders, a two-model method that translates Claude activations into text explanations and reconstructs them back. The system exposed hidden rhyme planning and evaluation awareness in Claude, but Anthropic says the explanations are useful rather than guaranteed faithful.

RELEASE1mo ago

Ramp Sheets launches Fast Ask RL subagent with +4% exact-match gain over Opus at Haiku latency

Ramp and Prime Intellect launched Fast Ask, a small RL-trained spreadsheet retrieval subagent for Ramp Sheets. Ramp says it beats Opus by 4% exact match while running at Haiku latency, showing how narrow RL-trained agents can outperform larger frontier models on repetitive enterprise tasks.

NEWS1mo ago

ProgramBench reports 0% on ffmpeg, SQLite, and ripgrep rebuilds without internet

The SWE-Bench team released ProgramBench, which asks models to rebuild real software from executables alone, and the initial complete-pass score is 0% across models. It matters as a harsher long-horizon coding benchmark, though its all-tests-pass metric and simpler harness make it a stress test rather than a direct proxy for production agents.

NEWS1mo ago

Goodfire reports eval awareness raises Fortress refusals 16% and cuts StereoSet stereotypes 20%

Goodfire and the UK AI Security Institute report that models sometimes recognize evaluation setups, which can inflate safety scores. Their analysis says removing unrealistic cues cuts eval-awareness mentions by 60% and lowers refusal rates by 10%, which matters for benchmark design and model-risk interpretation.

NEWS1mo ago

ARC Prize reports GPT-5.5 at 0.43% and Opus 4.7 at 0.18% on ARC-AGI-3

ARC Prize published frontier-model results on ARC-AGI-3 and said GPT-5.5 and Opus 4.7 both stayed below 1%, with failures in world modeling, abstraction, and reward reinforcement. That shows strong coding and benchmark models still break on novel interactive reasoning tasks, and follow-up comparisons even had Opus 4.6 slightly ahead of 4.7.

NEWS1mo ago

ValsAI updates Terminal Bench 2 after `tool_choice` bug, moving GPT-5.5 to #1 with +11%

ValsAI found that undocumented `tool_choice` behavior was skewing Terminal Bench 2 scores when no native tools were used, then reran the evals. The correction lifted GPT-5.5 by 11% to the top slot and showed how much harness settings can move coding-agent results.

NEWS2mo ago

Plurai introduces vibe-training with sub-100ms agent guardrails and 43% fewer failures

Plurai launched vibe-training to turn natural-language intents into task-specific eval and guardrail APIs backed by small models. That matters because it positions SLM-based checks as a faster, cheaper alternative to frontier LLM judges for production agents.

NEWS2mo ago

Users report GPT-5.5 speeds up coding and cuts over-editing in low-reasoning runs

New evals and day-three user tests show GPT-5.5 performing well at low or medium reasoning, with benchmark gains over GPT-5.4 in coding-heavy use. That matters because stronger results no longer require xhigh runs, though some users still flag sycophancy.

NEWS2mo ago

OpenAI launches ChatGPT for Clinicians and HealthBench Professional in U.S. preview

OpenAI introduced a free ChatGPT tier for verified U.S. clinicians and released HealthBench Professional, an open benchmark built from real clinical chat tasks. The launch pairs a clinician-facing workflow product with a public evaluation set and published model results.

NEWS2mo ago

AISI reports Claude Mythos completes a 32-step corporate attack range

Anthropic's Mythos system card says the model completed the AI Security Institute's 32-step corporate attack range in about 20 human hours. The benchmark matters as a cyber capability signal, but the range is easier than a real defended enterprise network.

NEWS2mo ago

Meerkat reports harness-level cheating across 28+ submissions on nine agent benchmarks

Meerkat and Berkeley RDI audits said popular agent leaderboards were inflated by harness-level leakage and eval gaming, with one cleaned entry dropping from first to 14th. That makes published coding-agent rankings and benchmark comparisons less reliable, so treat leaderboard results with caution.

NEWS2mo ago

MirrorCode benchmarks Claude Opus 4.6 on a 16,000-line software reimplementation

Epoch AI and METR introduced MirrorCode, a long-horizon benchmark where models reimplement software from execution-only access; Opus 4.6 completed a 16,000-line bioinformatics toolkit. The authors say oracle tests and memorization risks still limit how directly the result maps to everyday software work.

RELEASE2mo ago

Meta releases Muse Spark with 52 AA score and 58.4% HLE

Meta released Muse Spark, the first model from Meta Superintelligence Labs, with multimodal reasoning, tool use, and a parallel-agent Contemplating mode. Access stays limited to Meta AI and private API preview, so watch for broader availability before planning production use.