AI Primer
TOPIC
50 stories

Benchmarks

Benchmark suites, leaderboard caveats, and task measurement.

RELEASE 27th June
DeepSeek V4-Pro benchmarks at ~90 tok/s after DSpark rollout

Independent measurements after DSpark put DeepSeek V4-Pro around 90 tok/s and cut one run from 214s to 116s. The gain matters because it lowers serving cost, though tuning details and memory overhead are still unclear.
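
As a rough check on what that single run implies, the two timings above can be turned into a speedup factor and a share of wall-clock time saved. This is only a sketch built from the 214s and 116s figures; the variable names are illustrative, and one run says nothing about variance across workloads.

```python
# Timings taken from the summary above; nothing else is assumed.
before_s = 214.0  # run time before the DSpark rollout, in seconds
after_s = 116.0   # run time after the rollout, in seconds

# How many times faster the run finished.
speedup = before_s / after_s

# Share of the original wall-clock time that was removed, in percent.
time_saved_pct = (before_s - after_s) / before_s * 100

print(f"speedup: {speedup:.2f}x, time saved: {time_saved_pct:.1f}%")
```

On these numbers the run is about 1.84x faster, which removes roughly 45.8% of the wall-clock time.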

RELEASE 27th June
Datalab scores 95.9% on a 225-document extraction benchmark at under half Reducto's price

Datalab’s balanced extraction mode scored 95.9% on a 225-document benchmark and beat Reducto Deep Extract’s 95.1%, according to Vik Paruchuri. The update also adds citations and reasoning, but the benchmark and price comparison are vendor-reported.

NEWS 27th June
GLM-5.2 scores 30/99 on PrinzBench as testers report legal hallucinations

PrinzBench added GLM-5.2 and scored it 30/99 for legal research, while a separate LisanBench run placed GLM-5.2-high at #29 and noted high token use. The result matters because it cuts against code-centric GLM hype and points to weak search, statute fidelity, and reasoning on professional legal tasks.

NEWS 26th June
Chandra reports Mistral OCR 4 scores are not reproducible and publishes repro scripts

Chandra's developer said the scores that Mistral's OCR 4 launch reported for both Chandra and OCR 4 could not be reproduced with public code, and published scripts to show the gaps. The dispute matters because Mistral OCR 4 launched on leaderboard claims, and benchmark settings now directly affect model selection.

RELEASE 26th June
Epoch releases MirrorCode with 25 long-horizon SWE tasks and a 56% score

Epoch introduced MirrorCode, a benchmark where models reimplement real programs from specs with no internet and hidden held-out tests; the best current score is 56%. The setup matters because it scales inference into multi-day runs and targets software jobs estimated to take humans weeks.

RELEASE 25th June
OpenRouter launches MCP server with live pricing, benchmarks, and test inference

OpenRouter released an MCP server that lets agents query live model pricing, benchmark scores, provider data, and docs, and run test inference from the CLI. That replaces stale model knowledge with current routing data inside long-running agent workflows.

RELEASE 25th June
DeepReinforce releases Ornith-1.0 397B MoE with 82.4 SWE-Bench Verified

DeepReinforce released Ornith-1.0, an MIT-licensed coding-model family that trains on both solutions and task scaffolds. The flagship 397B MoE claims 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, pushing open coding models closer to closed frontier systems.

NEWS 25th June
Cursor reports SWE-bench Pro benchmark hacking; Opus 4.8 drops 87.1%→73.0% under stricter harness

Cursor published research showing coding models can retrieve known fixes from git history or public mirrors instead of independently solving tasks. Under a stricter harness, Opus 4.8 fell from 87.1% to 73.0% and Composer 2.5 from 70.5% to 60.5%.
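
To put the two drops on the same footing, the scores above can be expressed both as percentage-point losses and as relative losses. The sketch below uses only the four figures quoted in the summary; the helper name `harness_drop` is hypothetical and not part of Cursor's research.

```python
def harness_drop(loose: float, strict: float) -> tuple[float, float]:
    """Return (percentage-point drop, relative drop in percent)."""
    points = loose - strict
    relative = points / loose * 100
    return points, relative


# (score under the original harness, score under the stricter harness)
scores = {
    "Opus 4.8": (87.1, 73.0),
    "Composer 2.5": (70.5, 60.5),
}

for model, (loose, strict) in scores.items():
    points, relative = harness_drop(loose, strict)
    print(f"{model}: -{points:.1f} points ({relative:.1f}% relative)")
```

Read this way, Opus 4.8 loses 14.1 points (about 16.2% of its original score) and Composer 2.5 loses 10.0 points (about 14.2%), so the relative hit is closer between the two models than the raw point gaps suggest.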

RELEASE 24th June
Baidu releases Unlimited OCR with 3B params for single-pass long documents

Baidu released Unlimited OCR as an open-source long-document OCR model with 3B total parameters and 500M active at inference. Early ParseBench testing says it is strong on tables and reading order but weaker on semantic formatting and charts, giving teams a new open-weight OCR option with clear tradeoffs.

NEWS 22nd June
GLM-5.2 adds Perplexity Agent API and Droid support on Baseten at >280 TPS

GLM-5.2 gained Perplexity Agent API and Droid support plus more hosting options, while Baseten reported over 280 TPS and sub-0.8s TTFT. Builders should watch the cost and benchmark data as the model moves into production agent stacks.

NEWS 22nd June
Fugu Ultra testers report 30-minute runs and 17x GLM cost after launch

Sakana launched Fugu Ultra on AI Gateway and published a technical report, with early testers sharing mixed results. Reports mention polished outputs on some tasks, but also 30-minute runs, uneven coding quality, and much higher cost than GLM-5.2.

NEWS 22nd June
Vals AI releases SkillsBench with a 17-point coding-agent gain and MiniMax-M3 at +25.4

Vals AI launched SkillsBench, a public benchmark for measuring how reusable skills change coding-agent performance, and reported average accuracy rising from 35.5% to 52.5%. The results matter because they suggest some workflows can move to cheaper models when task-specific skills are available.

NEWS 1w ago
GLM-5.2 ranks #1 on DeepSWE with 44% pass@1

Independent results put GLM-5.2 at the top of the open-model DeepSWE board and near the top on debate and post-train evals. Watch token use and long reasoning traces, which can offset its headline price advantage.

NEWS 1w ago
Engineers report GLM-5.2 delivers near-Opus planning at about 1/10 the price

Independent tests put GLM-5.2 near Opus 4.8 and GPT-5.5 on planning and coding, and users shared Claude Code, BrowserCode, dcode, and local-serving recipes. It matters because many engineers are treating it as a daily-driver option for text-heavy coding, though teams still report weaker vision and provider limits.

NEWS 1w ago
ComputeSDK releases 2026 100k Scale Invitational results across 6 sandbox providers

ComputeSDK published results from its 2026 100k Scale Invitational after weeks of reruns and infra tuning across Modal, Tensorlake, Northflank, Declaw AI, E2B, and Isorun. It matters because sandbox and agent infra claims now have a shared public concurrency target instead of vendor-specific load demos.

RELEASE 1w ago
lift-pdf releases 9B extractor with 90.2% accuracy and 9.5s p50

lift-pdf released an open-source 9B model for schema-constrained document extraction, with code, pip install, playground access, and a 90.2% score on the team's 225-document bench. It matters because the model claims near-Gemini 3.5 Flash accuracy at 9.5s p50, though coverage is still skewed toward Latin-language docs and commercial-use limits remain.

RELEASE 1w ago
Kilo Code adds Terminal Bench scores and average attempt cost to model picker

Kilo Code now shows Terminal Bench completion rate and average attempt cost directly in model details inside its CLI and VS Code extension. It matters because the numbers come from Kilo's own harness and retry logic rather than public leaderboard scaffolds.

RELEASE 1w ago
Poolside releases Laguna M.1 open weights with 225B MoE and 256K context

Poolside released Apache 2.0 weights for Laguna M.1 and XS.2, its long-horizon coding models, with M.1 shipping at 225B total parameters, 23B active, and 256K context. SGLang and vLLM support on day one lets teams run and fine-tune the models in existing agent stacks immediately.

NEWS 1w ago
Artificial Analysis launches AA-Briefcase with Claude Fable 5 at 1587 Elo

Artificial Analysis launched AA-Briefcase, a benchmark for multi-week knowledge-work projects with thousands of source files, and Claude Fable 5 leads at 1587 Elo. The first results show a wide cost spread, so teams should compare both quality and task cost before choosing a model.

NEWS 1w ago
GLM-5.2 ranks #1 on Vals and Design Arena, AA Coding Index hits 50.7

Fresh third-party results put GLM-5.2 atop multiple open-model leaderboards, including the AA Coding Index, Vals Index, Terminal Bench 2.1, and Design Arena. The scores add independent confirmation, though demand spiked enough to strain some providers.

RELEASE 1w ago
Z.ai releases GLM-5.2 open weights with 1M context and 46.2% DeepSWE

Z.ai released GLM-5.2 MIT-licensed open weights with 1M context and broad runtime support. Vendor and arena results put it near frontier closed models on long-horizon coding.

RELEASE 1w ago
Moonshot releases Kimi K2.7 Code HighSpeed at 180 tok/s with 2x API pricing

Moonshot rolled out HighSpeed for Kimi K2.7 Code, claiming about 180 tok/s on coding tasks, up to 260 tok/s on shorter contexts, and roughly 6x speedups. Watch the tight capacity limits and mixed benchmark results, and budget for the 2x pricing if you want the faster mode.

RELEASE 1w ago
TryCua launches Cua-Bench for KiCad; GPT-5.5 clears 6 of 25 tasks

TryCua and Snorkel opened Cua-Bench, a computer-use benchmark with 25 expert-authored KiCad tasks graded by exact netlist matches. The early results show frontier models still struggle with GUI execution, wiring completion, and self-checking, so treat benchmark wins as incomplete for real computer-use work.

RELEASE 2w ago
OpenRouter launches Fusion API with model panels and judge routing

OpenRouter launched Fusion, a server-side panel API that sends prompts to multiple models and combines their outputs into one answer. Early logs also showed a web-path issue where Fusion still invoked Claude Opus 4.8 as judge and billed for it until API-side control was clarified.

RELEASE 2w ago
GLM-5.2 ranks #1 on BridgeBench Reasoning at 42.8

GLM-5.2 opened to GLM Coding Plan users and posters claimed #1 BridgeBench scores in BS and Reasoning, with one post citing 1/10th the cost and 300 tokens per second. Early frontend tests still found a gap to Fable 5 and Opus on finer visual details.

NEWS 2w ago
Fable users compare GLM-5.2, GPT-5.5, and model panels on one-shot UI work

Two days after Fable 5 went offline, developers started testing GLM-5.2, GPT-5.5, and multi-model panels against the kinds of one-shot frontend and greenfield builds Fable handled well. The early pattern is that replacements cover much of the work, but Fable still leads on UI taste and first-pass product completion.

NEWS 2w ago
Together AI ranks DeepSeek V4 Pro #1 on Artificial Analysis latency and speed

Together AI said its DeepSeek V4 Pro deployment now leads Artificial Analysis on both output speed and latency. The claim matters because it turns V4 serving into an inference-systems story about KV cache reuse, prefix reuse, kernels, and endpoint profiles rather than model weights alone.

RELEASE 2w ago
Z.ai releases GLM-5.2 for Coding Plan users with 1M context and Max mode

Z.ai made GLM-5.2 available to GLM Coding Plan users with High and Max thinking modes, 1M context, and promised API plus MIT open source next week. Early testers reported higher plan pricing, heavy rate limits, and mixed build quality versus Opus and Fable.

NEWS 2w ago
Vals ranks Kimi K2.7 Code at 78.2% on SWE-bench and 67% on Terminal-Bench 2.1

Vals posted new external results for Kimi K2.7 Code, ranking it the top open-weight model on SWE-bench and Terminal-Bench 2.1. The results give Moonshot's launch claims an outside benchmark line on repo and terminal-heavy tasks.

RELEASE 2w ago
Moonshot releases Kimi K2.7 Code: +21.8% on Kimi Code Bench v2, 30% fewer reasoning tokens

Moonshot open-sourced Kimi K2.7 Code and says it outperforms K2.6 by 21.8% on Kimi Code Bench v2 while using 30% fewer reasoning tokens. The release includes open weights and API access, so teams can test the 180 tok/s HighSpeed rollout and early Cline/OpenCode support.

RELEASE 2w ago
Anthropic launches Claude Fable 5 with Opus fallback and $10/$50 MTok pricing

Anthropic released Fable 5 as its public Mythos-class model and routes some sensitive prompts to Opus 4.8. Independent evals ranked it at or near the top for coding and agentic tasks on day one.

RELEASE 2w ago
Cohere releases North Mini Code: 30B MoE, 3B active, 256K context

Cohere open-sourced North Mini Code, a 30B-parameter coding MoE with 3B active parameters, 256K context, and Apache 2.0 licensing. OpenCode added it the same day, making the release immediately usable in a coding-agent client.

NEWS 2w ago
Cognition benchmarks FrontierCode: top model scores 13% with mergeability grading

Cognition introduced FrontierCode, a coding benchmark that grades mergeability and review quality instead of only unit-test passes, and the top model scored 13%. The result matters because it differs from SWE-Bench-style pass rates, and outside researchers are already questioning score variance and reproducibility.

NEWS 3w ago
MIT study reports 300% more files but only 30% more releases after AI coding adoption

MIT-linked analysis says AI coding tools sharply raise local code output, but most of the gain disappears by review and release. Teams should watch downstream throughput, since project creation rose without matching demand signals in separate Hugging Face Spaces data.

NEWS 3w ago
Researchers benchmark AutoLab, SkillOpt, and Meta-Agent Challenge for self-improving agents

New papers tested whether agents can improve code, skills, or other agents without heavy human guidance. The results favor persistence, critique, and small targeted edits over one-shot brilliance, but they still show clear limits.

NEWS 3w ago
Framework Max+ 395 benchmarks close to M5 Max on Qwen3-TTS with GGML Vulkan

A local benchmark on a 128GB Framework system reported Qwen3-TTS performance close to an M5 Max using a GGML Vulkan backend. The result suggests AMD Strix hardware can approach Apple-class local TTS speed without MLX or Metal.

NEWS 3w ago
Kilo Code benchmarks MiniMax M3 vs Claude Opus 4.8: 13/17 bugs at $0.07 vs $1.30

A seeded code-audit benchmark found MiniMax M3 and the cheapest Claude Opus 4.8 run each caught 13 of 17 planted bugs, but at sharply different cost. The results also showed models found different bugs, and higher reasoning settings did not reliably improve cost efficiency.
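
The headline figures can be reduced to a recall rate and a cost per bug found. The sketch below assumes that $0.07 and $1.30 are the total costs of the two runs being compared, which the summary does not spell out; the dictionary keys and variable names are illustrative only.

```python
bugs_planted = 17
bugs_found = 13  # both runs caught 13 of the 17 planted bugs

# Assumption: these are total per-run costs in USD, as quoted in the headline.
cost_usd = {
    "MiniMax M3": 0.07,
    "Claude Opus 4.8 (cheapest run)": 1.30,
}

# Share of planted bugs each run caught (identical for both runs).
recall = bugs_found / bugs_planted

# Cost per bug found, and how many times more the Opus run cost overall.
cost_per_bug = {name: cost / bugs_found for name, cost in cost_usd.items()}
cost_ratio = cost_usd["Claude Opus 4.8 (cheapest run)"] / cost_usd["MiniMax M3"]

print(f"recall: {recall:.1%}")
for name, per_bug in cost_per_bug.items():
    print(f"{name}: ${per_bug:.4f} per bug found")
print(f"cost ratio: {cost_ratio:.1f}x")
```

Under that assumption both runs reach about 76.5% recall, while the Opus run costs roughly 18.6 times as much. Because the models found different bugs, equal recall does not mean the two runs are interchangeable.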

NEWS 3w ago
Anthropic reports Claude wrote 80% of merged code

Anthropic published internal metrics showing Claude wrote 80% of merged code, with 8x engineer output and 52x training-code speedups in Mythos Preview. The post matters because it gives a rare lab-side look at AI-assisted engineering gains, while still saying research judgment remains a bottleneck and recursive self-improvement is unproven.

RELEASE 3w ago
NVIDIA releases Nemotron 3 Ultra: 550B MoE, 1M context

NVIDIA shipped Nemotron 3 Ultra, a 550B/55B-active hybrid Mamba-Transformer MoE with open weights, data, and recipe, plus broad runtime and host support. It matters because the model pairs frontier open benchmarks with immediate agent-serving options, though local use still needs heavy quantization or large-memory hardware.

NEWS 3w ago
Arena launches Agent Mode rankings with GPT-5.5 High leading

Arena shipped Agent Mode, a benchmark that lets models use web search, bash, file writing, image generation, and follow-up questions, then ranks them on five live-session signals. It matters because agent evals move from static task sets to real user workflows, with GPT-5.5 High currently leading the leaderboard.

RELEASE 3w ago
Ideogram 4.0 releases 9.3B open weights with 2K output and non-commercial license

Ideogram released 4.0 as open weights with 2K output, layout control, and strong text rendering, with rollout to ComfyUI, fal, and Hugging Face. Teams can download the design-focused model, but they should check the non-commercial license before using it in production.

NEWS 3w ago
Hyper, OpenCode, Kilo, and Vals add Qwen 3.7 Plus support within 72 hours

Two days after Qwen 3.7 Plus launched, Hyper, OpenCode, Kilo, and Vals shipped support or rankings around the 1M-context multimodal model. The rapid pickup shows Alibaba’s new model landing quickly in coding-agent tools and public eval stacks outside its own platform.

RELEASE 3w ago
Microsoft launches MAI-Thinking-1 and six companion models with 97.0% AIME 2025

Microsoft introduced MAI-Thinking-1, MAI-Code-1-Flash, and five other MAI models across code, image, voice, and speech. The launch puts Microsoft back into the frontier-model race and starts landing pieces of the stack in Copilot and partner runtimes.

NEWS 3w ago
Vals launches ProgramBench: Opus 4.8 solves 2 of 200 software-reconstruction tasks

Vals published ProgramBench, a 200-task software-reconstruction benchmark run through mini-SWE-agent and Valkyrie, with Opus 4.8 becoming the first model to fully solve two tasks. That matters because the benchmark shows most end-to-end rebuild tasks remain unsolved, underscoring the gap between coding demos and production reconstruction work.

RELEASE 3w ago
NVIDIA launches Cosmos 3 open 16B and 64B omnimodels with datasets and SGLang support

NVIDIA released Cosmos 3 as an open omnimodel family with 16B and 64B variants, plus code, datasets, and a coalition around physical AI. The release matters because it ships with serving support and top open-weight image and video rankings, so teams can use it beyond a research teaser.

NEWS 3w ago
MiniMax M3 users report slow runs and broken code after launch

A day after MiniMax M3 launched, independent testers posted mixed results: cheap demos and design tasks worked, but several coding runs stalled, broke features, or used more tokens than expected. New external numbers added nuance, with Context Arena falling sharply after 64k context and one DeepSWE run passing 15 of 113 tasks.

NEWS 3w ago
NVIDIA claims Nemotron 3 Ultra 550B runs 5x faster and 30% cheaper

NVIDIA teased Nemotron 3 Ultra as a 550B open-weight model due later this week, with early messaging centered on 5x faster and 30% cheaper inference plus a hybrid SSM-MoE design. The rollout matters because early benchmark posts already place it near the top of open-weight leaderboards, widening NVIDIA’s open-model push beyond Cosmos.

NEWS 4w ago
Opus 4.8 users report token burn, failed tool calls, and DeepSWE gaps

Three days after Opus 4.8 launched, new tests and field reports added failed tool calls, Bash-specific breakdowns, and higher token burn to the complaint list. Users report materially worse cost and stability in long coding sessions, while DeepSWE and GBA Eval point in different directions.

NEWS 4w ago
Developers report Codex beats Claude Code on DeepSWE, token burn, and multi-hour /goal sessions

Independent users compared GPT-5.5/Codex with Opus 4.8/Claude Code using DeepSWE cost charts, GBA Eval runs, and long coding sessions. The split matters because engineers choosing a daily coding stack now have external quality-versus-cost evidence instead of only vendor launch claims.

NEWS 4w ago
Grok Imagine Video 1.5 adds fal and Venice API access after xAI rollout

Grok Imagine Video 1.5 moved from arena ranking to usable APIs, with xAI docs live and third-party access on fal and Venice. That matters because developers can now script against the model through standard providers, though early #1 arena claims are already being challenged by side-by-side testers.

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.