Benchmarks
Benchmark suites, leaderboard caveats, and task measurement.
Stories
Independent measurements after DSpark put DeepSeek V4-Pro around 90 tok/s and cut one run from 214s to 116s. The gain matters because it lowers serving cost, though tuning details and memory overhead are still unclear.
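For readers who want to put the 214s and 116s figures on a common scale, the arithmetic can be sketched as below. This is only an illustration using the two wall-clock times quoted above; the function names are made up for this example and are not part of DSpark or any DeepSeek tooling.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Ratio of old wall-clock time to new wall-clock time."""
    return before_s / after_s


def time_saved_pct(before_s: float, after_s: float) -> float:
    """Share of the original wall-clock time that was eliminated, in percent."""
    return 100.0 * (before_s - after_s) / before_s


# The only inputs are the two run times reported in the story above.
before_s, after_s = 214.0, 116.0
print(f"speedup: {speedup(before_s, after_s):.2f}x")           # speedup: 1.84x
print(f"time saved: {time_saved_pct(before_s, after_s):.1f}%")  # time saved: 45.8%
```

A single run says nothing about variance, so these ratios describe that one measurement rather than a general speedup.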
Datalab’s balanced extraction mode scored 95.9% on a 225-document benchmark and beat Reducto Deep Extract’s 95.1%, according to Vik Paruchuri. The update also adds citations and reasoning, but the benchmark and price comparison are vendor-reported.
PrinzBench added GLM-5.2 and scored it 30/99 for legal research, while a separate LisanBench run placed GLM-5.2-high at #29 and noted high token use. The result matters because it cuts against code-centric GLM hype and points to weak search, statute fidelity, and reasoning on professional legal tasks.
Chandra's developer said Mistral OCR 4 launch numbers for both Chandra and OCR 4 could not be reproduced with public code, and published scripts to show the gaps. The dispute matters because Mistral OCR 4 launched on leaderboard claims, and benchmark settings now directly affect model selection.
Epoch introduced MirrorCode, a benchmark where models reimplement real programs from specs with no internet and hidden held-out tests; the best current score is 56%. The setup matters because it scales inference into multi-day runs and targets software jobs estimated to take humans weeks.
OpenRouter released an MCP server that lets agents query live model pricing, benchmark scores, provider data, docs, and run test inference from the CLI. That replaces stale model knowledge with current routing data inside long-running agent workflows.
DeepReinforce released Ornith-1.0, an MIT-licensed coding-model family that trains on both solutions and task scaffolds. The flagship 397B MoE claims 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, pushing open coding models closer to closed frontier systems.
Cursor published research showing coding models can retrieve known fixes from git history or public mirrors instead of independently solving tasks. Under a stricter harness, Opus 4.8 fell from 87.1% to 73.0% and Composer 2.5 from 70.5% to 60.5%.
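The size of those drops can be read two ways: in absolute percentage points or relative to the original score. A minimal sketch of both, using only the four scores quoted above (the helper name `score_drop` is hypothetical and not taken from Cursor's research):

```python
def score_drop(original_pct: float, strict_pct: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent of the original)."""
    points = original_pct - strict_pct
    relative = 100.0 * points / original_pct
    return points, relative


# Scores as reported in the story: (default harness, stricter harness).
reported = {
    "Opus 4.8": (87.1, 73.0),
    "Composer 2.5": (70.5, 60.5),
}

for model, (original, strict) in reported.items():
    points, relative = score_drop(original, strict)
    print(f"{model}: -{points:.1f} points ({relative:.1f}% relative)")
```

On these figures the absolute drops are about 14.1 and 10.0 points, which works out to roughly 16% and 14% of each model's original score.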
Baidu released Unlimited OCR as an open-source long-document OCR model with 3B total parameters and 500M active at inference. Early ParseBench testing says it is strong on tables and reading order but weaker on semantic formatting and charts, giving teams a new open-weight OCR option with clear tradeoffs.
GLM-5.2 became available through Perplexity Agent API, Droid, and more hosting options, while Baseten reported over 280 TPS and sub-0.8s TTFT. Builders should watch the cost and benchmark data as it moves into production agent stacks.
Sakana launched Fugu Ultra on AI Gateway and published a technical report, with early testers sharing mixed results. Reports mention polished outputs on some tasks, but also 30-minute runs, uneven coding quality, and much higher cost than GLM-5.2.
Vals AI launched SkillsBench, a public benchmark for measuring how reusable skills change coding-agent performance, and reported average accuracy rising from 35.5% to 52.5%. The results matter because they suggest some workflows can move to cheaper models when task-specific skills are available.
Independent results put GLM-5.2 at the top of the open-model DeepSWE board and near the top on debate and post-train evals. Watch token use and long reasoning traces, which can offset its headline price advantage.
Independent tests put GLM-5.2 near Opus 4.8 and GPT-5.5 on planning and coding, and users shared Claude Code, BrowserCode, dcode, and local-serving recipes. It matters because many engineers are treating it as a daily-driver option for text-heavy coding, though teams still report weaker vision and provider limits.
ComputeSDK published results from its 2026 100k Scale Invitational after weeks of reruns and infra tuning across Modal, Tensorlake, Northflank, Declaw AI, E2B, and Isorun. It matters because sandbox and agent infra claims now have a shared public concurrency target instead of vendor-specific load demos.
lift-pdf released an open-source 9B model for schema-constrained document extraction, with code, pip install, playground access, and a 90.2% score on the team's 225-document bench. It matters because the model claims near-Gemini 3.5 Flash accuracy at 9.5s p50, though coverage is still skewed toward Latin-language docs and commercial-use limits remain.
Kilo Code now shows Terminal Bench completion rate and average attempt cost directly in model details inside its CLI and VS Code extension. It matters because the numbers come from Kilo's own harness and retry logic rather than public leaderboard scaffolds.
Poolside released Apache 2.0 weights for Laguna M.1 and XS.2, its long-horizon coding models, with M.1 shipping at 225B total parameters, 23B active, and 256K context. SGLang and vLLM support on day one lets teams run and fine-tune the models in existing agent stacks immediately.
Artificial Analysis launched AA-Briefcase, a benchmark for multi-week knowledge-work projects with thousands of source files, and Claude Fable 5 leads at 1587 Elo. The first results show a wide cost spread, so teams should compare both quality and task cost before choosing a model.
Fresh third-party results put GLM-5.2 atop multiple open-model leaderboards, including the AA Coding Index, Vals Index, Terminal Bench 2.1, and Design Arena. The scores add independent confirmation, though demand spiked enough to strain some providers.
Z.ai released GLM-5.2 MIT-licensed open weights with 1M context and broad runtime support. Vendor and arena results put it near frontier closed models on long-horizon coding.
Moonshot rolled out HighSpeed for Kimi K2.7 Code, claiming about 180 tok/s on coding tasks, up to 260 tok/s on shorter contexts, and roughly 6x speedups. Watch the tight capacity limits and mixed benchmark results, and budget for the 2x pricing if you want the faster mode.
TryCua and Snorkel opened Cua-Bench, a computer-use benchmark with 25 expert-authored KiCad tasks graded by exact netlist matches. The early results show frontier models still struggle with GUI execution, wiring completion, and self-checking, so treat benchmark wins as incomplete for real computer-use work.
OpenRouter launched Fusion, a server-side panel API that sends prompts to multiple models and combines their responses into one answer. Early logs also showed a web-path issue where Fusion still invoked Claude Opus 4.8 as judge and billed for it until API-side control was clarified.
GLM-5.2 opened to GLM Coding Plan users and posters claimed #1 BridgeBench scores in BS and Reasoning, with one post citing 1/10th the cost and 300 tokens per second. Early frontend tests still found a gap to Fable 5 and Opus on finer visual details.
Two days after Fable 5 went offline, developers started testing GLM-5.2, GPT-5.5, and multi-model panels against the kinds of one-shot frontend and greenfield builds Fable handled well. The early pattern is that replacements cover much of the work, but Fable still leads on UI taste and first-pass product completion.
Together AI said its DeepSeek V4 Pro deployment now leads Artificial Analysis on both output speed and latency. The claim matters because it turns V4 serving into an inference-systems story about KV cache reuse, prefix reuse, kernels, and endpoint profiles rather than model weights alone.
Z.ai made GLM-5.2 available to GLM Coding Plan users with High and Max thinking modes, 1M context, and promised API plus MIT open source next week. Early testers reported higher plan pricing, heavy rate limits, and mixed build quality versus Opus and Fable.
Vals posted new external results for Kimi K2.7 Code, ranking it the top open-weight model on SWE-bench and Terminal-Bench 2.1. The results give Moonshot's launch claims an outside benchmark line on repo and terminal-heavy tasks.
Moonshot open-sourced Kimi K2.7 Code and says it outperforms K2.6 by 21.8% on Kimi Code Bench v2 while using 30% fewer reasoning tokens. The release includes open weights and API access, so teams can test the 180 tok/s HighSpeed rollout and early Cline/OpenCode support.
Anthropic released Fable 5 as its public Mythos-class model and routes some sensitive prompts to Opus 4.8. Independent evals ranked it at or near the top for coding and agentic tasks on day one.
Cohere open-sourced North Mini Code, a 30B-parameter coding MoE with 3B active parameters, 256K context, and Apache 2.0 licensing. OpenCode added it the same day, making the release immediately usable in a coding-agent client.
Cognition introduced FrontierCode, a coding benchmark that grades mergeability and review quality instead of only unit-test passes, and the top model scored 13%. The result matters because it differs from SWE-Bench-style pass rates, and outside researchers are already questioning score variance and reproducibility.
MIT-linked analysis says AI coding tools sharply raise local code output, but most of the gain disappears by review and release. Teams should watch downstream throughput, since project creation rose without matching demand signals in separate Hugging Face Spaces data.
New papers tested whether agents can improve code, skills, or other agents without heavy human guidance. The results favor persistence, critique, and small targeted edits over one-shot brilliance, but they still show clear limits.
A local benchmark on a 128GB Framework system reported Qwen3-TTS performance close to an M5 Max using a GGML Vulkan backend. The result suggests AMD Strix hardware can approach Apple-class local TTS speed without MLX or Metal.
A seeded code-audit benchmark found MiniMax M3 and the cheapest Claude Opus 4.8 run each caught 13 of 17 planted bugs, but at sharply different cost. The results also showed models found different bugs, and higher reasoning settings did not reliably improve cost efficiency.
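One way to compare runs with the same catch count but different cost is cost per bug found. The sketch below assumes that framing; the story gives the 13-of-17 figure but no dollar amounts, so the two run costs are placeholder values chosen purely for illustration, and the function names are hypothetical.

```python
def catch_rate_pct(found: int, planted: int) -> float:
    """Share of planted bugs that a run caught, in percent."""
    return 100.0 * found / planted


def cost_per_bug(run_cost_usd: float, found: int) -> float:
    """Average spend per caught bug; undefined if nothing was found."""
    if found == 0:
        raise ValueError("no bugs found, cost per bug is undefined")
    return run_cost_usd / found


FOUND, PLANTED = 13, 17  # from the story above

# PLACEHOLDER costs, not from the benchmark: swap in real run costs.
placeholder_costs_usd = {"cheap_run": 1.00, "expensive_run": 10.00}

print(f"catch rate: {catch_rate_pct(FOUND, PLANTED):.1f}%")
for label, cost in placeholder_costs_usd.items():
    print(f"{label}: ${cost_per_bug(cost, FOUND):.3f} per bug found")
```

Because the story notes that models found different bugs, a per-bug cost on one run understates the value of combining runs; it is a rough comparison, not a ranking.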
Anthropic published internal metrics showing Claude wrote 80% of merged code, with 8x engineer output and 52x training-code speedups in Mythos Preview. The post matters because it gives a rare lab-side look at AI-assisted engineering gains, while still saying research judgment remains a bottleneck and recursive self-improvement is unproven.
NVIDIA shipped Nemotron 3 Ultra, a 550B/55B-active hybrid Mamba-Transformer MoE with open weights, data, and recipe, plus broad runtime and host support. It matters because the model pairs frontier open benchmarks with immediate agent-serving options, though local use still needs heavy quantization or large-memory hardware.
Arena shipped Agent Mode, a benchmark that lets models use web search, bash, file writing, image generation, and follow-up questions, then ranks them on five live-session signals. It matters because agent evals move from static task sets to real user workflows, with GPT-5.5 High currently leading the leaderboard.
Ideogram released 4.0 as open weights with 2K output, layout control, and strong text rendering, with rollout to ComfyUI, fal, and Hugging Face. Teams can download the design-focused model, but they should check the non-commercial license before using it in production.
Two days after Qwen 3.7 Plus launched, Hyper, OpenCode, Kilo, and Vals shipped support or rankings around the 1M-context multimodal model. The rapid pickup shows Alibaba’s new model landing quickly in coding-agent tools and public eval stacks outside its own platform.
Microsoft introduced MAI-Thinking-1, MAI-Code-1-Flash, and five other MAI models across code, image, voice, and speech. The launch puts Microsoft back into the frontier-model race and starts landing pieces of the stack in Copilot and partner runtimes.
Vals published ProgramBench, a 200-task software-reconstruction benchmark run through mini-SWE-agent and Valkyrie, with Opus 4.8 becoming the first model to fully solve two tasks. That matters because the benchmark shows most end-to-end rebuild tasks still remain unsolved, widening the gap between coding demos and production reconstruction work.
NVIDIA released Cosmos 3 as an open omnimodel family with 16B and 64B variants, plus code, datasets, and a coalition around physical AI. The release matters because it ships with serving support and top open-weight image and video rankings, so teams can use it beyond a research teaser.
A day after MiniMax M3 launched, independent testers posted mixed results: cheap demos and design tasks worked, but several coding runs stalled, broke features, or used more tokens than expected. New external numbers added nuance, with Context Arena falling sharply after 64k context and one DeepSWE run passing 15 of 113 tasks.
NVIDIA teased Nemotron 3 Ultra as a 550B open-weight model due later this week, with early messaging centered on 5x faster and 30% cheaper inference plus a hybrid SSM-MoE design. The rollout matters because early benchmark posts already place it near the top of open-weight leaderboards, widening NVIDIA’s open-model push beyond Cosmos.
Three days after Opus 4.8 launched, new tests and field reports added failed tool calls, Bash-specific breakdowns, and higher token burn to the complaint list. Users report materially worse cost and stability in long coding sessions, while DeepSWE and GBA Eval point in different directions.
Independent users compared GPT-5.5/Codex with Opus 4.8/Claude Code using DeepSWE cost charts, GBA Eval runs, and long coding sessions. The split matters because engineers choosing a daily coding stack now have external quality-versus-cost evidence instead of only vendor launch claims.
Grok Imagine Video 1.5 moved from arena ranking to usable APIs, with xAI docs live and third-party access on fal and Venice. That matters because developers can now script against the model through standard providers, though early #1 arena claims are already being challenged by side-by-side testers.