Benchmarks
Benchmark suites, leaderboard caveats, and task measurement.
Stories
Filter storiesNous Research introduced Token Superposition Training, which bags tokens early in pretraining before returning to next-token prediction. The team says TST cuts wall-clock training 2-3x at matched FLOPs while leaving the deployed model unchanged.
Perceptron launched Mk1, a multimodal model for video and embodied reasoning with native 2 FPS video, 32K context, and structured spatial outputs. OpenRouter access and the low input price make it usable for deployment, not just demos.
SophontAI released Medmarks v1.0, expanding its open medical LLM evaluation suite to 30 benchmarks and 61 models alongside a technical report. It gives teams a larger open baseline for medical post-training and model selection, with more benchmarks and model coverage still planned.
Artificial Analysis launched a Coding Agent Index for model-and-harness pairs, while OpenHands refreshed its model leaderboard. The results show harness choice matters, with cost varying over 30x and task time over 7x across stacks.
Engineers shared fresh measurements on GPT-5.5 cache reuse, /fast pricing, and bug-finding budgets after comparison posts for GPT-5.5 and Opus 4.7 led the coding round-up. The reports suggest Codex cost and quality now swing on cache behavior and effort settings as much as on list prices.
Posts said Qwen3-8B now has a DFlash speculator with 82.2% first-token acceptance and 3.74 accepted tokens per step, alongside broader DFlash claims of over 6x lossless acceleration. It matters because the release turns a decoding paper into a concrete speculative-inference artifact engineers can test against existing Qwen stacks.
User posts and HN threads compared GPT-5.5 and Opus 4.7 across plan mode, frontend work, and 120K-context sessions. The split results mean token burn and instruction discipline matter as much as raw benchmark scores.
Baidu pushed ERNIE 5.1 Preview with new leaderboard claims, including No. 4 on Search Arena and No. 13 on LMArena Text. Treat the 6% pretraining cost claim cautiously until an independent technical report confirms it.
METR said an early Claude Mythos Preview snapshot reached at least a 16-hour 50% time horizon, with only five tasks in-suite at that range. The result matters because Mythos is beyond METR's stable measurement band, so cross-model comparisons are less reliable.
Zyphra released ZAYA1-8B, an Apache-2.0 reasoning MoE with compressed-convolutional attention and bounded-context Markovian RSA test-time compute. The model targets math and coding workloads while keeping the active parameter count below 1B.
OpenAI is rolling GPT-5.5 Instant into ChatGPT as the default model and exposing it as gpt-5.5-chat-latest, alongside Memory Sources for personalized replies. The model also claims 52.5% fewer high-stakes hallucinations, so watch for behavior changes in production prompts.
The SWE-Bench team released ProgramBench, which asks models to rebuild real software from executables alone, and the initial complete-pass score is 0% across models. It matters as a harsher long-horizon coding benchmark, though its all-tests-pass metric and simpler harness make it a stress test rather than a direct proxy for production agents.
ARC Prize published frontier-model results on ARC-AGI-3 and said GPT-5.5 and Opus 4.7 both stayed below 1%, with failures in world modeling, abstraction, and reward reinforcement. That shows strong coding and benchmark models still break on novel interactive reasoning tasks, and follow-up comparisons even had Opus 4.6 slightly ahead of 4.7.
ValsAI found that undocumented `tool_choice` behavior was skewing Terminal Bench 2 scores when no native tools were used, then reran the evals. The correction lifted GPT-5.5 by 11% to the top slot and showed how much harness settings can move coding-agent results.
IBM released 97M and 311M multilingual Granite Embedding R2 models under Apache 2.0, replacing XLM-RoBERTa with ModernBERT and extending context length from 512 to 32,768 tokens. The 311M model posts a +11.8 gain on MMTEB retrieval and ships with ONNX, OpenVINO, vLLM, and GGUF support.
Multiple summaries of the UK AISI report say GPT-5.5 roughly matches Claude Mythos Preview on long-horizon cyber tasks, including 2 of 10 end-to-end TLO completions. That matters because the model is broadly usable today, shifting cyber-workflow choices toward availability and mitigations rather than gated access alone.
Provider and benchmark trackers listed Grok 4.3 with 1M context and lower token pricing, and OpenRouter and Venice exposed it through their APIs. The model undercuts Opus 4.7 and GPT-5.5 on price while independent evaluations show stronger legal and finance performance than general coding.
IBM released Granite 4.1 as three open instruct models, with third parties quickly surfacing token-efficiency and deployment access. The update matters for teams evaluating smaller open models for agent workloads where output-token burn and openness both affect production cost.
Poolside opened Laguna M.1 and Laguna XS.2 as its first public coding models, with Apache 2.0 weights and same-day provider support. That gives teams open coding models that can run locally or through standard serving stacks.
New evals and day-three user tests show GPT-5.5 performing well at low or medium reasoning, with benchmark gains over GPT-5.4 in coding-heavy use. That matters because stronger results no longer require xhigh runs, though some users still flag sycophancy.
Builders published new MLX and 3-bit Qwen3.6 quants and shared reproducible local benchmarks from M3 Ultra, RTX 5070, and Radeon AI Pro setups. That gives local-agent teams concrete deployment options beyond launch-day claims, though memory budgets and long-context tool use still limit larger workflows.
DeepSeek lowered V4-Pro API pricing and updated integration guidance for Claude Code, OpenCode, and OpenClaw a day after V4 launched. Check whether V4-Flash is the easier deploy today, while Pro stays heavier and more rate-limited.
Users and third-party evals reported shorter runs, stronger long-context scores, and faster rollout into Cursor and other tools a day after GPT-5.5 hit the API. Higher per-token pricing may be partly offset by lower loop time and fewer tool-call stalls, so watch early bench data before changing defaults.
Alibaba launched Qwen-Image-2.0-Pro on ModelScope and API with better prompt adherence, multilingual typography, and steadier style quality. The model is aimed at text-heavy jobs like UI mockups and posters, so test it for layout-heavy generation.
Engineers unpacked DeepSeek V4's hybrid CSA/HCA attention a day after launch; it claims 27% of V3.2 FLOPs and 10% of its KV cache at 1M tokens. External tests pushed V4 Pro near the top of open-model indexes, but users also reported rate limits and mixed third-party results.
BidirLM released a 2.5B multilingual encoder that embeds text, images, and audio into one shared 2048-dimensional space and works directly with Sentence Transformers. It tops several open-data embedding leaderboards and can run locally on GPU.
DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.
OpenAI rolled out GPT-5.5 and GPT-5.5 Pro in ChatGPT and Codex, with higher scores on terminal, OS, cyber, and math evals than GPT-5.4. Codex also gained browser, document, and computer-use features for longer agent workflows.
Tencent open-sourced Hy3 preview, a 295B MoE with 21B active parameters and 256K context, then pushed it into OpenRouter, OpenCode, OpenClaw, vLLM, and SGLang immediately. That matters because engineers can test and deploy a new reasoning-agent model on day one instead of waiting for the runtime ecosystem to catch up.
Alibaba released Qwen3.6-27B, a dense open model with multimodal input and thinking or non-thinking modes that beats Qwen3.5-397B-A17B across major coding benchmarks. Day-one support across vLLM, SGLang, Ollama, llama.cpp, GGUF, and MLX makes it ready for local and hosted coding agents.
A day after GPT Image 2 launched, developers and tool vendors posted reproducible workflows for floor plans, QR codes, conference posters, typography, and Figma-style asset generation. The follow-up matters because it shows where text-heavy visual generation is already usable, but also that quality depends heavily on mode choice, image size, and surrounding tool scaffolding.
Xiaomi’s MiMo-V2.5-Pro and MiMo-V2.5 arrived with million-token context windows, stronger coding and agentic claims, and immediate access through OpenRouter plus agent harnesses. The rollout adds another low-cost Chinese frontier model that engineers can route into coding workflows without waiting for a proprietary IDE deal.
OpenAI introduced a free ChatGPT tier for verified U.S. clinicians and released HealthBench Professional, an open benchmark built from real clinical chat tasks. The launch pairs a clinician-facing workflow product with a public evaluation set and published model results.
OpenAI released GPT Image 2 in ChatGPT, Codex, and the API with thinking mode and 2K outputs. Early tests and Arena scores suggest it is usable for slides, UI mockups, and dense infographic layouts.
LightOn open-sourced DenseOn and LateOn plus the training pipeline behind them, including 1.4 billion query-document pairs and decontaminated BEIR results. Teams can use the small open retrieval models and reproduced data mixtures instead of opaque closed-data baselines.
Moonshot put Kimi K2.6 on API with cache-hit/cache-miss pricing, tool calls, JSON modes, and native text-image-video input. It also open-sourced FlashKDA and landed in Warp, Cosine, Genspark, and OpenClaw, making the launch usable coding-agent infrastructure.
Moonshot open-sourced Kimi K2.6, a 1T-parameter MoE with 32B active parameters, 256K context, multimodal input, and larger agent swarms. It now sits near frontier closed models for long-horizon coding and tool use, so teams can try it for agent workflows.
Qwen put Qwen3.6-Max-Preview live on Qwen Chat as an early flagship preview with stronger agentic coding and world-knowledge claims. Early testers report strong first-pass results, but the Max line remains closed rather than open-sourced.
Fresh local reports put Qwen3.6-35B-A3B around 40 tok/s on M3 Ultra, extended testing to Strix Halo, and wired it into OpenClaw and Pi-style harnesses. The update matters because Qwen3.6 is moving from quant benchmarks into real local coding-agent loops with clearer hardware limits.
A day after Opus 4.7 launched, users are surfacing adaptive-thinking misses, surprise refusals, and higher token use. For engineers, recheck prompts, costs, and 4.6 fallbacks while Anthropic patches bugs and lifts limits.
Unsloth published GGUF quant benchmarks for Qwen3.6-35B-A3B while practitioners shared local setup guides and long-context agent runs on Apple silicon and high-RAM desktops. The sparse 35B model is becoming a credible local coding-agent option, but speed and reasoning quality still vary by quant and offload strategy.
Tencent released HY-World 2.0, a multimodal world model that turns text, images, or video into editable 3D worlds, and open-sourced WorldMirror 2.0 inference code and weights. Its four-stage pipeline targets reusable scene assets rather than single-view video clips.
OpenAI launched GPT-Rosalind for biology, drug discovery, and translational medicine, plus a life sciences plugin for Codex. Access starts as a trusted preview for qualified customers, so near-term use is limited to partner and enterprise workflows.
Together AI and UCSD released Parcae, a looped model that reuses layers with a constrained recurrent dynamic and reports stronger results than parameter-matched Transformers from 140M to 1.3B scales. The released models and code suggest recurrence can trade memory for quality under fixed FLOP budgets instead of scaling parameters alone.
Google DeepMind shipped Gemini Robotics-ER 1.6 to the Gemini API and AI Studio with better visual-spatial reasoning, multi-view success detection, and gauge reading. The model's 93% instrument-reading score targets robots that need to reason over cluttered scenes and physical constraints.
Hugging Face introduced Kernels on the Hub to publish pre-compiled GPU kernels matched to GPU, PyTorch version, and OS. The packaging makes kernel optimizations shareable and claims 1.7x to 2.5x speedups over PyTorch baselines with torch.compile compatibility.
MiniMax M2.7 moved from announcement to deployment, with GGUF guidance for 128 GB local systems and same-day availability on Together, Fireworks, Hugging Face, and ModelScope. Use the local and managed serving options now, but check the non-commercial license before adopting the 230B model.
MiniMax open-sourced M2.7 and published coding and agent benchmark claims including 56.22% SWE-Pro and 57.0% Terminal Bench 2. Day-zero support from SGLang, vLLM, Ollama Cloud, Together AI, and NVIDIA NIM makes it easy to try on common serving stacks.
Meerkat and Berkeley RDI audits said popular agent leaderboards were inflated by harness-level leakage and eval gaming, with one cleaned entry dropping from first to 14th. That makes published coding-agent rankings and benchmark comparisons less reliable, so treat leaderboard results with caution.
Vercel said Sandbox is now the fastest microVM-based runtime, with fresh node -v cold starts now largely under 500 ms after a month of tuning. The update also puts persistent sandboxes into beta and expands plans for a programmable firewall, so teams should re-check runtime and security settings.