Benchmark suites, leaderboard caveats, and task measurement.
Results from the ATLAS harness show a frozen Qwen3-14B Q4 model on a single RTX 5060 Ti reaching 74.6% pass@1-v(k=3) on LiveCodeBench v5 through multi-pass repair and selection. The result shifts the comparison toward harness design, though HN commenters note it is not a one-shot head-to-head with hosted frontier models.
Fresh ARC-AGI-3 discussion centers on how its human-efficiency score mixes completion with time and tool-use efficiency. Critics say the metric can hide different failure modes, even when the benchmark still surfaces exploration and planning behavior that static tests miss.
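The critics' point is easiest to see with a toy composite. The formula below is a hypothetical illustration, not the actual ARC-AGI-3 scoring rule; it shows how blending completion, time, and tool-use efficiency into one number can make distinct failure modes indistinguishable:

```python
def composite_efficiency(completed, time_ratio, tool_ratio):
    """Hypothetical composite: completion gated by time and tool-use efficiency.

    completed   -- fraction of tasks solved (0..1)
    time_ratio  -- human time / agent time (slower agent => lower ratio)
    tool_ratio  -- human tool calls / agent tool calls
    """
    return completed * min(1.0, time_ratio) * min(1.0, tool_ratio)

# Two very different failure modes land on the same score:
slow_solver = composite_efficiency(completed=1.0, time_ratio=0.25, tool_ratio=1.0)
weak_solver = composite_efficiency(completed=0.25, time_ratio=1.0, tool_ratio=1.0)
assert slow_solver == weak_solver == 0.25
```

A leaderboard reporting only the composite cannot distinguish an agent that solves everything slowly from one that solves a quarter of the tasks quickly.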
Public Anthropic draft posts described Claude Mythos as the company's most powerful model and placed a new Capybara tier above Opus 4.6. The documents also point to cybersecurity capability and compute cost as rollout constraints.
ARC-AGI-3 introduced an interactive reasoning benchmark that measures world-model building and skill acquisition without natural-language instructions. Early discussion is focused on Duke harness results with generic tools and whether the scoring rewards generalization or benchmark-specific optimization.
Z.ai made GLM-5.1 available to all Coding Plan users and documented how to route coding agents to it by changing the model name in config. Early harness benchmarks place it near Opus 4.6 on coding evals, but BridgeBench users report much slower tokens per second.
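The routing change the docs describe amounts to swapping one model identifier in the agent's config. A minimal sketch, with key names and the old model string assumed for illustration rather than taken from Z.ai's actual schema:

```python
# Hypothetical coding-agent config; real key names depend on your agent.
config = {"provider": "zai", "model": "glm-4.7", "api_base": "https://api.z.ai/v1"}

def route_to(config, model_name):
    """Return a copy of the config pointed at a different model."""
    updated = dict(config)
    updated["model"] = model_name
    return updated

new_config = route_to(config, "glm-5.1")
```

Because only the model name changes, the slower tokens-per-second reports are worth checking before flipping production traffic over.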
Artificial Analysis introduced AA-AgentPerf to benchmark hardware on real coding-agent traces instead of synthetic chat prompts. The benchmark reports users per accelerator, kW, dollar, and rack, so teams can compare production cost and throughput more realistically.
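The per-accelerator, per-kW, per-dollar framing reduces to simple ratios over sustained concurrent users. A sketch with made-up numbers, not AA-AgentPerf's actual figures:

```python
def agentperf_ratios(concurrent_users, accelerators, power_kw, cost_usd_per_hr, racks):
    """Normalize sustained concurrent users by accelerator, power, cost, and rack."""
    return {
        "users_per_accelerator": concurrent_users / accelerators,
        "users_per_kw": concurrent_users / power_kw,
        "users_per_dollar_hr": concurrent_users / cost_usd_per_hr,
        "users_per_rack": concurrent_users / racks,
    }

# Illustrative numbers only.
ratios = agentperf_ratios(concurrent_users=512, accelerators=8, power_kw=10.0,
                          cost_usd_per_hr=64.0, racks=1)
```

Normalizing by these denominators is what lets two very different hardware configurations be compared on production cost rather than raw tokens per second.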
SAM 3.1 is a drop-in update that shares video computation across up to 16 tracked objects instead of rerunning most of the model per object. Meta's H100 numbers show roughly 30 FPS at 16 objects versus under 10 FPS for SAM 3, which cuts multi-object video tracking cost.
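The speedup pattern is classic amortization: one shared backbone pass plus a small per-object head, versus rerunning the full model per object. An illustrative cost model with made-up constants, not Meta's measurements:

```python
def fps_shared(shared_ms, per_object_ms, n_objects):
    """Frames per second when per-frame cost = shared pass + per-object heads."""
    return 1000.0 / (shared_ms + per_object_ms * n_objects)

def fps_unshared(full_pass_ms, n_objects):
    """Rerunning the whole model once per tracked object."""
    return 1000.0 / (full_pass_ms * n_objects)

# Made-up constants: 25 ms shared backbone, 0.5 ms per-object head, 30 ms full pass.
shared_16 = fps_shared(25.0, 0.5, 16)
unshared_16 = fps_unshared(30.0, 16)
```

Under this model the shared design degrades slowly as objects are added, while the unshared design degrades linearly, which matches the shape of the reported H100 numbers.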
Mistral released open-weight Voxtral TTS with low-latency streaming, voice cloning, and cross-lingual adaptation, and vLLM Omni shipped day-0 support. Voice-agent teams should compare quality, latency, and serving cost against closed APIs.
Cohere released a 2B speech-to-text model with 14 languages and top Open ASR scores, and upstreamed encoder-decoder optimizations to vLLM in the same launch. It is a self-hosted ASR option, so test accuracy and throughput on your own speech workload.
Chroma released Context-1, a 20B search agent it says pushes the speed-cost-accuracy frontier for agentic search, with open weights on Hugging Face. Benchmark it against your current search stack before wiring it into production.
ARC-AGI-3 swaps static puzzles for interactive game-like environments and posts initial frontier scores below 1%, with Gemini 3.1 Pro at 0.37%. Teams can use it to inspect agent reasoning, but score interpretation still depends heavily on the human-efficiency metric and no-harness setup.
Data Agent Benchmark launches with 54 enterprise-style queries across 12 datasets, nine domains, and four database systems, while the best frontier model reaches only 38% pass@1. It gives teams a stronger eval for cross-database agents than text-to-SQL-only benchmarks.
GPT-5.4 mini and nano bring 400K context, multimodal input, and the full GPT-5.4 reasoning-mode ladder at lower prices. Early benchmarking suggests nano is the strongest cost-performance tier for agentic tasks, but both models spend far more output tokens than peers.
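Because the smaller tiers spend far more output tokens, per-token price alone can mislead; effective per-task cost is what matters. A sketch with illustrative prices and token counts, not OpenAI's published rates:

```python
def task_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Effective dollar cost of one agent task."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

# Illustrative: a cheaper tier that emits 3x the output tokens can still win
# or lose depending on the price gap.
cheap_verbose = task_cost(20_000, 30_000, in_price_per_m=0.1, out_price_per_m=0.4)
pricey_terse  = task_cost(20_000, 10_000, in_price_per_m=0.5, out_price_per_m=2.0)
```

Benchmarking cost per completed task, rather than cost per token, is the fair way to compare a verbose cheap tier against a terse expensive one.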
Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
LLM Debate Benchmark ran 1,162 side-swapped debates across 21 models and ranked Sonnet 4.6 first, ahead of GPT-5.4 high. It adds a stronger adversarial eval pattern for judge or debate systems, but you should still inspect content-block rates and judge selection when reading the leaderboard.

Vals AI switched SWE-Bench Verified from SWE-Agent to the bash-only mini-swe-agent harness, aligning results more closely with the official benchmark setup. Top score dipped slightly to 78.8%, but the change reduces harness-specific confounds when comparing models.
OpenHands introduced EvoClaw, a benchmark that reconstructs milestone DAGs from repo history to test continuous software evolution instead of isolated tasks. The first results show agents can clear single tasks yet still collapse under regressions and technical debt over longer runs.
Skyler Miao said MiniMax M2.7 open weights are due in roughly two weeks, with updates tuned for agent tasks. Separate replies also confirm multimodal M3, so local-stack builders should watch both the drop and the benchmark setup.
A newly released toolkit sweeps contiguous layer ranges in GGUF and llama.cpp-style setups to test whether duplicating them can unlock better reasoning without retraining. Treat any score jump as a reproducible experiment rather than a settled mechanism, because thread responses dispute whether the effect reflects circuits, routing, or training artifacts.
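Layer-range duplication (sometimes called a self-merge) is easy to state abstractly: repeat a contiguous slice of the layer stack. A model-agnostic sketch over an indexed layer list, not tied to the toolkit's GGUF handling:

```python
def duplicate_range(layers, start, end, times=2):
    """Repeat layers[start:end] `times` times in place of the original slice."""
    if not (0 <= start < end <= len(layers)):
        raise ValueError("bad layer range")
    return layers[:start] + layers[start:end] * times + layers[end:]

# A 12-layer stack with layers 4..7 doubled becomes a 16-layer stack.
stack = list(range(12))
grown = duplicate_range(stack, 4, 8)
```

A sweep then amounts to iterating `start`/`end` over the stack and re-running the eval on each grown variant, which is why the results are cheap to reproduce but hard to attribute mechanistically.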
Vercel's Next.js evals place Composer 2 second, ahead of Opus and Gemini despite the recent Kimi-base controversy. The result matters because it separates base-model branding from measured task performance on a real framework workflow.
A developer says an autoresearch loop hill-climbed a vibecoded Rust engine to 2718 Elo after running more than 70 experiments under a 500 ms move budget. The real takeaway is the workflow: automated experiment loops can optimize code against a measurable target.
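The workflow generalizes beyond chess engines: propose a change, measure it against a fixed budgeted metric, keep it only if the score improves. A generic hill-climb sketch with a toy objective standing in for an Elo match harness:

```python
import random

def hill_climb(candidate, mutate, evaluate, budget=70):
    """Keep a mutation only when the measured score improves."""
    best, best_score = candidate, evaluate(candidate)
    for _ in range(budget):
        trial = mutate(best)
        score = evaluate(trial)
        if score > best_score:
            best, best_score = trial, score
    return best, best_score

# Toy stand-in objective: maximize -(x - 3)^2; mutations are small nudges.
random.seed(0)
x, score = hill_climb(
    candidate=0.0,
    mutate=lambda v: v + random.uniform(-0.5, 0.5),
    evaluate=lambda v: -(v - 3.0) ** 2,
)
```

The key property is that the loop never accepts a regression against the metric, so any Elo gain it reports is at least locally real, even if the individual code changes look arbitrary.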
Physical Intelligence says its RL token compresses VLA state into a lightweight signal that an on-robot actor-critic can adapt in minutes. This matters for last-millimeter manipulation, where full-size models are often too slow or too coarse to tune online.
Cursor and Kimi said Composer 2 starts from Kimi K2.5, with continued pretraining and RL added on top after developers spotted Kimi model IDs in traffic. Teams should benchmark it as a productized open-base stack, not a from-scratch model.
Kilo said MiniMax M2.7 placed fifth on PinchBench, 1.2 points behind Opus 4.6 at much lower input cost, while community tests showed strong multi-loop agent behavior on graphics tasks. If you route coding-agent traffic by price, M2.7 looks worth a controlled bake-off.
NVIDIA published Nemotron-Cascade 2, a 30B MoE with 3B active parameters, claiming IMO gold-level math and Kimi K2.5-class code scores, then pushed it to Hugging Face and Ollama. It is worth testing if you want an open agent model with immediate local and hosted paths.
Mistral Small 4 combines reasoning and non-reasoning modes in one 119B MoE, adds native image input, and expands context to 256K at $0.15/$0.60 per million tokens. It improves sharply over Small 3.2, but still trails similarly sized open peers on several evals.
LightOn says its 150M multi-vector retriever is pushing BrowseComp-Plus close to saturation, with results showing search-call behavior and retriever choice matter nearly as much as model size. Retrieval engineers should watch multi-hop setup and tool-calling limits before copying the benchmark.
Cursor shipped Composer 2 with gains on CursorBench, Terminal-Bench 2.0, and SWE-bench Multilingual, plus a fast tier and an early Glass interface alpha. It resets the price-performance baseline for coding agents and shows Cursor is now a model company as much as an IDE.
New third-party tests put MiniMax M2.7 at a 34% hallucination rate, roughly 65 tps, and 27.04% on Vibe Code Bench while users pushed it through physics-heavy web demos. It looks increasingly viable for agent workflows, but performance still swings by task and harness.
LightOn’s late-interaction retriever paired with GPT-5 reached 87.59% accuracy on BrowseComp-Plus while using fewer search calls than larger baselines. It suggests deep-research quality may now hinge more on retrieval architecture than on swapping in ever larger LLMs.
Xiaomi launched MiMo-V2-Pro through its own API and confirmed Hunter Alpha was an early internal build. That makes the model easier to compare directly for long-context coding and tool-use workloads.
MiniMax released M2.7 on its API and agent platform with coding and office-task claims plus a self-improving training harness. Engineers should validate the benchmark gains on real workloads, especially given mixed third-party results and aggressive pricing.
Datalab open-sourced Chandra OCR 2, a 4B document model with repo, weights, demo, and CLI quickstart, and claims state-of-the-art 85.9 on olmOCR Bench. It gives document pipelines a practical multilingual OCR option that can run with local tooling instead of only hosted APIs.
OpenAI opened its first Model Craft challenge, asking participants to train the best language model that fits inside a 16 MB artifact and trains in under 10 minutes on eight H100s. Engineers get a concrete optimization target, an automated GitHub leaderboard, and a public benchmark for training-efficiency tricks.
OpenAI shipped GPT-5.4 mini to ChatGPT, Codex, and the API, and GPT-5.4 nano to the API, with 400K context, lower prices, and stronger coding and computer-use scores. Route subagents and high-volume tasks to the smaller tiers to cut spend without giving up much capability.
W&B shipped robotics-focused evaluation views including synchronized video playback, pinned run baselines, semantic coloring, and side-by-side media comparisons. These tools matter if your model outputs are videos or trajectories and loss curves alone hide failure modes.
Together introduced Mamba-3 and open-sourced kernels for a new MIMO state-space variant that targets decode efficiency and beats Mamba-2, GDN, and Llama 3.2 1B at 1.5B scale. Test it when deployment speed matters more than chasing another generic Transformer baseline.
Google DeepMind and Kaggle opened a global challenge to build cognitive benchmarks across learning, metacognition, attention, executive function, and social cognition. Join if you work on evals and want reusable tasks with human baselines instead of another saturated leaderboard.
OpenAI said GPT-5.4 ramped faster than any prior API model, reaching 5 trillion daily tokens within a week, while third-party benchmarks placed it in the top tier on general reasoning. Track production behavior before wider rollout if coding and follow-up quality matter to your stack.
Artificial Analysis published results for NVIDIA's Nemotron 3 VoiceChat, putting the 12B model at the open-weight Pareto frontier across conversational dynamics and speech reasoning. Consider it for open voice agents, but compare against proprietary systems that still lead the category by a wide margin.
Moonshot introduced Attention Residuals, replacing fixed depth-wise residual accumulation with learned lookbacks over earlier layers, and reports a 1.25x compute advantage on Kimi Linear. Try it as a drop-in lever for deeper stacks, but verify memory tradeoffs and downstream gains on your own architecture.
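The learned-lookback idea can be sketched minimally: instead of each layer adding only the previous layer's output, it adds a learned weighted mix over all earlier hidden states. The shapes and softmax weighting below are illustrative assumptions, not Moonshot's published formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_with_lookback(x, layer_fns, lookback_logits):
    """Each layer i feeds on a learned combination of ALL earlier states,
    not just state i-1, before applying its transformation."""
    states = [x]
    for i, fn in enumerate(layer_fns):
        w = softmax(lookback_logits[i][: len(states)])   # learned per-layer weights
        residual = sum(wj * s for wj, s in zip(w, states))
        states.append(residual + fn(residual))
    return states[-1]

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
layers = [lambda h, W=rng.standard_normal((8, 8)) / 8: np.tanh(W @ h) for _ in range(4)]
logits = rng.standard_normal((4, 5))                      # one logit row per layer
out = forward_with_lookback(x, layers, logits)
```

The memory tradeoff the item warns about is visible here: the forward pass must retain every earlier state, whereas a fixed residual stream only needs the running sum.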
Third-party MRCR v2 results put Claude Opus 4.6 at a 78.3% match ratio at 1M tokens, ahead of Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. If you are testing long-context agents, measure retrieval quality and task completion, not just advertised context window size.
UT Austin researchers report that simple sequential fine-tuning with LoRA and on-policy RL can retain prior skills while learning new VLA tasks. Try this baseline before reaching for more complex continual-learning methods.
Reports say Meta pushed Avocado from March to at least May after internal reasoning, coding, and writing tests missed current frontier targets. Expect more delayed launches at the top end, and watch for products that route some features through competitor models.
NVIDIA released Nemotron 3 Super, a 120B open model with 1M-token context and a hybrid architecture tuned for agent workloads, then landed it in Perplexity and Baseten. Try it if you need an open-weight long-context option that is already available in hosted stacks.
Vals published a benchmark pass for Grok 4.20 Beta showing gains on coding, math, multimodal, and Terminal Bench 2, alongside weaker legal-task results. Check task-level results before adopting it, especially if legal workflows matter more than headline benchmark gains.
Terminal-Bench maintainers said they independently verified cheating claims and removed OpenBlocks from the 2.0 leaderboard. Audit submission artifacts and harness details before relying on public coding-agent rankings.
Arena now shows input-output pricing and max context window directly on its text leaderboards, along with public material on how votes become research-grade data. Use it to compare rank against cost and context limits when choosing models.
Mixedbread introduced Wholembed v3 as a retrieval model for text, image, video, audio, and multilingual search. Benchmark it on fine-grained retrieval tasks if single-vector embeddings have been collapsing in your pipeline.
Cursor published its internal benchmarking approach and reported wider separation between coding models than SWE-bench-style leaderboards show. Use it as a reference for production routing decisions, but validate results against your own online traffic and task mix.