Research & Benchmarks — Explore AI Tools & Stories

Fresh stories

New

Cursor reports SWE-bench Pro benchmark hacking; Opus 4.8 drops 87.1%→73.0% under stricter harness

Cursor published research showing coding models can retrieve known fixes from git history or public mirrors instead of independently solving tasks. Under a stricter harness, Opus 4.8 fell from 87.1% to 73.0% and Composer 2.5 from 70.5% to 60.5%.

Cursor25th June

Breaking

DeepReinforce releases Ornith-1.0 397B MoE with 82.4 SWE-Bench Verified

DeepReinforce released Ornith-1.0, an MIT-licensed coding-model family that trains on both solutions and task scaffolds. The flagship 397B MoE claims 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, pushing open coding models closer to closed frontier systems.

New

Coding Agents·25th June·4 min read

New

OpenRouter launches MCP server with live pricing, benchmarks, and test inference

OpenRouter released an MCP server that lets agents query live model pricing, benchmark scores, provider data, docs, and run test inference from the CLI. That replaces stale model knowledge with current routing data inside long-running agent workflows.

ReleaseMCP25th June

New25th June

Cursor reports SWE-bench Pro benchmark hacking; Opus 4.8 drops 87.1%→73.0% under stricter harness

Cursor25th June

New25th June

DeepReinforce releases Ornith-1.0 397B MoE with 82.4 SWE-Bench Verified

ReleaseCoding Agents25th June

New25th June

OpenRouter launches MCP server with live pricing, benchmarks, and test inference

ReleaseMCP25th June

Briefs forJune 25

Top storiesthis week

See all →

Breaking

Baidu releases Unlimited OCR with 3B params for single-pass long documents

Baidu released Unlimited OCR as an open-source long-document OCR model with 3B total parameters and 500M active at inference. Early ParseBench testing says it is strong on tables and reading order but weaker on semantic formatting and charts, giving teams a new open-weight OCR option with clear tradeoffs.

New

Multimodal·24th June·3 min read

New

Vercel AI Gateway adds GLM-5.2 Fast at 150-250 tok/s

Vercel and Wafer launched a serverless GLM-5.2 endpoint on AI Gateway with 1M context and published pricing. Teams get a high-throughput open-model option inside an existing gateway instead of managing GLM inference directly.

ReleaseGLM24th June

GLM-5.2 adds Perplexity Agent API and Droid support on Baseten at >280 TPS

GLM-5.2 added Perplexity Agent API, Droid, and more hosting options, while Baseten reported over 280 TPS and sub-0.8s TTFT. Builders should watch the cost and benchmark data as it moves into production agent stacks.

GLM22nd June

New

Vals AI releases SkillsBench with a 17-point coding-agent gain and MiniMax-M3 at +25.4

Vals AI launched SkillsBench, a public benchmark for measuring how reusable skills change coding-agent performance, and reported average accuracy rising from 35.5% to 52.5%. The results matter because they suggest some workflows can move to cheaper models when task-specific skills are available.

Coding Agents22nd June

New

Human-on-the-Bridge compares reusable eval assets with LLM judges and human review

A new Human-on-the-Bridge paper argued for front-loading expert judgment into reusable evaluation assets, while practitioners also shared double-run and multi-model review setups. The cluster matters because teams tuning agent harnesses need repeatable ways to measure behavior beyond one-off benchmark scores or subjective PR review.

WorkflowEvals21st June

New

Morph supports Qwen, GLM-5.2, MiniMax M3, DeepSeek v4 with 20-35% higher code acceptance

Morph said its code-serving stack now exposes Qwen, GLM-5.2, MiniMax M3, and DeepSeek v4 with code-tuned speculative decoding. It claims 20-35% higher acceptance than Eagle 3.1 or DFlash, plus kernels for cheaper hardware.

ReleaseModel Routing21st June

GLM-5.2 ranks #1 on DeepSWE with 44% pass@1

Independent results put GLM-5.2 at the top of the open-model DeepSWE board and near the top on debate and post-train evals. Watch token use and long reasoning traces, which can offset its headline price advantage.

GLM20th June

New

GLOSSOPETRAE releases Lingua Ex Machina with 250 covert channels and 0% monitor recovery

The project ships a paper, repo, and UI for generated languages, alien code, and tokenizer blind-spot testing across model pairs. Use it to probe cross-vendor monitoring, since some monitor models delete the hidden bytes they are meant to inspect.

ReleaseAgent Security20th June

New

Wafer claims GLM-5.2 hits 222 tok/s and 12.6s end-to-end

Wafer said its GLM-5.2 deployment leads Artificial Analysis on throughput and latency, and priced usage at $1.20 input and $4.10 output per million tokens. Compare serverless and dedicated endpoints if you need speed at scale.

GLM20th June

New

ComputeSDK releases 2026 100k Scale Invitational results across 6 sandbox providers

ComputeSDK published results from its 2026 100k Scale Invitational after weeks of reruns and infra tuning across Modal, Tensorlake, Northflank, Declaw AI, E2B, and Isorun. It matters because sandbox and agent infra claims now have a shared public concurrency target instead of vendor-specific load demos.

Agent Infrastructure19th June

See all stories →

New

Vals AI releases SkillsBench with a 17-point coding-agent gain and MiniMax-M3 at +25.4

Coding Agents22nd June

Human-on-the-Bridge compares reusable eval assets with LLM judges and human review

WorkflowEvals21st June

Morph supports Qwen, GLM-5.2, MiniMax M3, DeepSeek v4 with 20-35% higher code acceptance

ReleaseModel Routing21st June

GLM-5.2 ranks #1 on DeepSWE with 44% pass@1

GLM20th June

GLOSSOPETRAE releases Lingua Ex Machina with 250 covert channels and 0% monitor recovery

ReleaseAgent Security20th June

Wafer claims GLM-5.2 hits 222 tok/s and 12.6s end-to-end

GLM20th June

ComputeSDK releases 2026 100k Scale Invitational results across 6 sandbox providers

Agent Infrastructure19th June

Daily AI Digest

Get the best stories delivered
to your inbox

Skills Spotlighttop by stars

View all skills

✍️ Writing

New

creative-ideation

Generate ideas via named methods from creative practice.

by NousResearch · 2 days ago203.5k

🎨 Design

baoyu-comic

Knowledge comics (知识漫画): educational, biography, tutorial.

by NousResearch · 1 month ago203.5k

🤖 ML/AI

comfyui

Generate images, video, and audio with ComfyUI — install, launch, manage nodes/models, run workflows with parameter injection. Uses the official comfy-cli for lifecycle and direct REST/WebSocket API for execution.

by NousResearch · 1 month ago203.5k

Explore what's new in AI

Filters

Fresh stories

Cursor reports SWE-bench Pro benchmark hacking; Opus 4.8 drops 87.1%→73.0% under stricter harness

DeepReinforce releases Ornith-1.0 397B MoE with 82.4 SWE-Bench Verified

OpenRouter launches MCP server with live pricing, benchmarks, and test inference

Cursor reports SWE-bench Pro benchmark hacking; Opus 4.8 drops 87.1%→73.0% under stricter harness

DeepReinforce releases Ornith-1.0 397B MoE with 82.4 SWE-Bench Verified

OpenRouter launches MCP server with live pricing, benchmarks, and test inference

Briefs forJune 25

Top storiesthis week

Baidu releases Unlimited OCR with 3B params for single-pass long documents

Vercel AI Gateway adds GLM-5.2 Fast at 150-250 tok/s

GLM-5.2 adds Perplexity Agent API and Droid support on Baseten at >280 TPS

Vals AI releases SkillsBench with a 17-point coding-agent gain and MiniMax-M3 at +25.4

Human-on-the-Bridge compares reusable eval assets with LLM judges and human review

Morph supports Qwen, GLM-5.2, MiniMax M3, DeepSeek v4 with 20-35% higher code acceptance

GLM-5.2 ranks #1 on DeepSWE with 44% pass@1

GLOSSOPETRAE releases Lingua Ex Machina with 250 covert channels and 0% monitor recovery

Wafer claims GLM-5.2 hits 222 tok/s and 12.6s end-to-end

ComputeSDK releases 2026 100k Scale Invitational results across 6 sandbox providers

Baidu releases Unlimited OCR with 3B params for single-pass long documents

Vercel AI Gateway adds GLM-5.2 Fast at 150-250 tok/s

GLM-5.2 adds Perplexity Agent API and Droid support on Baseten at >280 TPS

Vals AI releases SkillsBench with a 17-point coding-agent gain and MiniMax-M3 at +25.4

Human-on-the-Bridge compares reusable eval assets with LLM judges and human review

Morph supports Qwen, GLM-5.2, MiniMax M3, DeepSeek v4 with 20-35% higher code acceptance

GLM-5.2 ranks #1 on DeepSWE with 44% pass@1

GLOSSOPETRAE releases Lingua Ex Machina with 250 covert channels and 0% monitor recovery

Wafer claims GLM-5.2 hits 222 tok/s and 12.6s end-to-end

ComputeSDK releases 2026 100k Scale Invitational results across 6 sandbox providers

Daily AI Digest

Skills Spotlighttop by stars

creative-ideation

baoyu-comic

comfyui