AI Primer
TOPIC · 50 stories

Coding Agents

Umbrella tag for the coding-agent space as a category. Prefer the narrower sub-tags agent-product-launch or agent-pattern. Reserve this tag for category-level / market-level stories that span multiple products.

NEWS · 27th June
OpenRouter reports four open-weight models handle agents; Chinese models hit 45% of traffic

OpenRouter said four open-weight models now handle real agentic workloads, and a JPMorgan report put Chinese models at about 45% of platform traffic. The shift matters because teams are optimizing for price, hosting, and task fit instead of defaulting to frontier APIs.

RELEASE · 26th June
Epoch releases MirrorCode with 25 long-horizon SWE tasks and a 56% score

Epoch introduced MirrorCode, a benchmark where models reimplement real programs from specs with no internet and hidden held-out tests; the best current score is 56%. The setup matters because it scales inference into multi-day runs and targets software jobs estimated to take humans weeks.

RELEASE · 26th June
Next.js 16.3 Preview adds AGENTS.md, agent-browser, and next-dev-loop Skills

Next.js previewed an agent-focused toolchain with auto-managed AGENTS.md, browser-backed verification, and Skills for cache-component migration and optimization. The release matters because framework guidance, browser introspection, and fix prompts are now packaged directly for coding agents.

NEWS · 25th June
OpenAI reports Codex drives 99.8% of internal AI output tokens

OpenAI published usage data showing Codex now generates 99.8% of its internal AI output tokens, with sharp growth in legal, support, recruiting, and finance. The report measures agent adoption as delegated parallel work, not just chat inside engineering.

RELEASE · 25th June
DeepReinforce releases Ornith-1.0 397B MoE with 82.4 SWE-Bench Verified

DeepReinforce released Ornith-1.0, an MIT-licensed coding-model family that trains on both solutions and task scaffolds. The flagship 397B MoE claims 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, pushing open coding models closer to closed frontier systems.

NEWS · 22nd June
Vals AI releases SkillsBench with a 17-point coding-agent gain and MiniMax-M3 at +25.4

Vals AI launched SkillsBench, a public benchmark for measuring how reusable skills change coding-agent performance, and reported average accuracy rising from 35.5% to 52.5%. The results matter because they suggest some workflows can move to cheaper models when task-specific skills are available.

NEWS · 1w ago
GLM-5.2 ships to BrowserCode, Hyper, OpenCode, and Together in 3 days

BrowserCode, Hyper, OpenCode, Together, and other vendors added GLM-5.2 soon after release. That turns the open model into a deployable option across coding, browser automation, and hosted chat.

WORKFLOW · 1w ago
GLM-5.2 ships in Claude Code, Droid, and 2-bit GGUF workflows

Builders published Claude Code and Droid setups for GLM-5.2 while Unsloth quantized it for local 256GB machines and Hugging Face opened temporary free inference. Teams can now run the open-weight model across hosted, local, and agent workflows.

RELEASE · 1w ago
Poolside releases Laguna M.1 open weights with 225B MoE and 256K context

Poolside released Apache 2.0 weights for Laguna M.1 and XS.2, its long-horizon coding models, with M.1 shipping at 225B total parameters, 23B active, and 256K context. SGLang and vLLM support on day one lets teams run and fine-tune the models in existing agent stacks immediately.

RELEASE · 1w ago
OpenHands adds Agent Client Protocol support to Agent Canvas, SDK, and Cloud

OpenHands added Agent Client Protocol support to its Agent Canvas, SDK, and Cloud, letting teams run different coding agents through one interface across local, remote, and cloud backends. The release also underpins new OpenHands Index results, so teams can compare harness-plus-model combinations instead of model-only runs.

RELEASE · 1w ago
TryCua launches Cua Driver Linux with background computer use and Wayland preview

TryCua brought Cua Driver to Linux, letting Claude Code, Codex, Hermes, and custom agents control real desktop apps via CLI or MCP without taking over the main terminal. The release also adds headless SSH execution and a preview of multi-window Wayland control across supported distros.

RELEASE · 1w ago
Omnigent opens live Claude Code and Codex sessions with phone control

Databricks open-sourced Omnigent, a meta-harness that runs Claude Code, Codex, Cursor, Pi, and custom agents in one live session with a collaborative web UI. The release centralizes supervision, cost control, and cross-agent review instead of splitting work across separate tools.

RELEASE · 1w ago
Cursor adds cloud handoff from mobile for agents that keep running

Cursor now lets developers move local agents to the cloud so work can continue after the laptop closes, with mobile as the handoff control surface. The change removes one of the main setup frictions in long-running cloud sessions.

NEWS · 1w ago
Cursor reports a $60B all-stock deal with SpaceX

Cursor said it agreed to a $60B all-stock deal with SpaceX, with closing targeted for Q3 and Cursor operating as a wholly owned subsidiary. The deal ties a major coding-agent channel to SpaceX compute and gives Cursor a new strategic owner.

NEWS · 1w ago
Anthropic delays Claude Agent SDK credit shift for `claude -p` and third-party apps

Anthropic paused a same-day policy change that would have moved Claude Agent SDK, `claude -p`, and third-party SDK apps onto separate monthly credits. Existing subscription-backed workflows continue unchanged for now, but teams should watch for the redesigned billing plan.

NEWS · 2w ago
Fable users compare GLM-5.2, GPT-5.5, and model panels on one-shot UI work

Two days after Fable 5 went offline, developers started testing GLM-5.2, GPT-5.5, and multi-model panels against the kinds of one-shot frontend and greenfield builds Fable handled well. The early pattern is that replacements cover much of the work, but Fable still leads on UI taste and first-pass product completion.

NEWS · 2w ago
OpenRouter, OpenCode, and 5 others add Claude Fable 5 on launch day

OpenRouter, OpenCode, Lovable, Cline, Browser Use Terminal, Nous Portal, and Venice all added Fable 5 within hours of launch. The rollouts put the model into gateways, coding agents, browser agents, and chat clients on day one.

RELEASE · 2w ago
Cohere releases North Mini Code: 30B MoE, 3B active, 256K context

Cohere open-sourced North Mini Code, a 30B-parameter coding MoE with 3B active parameters, 256K context, and Apache 2.0 licensing. OpenCode added it the same day, making the release immediately usable in a coding-agent client.

NEWS · 2w ago
Cognition benchmarks FrontierCode: top model scores 13% with mergeability grading

Cognition introduced FrontierCode, a coding benchmark that grades mergeability and review quality instead of only unit-test passes, and the top model scored 13%. The result matters because it differs from SWE-Bench-style pass rates, and outside researchers are already questioning score variance and reproducibility.

WORKFLOW · 3w ago
Agent tooling adds .prose.md programs, PR panes, and exact-edit primitives

Builders shipped OpenProse workflow files, ghzinga PR tabs, cmux terminal controls, datasette-agent-edit primitives, and an agent-optimized CLI fork. These pieces turn prompt strings into reusable files, panes, and testable edit loops for coding agents.

NEWS · 3w ago
MIT study reports 300% more files but only 30% more releases after AI coding adoption

An MIT-linked analysis says AI coding tools sharply raise local code output, but most of the gain disappears by review and release. Teams should watch downstream throughput, since project creation rose without matching demand signals in separate Hugging Face Spaces data.

NEWS · 3w ago
Uber caps AI coding-tool spend at $1,500 per employee per tool each month

Uber set a $1,500 monthly limit for each AI coding tool an employee uses, covering products such as Cursor and Claude Code. The cap gives enterprises an early benchmark for coding-agent spend as token costs outgrow typical software-seat budgets.

NEWS · 3w ago
Hyper, OpenCode, Kilo, and Vals add Qwen 3.7 Plus support within 72 hours

Two days after Qwen 3.7 Plus launched, Hyper, OpenCode, Kilo, and Vals shipped support or rankings around the 1M-context multimodal model. The rapid pickup shows Alibaba’s new model landing quickly in coding-agent tools and public eval stacks outside its own platform.

NEWS · 3w ago
Vals launches ProgramBench: Opus 4.8 solves 2 of 200 software-reconstruction tasks

Vals published ProgramBench, a 200-task software-reconstruction benchmark run through mini-SWE-agent and Valkyrie, with Opus 4.8 becoming the first model to fully solve two tasks. That matters because the benchmark shows most end-to-end rebuild tasks remain unsolved, underscoring the gap between coding demos and production reconstruction work.

NEWS · 3w ago
MiniMax M3 adds OpenCode, Hermes Agent, Atomic Chat, and Vercel AI Gateway support

A day after MiniMax M3 launched, OpenCode, Hermes Agent, Flowith, Atomic Chat, Kilo Code, Cloudflare AI Gateway, and Vercel AI Gateway shipped support. That breadth shows M3 plugged into agent harnesses and routing layers immediately, not just its own API.

NEWS · 4w ago
Agent tools add Claude Opus 4.8 to Cursor, Warp, OpenRouter, and Perplexity on day one

Independent IDEs, gateways, and agent runtimes rolled out Claude Opus 4.8 within hours of launch, including Cursor, Warp, OpenRouter, and Perplexity. That matters because teams can benchmark or swap the model into existing workflows without waiting for connector lag.

RELEASE · 4w ago
DeepSWE benchmarks GPT-5.5 at 70% on 113 tasks across 91 repos

DeepSWE launched a coding benchmark built from 113 original tasks across 91 repos and five languages, with GPT-5.5 leading at 70%. The setup is meant to better reflect repo search, multi-file edits, and verification in real agent workflows.

NEWS · 4w ago
Codex removes GPT-5.2 and GPT-5.3-Codex on June 2

OpenAI said ChatGPT-linked Codex will drop GPT-5.2 and GPT-5.3-Codex on June 2, with GPT-5.5 becoming the default frontier model for free users. The API versions stay available, but the in-product model surface is being reduced for compute-fleet management.

NEWS · 4w ago
Grok Build Beta adds Toad and Kilo Code integrations plus a web Build tab

xAI broadened Grok Build Beta while Toad and Kilo Code shipped direct support and published concrete build demos. That matters because Grok Build is moving from a standalone beta into terminal, editor, and web workflows engineers can actually wire into daily use.

WORKFLOW · 4w ago
Developers ship Chrome MCP, repo-graph search, and token compression for Claude Code and Codex

Independent developers released browser-control MCP tooling, repo-context graphing and packaging utilities, and token-compression helpers for coding agents. The cluster matters because agent workflows are now adding browser control, context packing, and cost controls as external infrastructure instead of waiting on raw model upgrades alone.

WORKFLOW · 1mo ago
Agent Skills ecosystem ships handoff docs, htmx v4 packs, and Project Think support

Independent builders published reusable skills infrastructure across coding agents, including Project Think preview support, handoff docs, and an htmx v4 skill pack. That matters because skills are starting to work like portable workflow units instead of one-off prompt snippets inside a single tool.

RELEASE · 1mo ago
Zero launches systems language for agents after 3,000 agent tasks

Triangle Company introduced Zero as a systems language aimed at agent-friendly tooling and said the compiler mostly self-hosts after about 3,000 agent tasks in three days. Early inspection praised the tiny C compiler but found broken Mach-O lowering and no fuzz tests, so the release looks experimental rather than production-ready.

NEWS · 1mo ago
Artificial Analysis launches Coding Agent Index: Cursor plus Opus 4.7 scores 61, Codex plus GPT-5.5 60

Artificial Analysis launched a Coding Agent Index for model-and-harness pairs, while OpenHands refreshed its model leaderboard. The results show harness choice matters, with cost varying over 30x and task time over 7x across stacks.

WORKFLOW · 1mo ago
Developers launch Agent FM, Mate, and ntm for multi-session Claude Code and Codex control

Independent developers shipped new control-plane tools for long-running coding agents, including Agent FM audio monitoring, Mate phone-first remote control, and ntm for provider-agnostic multi-agent workflows. It matters because teams running many Claude Code and Codex sessions still need better visibility, handoff, and checkpointing than a single built-in session list provides.

NEWS · 1mo ago
GPT-5.5 vs Opus 4.7: users compare plan mode, frontend output, and 120K-context use

User posts and HN threads compared GPT-5.5 and Opus 4.7 across plan mode, frontend work, and 120K-context sessions. The split results mean token burn and instruction discipline matter as much as raw benchmark scores.

RELEASE · 1mo ago
Codex 0.130.0 adds `codex remote-control` and migration support for Code and Cowork

A day after `/goal` and remote-control preview surfaced, Codex 0.130.0 shipped a simpler headless entrypoint while the app’s migration tool added Code and Cowork support. Users also showed Codex handling bug repro, long-running `/goal` sessions, and plugin-driven expense filing, which broadens its role from chat-first coding to delegated workflows.

NEWS · 1mo ago
Pi community ships `pi-listens`, `pi-kanban`, and `pi-codex-conversion` in one-day extension burst

Independent Pi builders shipped a voice layer, a kanban and observability dashboard, a Codex-conversion tool with `apply_patch`, and smaller UI extensions in the same window. The burst matters because it turns Pi from a single coding agent into a real local-first extension ecosystem with voice, review, and workflow primitives.

NEWS · 1mo ago
Amp Neo reports scaling issues as remote Mac-mini beta reaches airplane Wi-Fi users

Amp paused wider Neo rollout after hitting scaling issues, but beta users still showed remote sessions running from a home Mac mini through the web UI, including over airplane Wi-Fi. That makes Neo notable as a local-hosted coding-agent model, even if the control plane is not yet stable enough for broader access.

NEWS · 1mo ago
ProgramBench reports 0% on ffmpeg, SQLite, and ripgrep rebuilds without internet

The SWE-Bench team released ProgramBench, which asks models to rebuild real software from executables alone, and the initial complete-pass score is 0% across models. It matters as a harsher long-horizon coding benchmark, though its all-tests-pass metric and simpler harness make it a stress test rather than a direct proxy for production agents.

RELEASE · 2mo ago
Cursor releases SDK for CI/CD, local or cloud agents, and starter apps

Cursor shipped a TypeScript SDK that exposes its runtime, harness, and models for CI/CD jobs, background automations, and embedded agents. The launch lets teams treat Cursor as programmable agent infrastructure, though it still depends on Cursor API access.

RELEASE · 2mo ago
Mistral releases Medium 3.5 with 128B weights, 256K context, and Work Mode

Mistral shipped Medium 3.5 as a 128B dense model with 256K context, configurable reasoning, remote agents in Vibe, and Work Mode in Le Chat. The release broadens Mistral’s agent stack, though early comparisons question its price-performance against newer open rivals.

NEWS · 2mo ago
Tool vendors add GPT-5.5 to Cursor, Databricks, Droid, and ml-intern within 24 hours

Independent tools and platforms shipped GPT-5.5 support within a day of the API rollout, spanning IDEs, hosted research agents, enterprise stacks, and coding agents. That shortens evaluation time because teams can test the model inside existing workflows instead of rebuilding around a single OpenAI surface.

RELEASE · 2mo ago
Kimi K2.6 launches with 58.6 SWE-Bench Pro and 4,000-tool-call agent runs

Moonshot open-sourced Kimi K2.6, a 1T-parameter MoE with 32B active parameters, 256K context, multimodal input, and larger agent swarms. It now sits near frontier closed models for long-horizon coding and tool use, so teams can try it for agent workflows.

RELEASE · 2mo ago
Qwen launches Qwen3.6-Max-Preview on Qwen Chat with AA Index 52

Qwen put Qwen3.6-Max-Preview live on Qwen Chat as an early flagship preview with stronger agentic coding and world-knowledge claims. Early testers report strong first-pass results, but the Max line remains closed rather than open-sourced.

WORKFLOW · 2mo ago
Codex supports hidden-app control on macOS as users report 38-hour computer-use sessions

Fresh hands-on reports show Codex controlling minimized apps via macOS APIs, using a DOM-aware browser comment mode, and running for day-long sessions in the desktop app. That gives OpenAI stronger evidence that computer use is usable for daily development, though the rollout remains macOS-first and brittle around working-state changes.

RELEASE · 2mo ago
Ollama supports Hermes Agent in v0.21 with `ollama launch hermes`

Ollama 0.21 added native Hermes Agent support through the `ollama launch hermes` command. That makes a self-improving local agent loop available without a hosted inference stack, with memory and skills running on top of Ollama’s model serving.

RELEASE · 2mo ago
CopilotKit releases A2UI v0.9 with AG-UI support and npx create flow

CopilotKit released A2UI v0.9 for declarative generative UI, where agents emit JSON and frontends render from a component catalog. The update adds AG-UI support, live incremental rendering, and a shared web core across React, Angular, Flutter, and Lit.

RELEASE · 2mo ago
Qwen3.6-35B-A3B releases Apache 2.0 sparse MoE with 3B active params

Alibaba open-sourced Qwen3.6-35B-A3B, a 35B multimodal sparse MoE with only 3B active parameters under Apache 2.0. Same-day support from vLLM, Ollama, SGLang, and GGUF builders makes it immediately usable for local and production coding workloads.

RELEASE · 2mo ago
Codex adds background computer use on macOS with 90+ plugins and SSH devboxes

OpenAI expanded Codex with background Mac computer use, an in-app browser, image generation, memory preview, automations, and 90+ plugins. The release moves Codex from terminal coding toward long-running UI and ops workflows, though some features remain macOS-first or alpha.

RELEASE · 2mo ago
Claude Opus 4.7 releases with xhigh effort, `/ultrareview`, and 3x vision resolution

Claude Opus 4.7 is now generally available across Claude, the API, and major clouds with xhigh effort, higher-resolution vision, and Claude Code review upgrades. Prompt behavior, tokenization, and effort defaults changed enough that existing harnesses may need retuning.

AI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.