AI Primer
TOPIC 50 stories

Coding Agents

Umbrella tag for the coding-agent space as a category. Prefer the narrower sub-tags agent-product-launch or agent-pattern. Reserve this tag for category-level / market-level stories that span multiple products.

WORKFLOW 11th May
Developers launch Agent FM, Mate, and ntm for multi-session Claude Code and Codex control

Independent developers shipped new control-plane tools for long-running coding agents, including Agent FM audio monitoring, Mate phone-first remote control, and ntm for provider-agnostic multi-agent workflows. It matters because teams running many Claude Code and Codex sessions still need better visibility, handoff, and checkpointing than a single built-in session list provides.

NEWS 11th May
Artificial Analysis launches Coding Agent Index: Cursor plus Opus 4.7 scores 61, Codex plus GPT-5.5 60

Artificial Analysis launched a Coding Agent Index for model-and-harness pairs, while OpenHands refreshed its model leaderboard. The results show harness choice matters, with cost varying over 30x and task time over 7x across stacks.

NEWS 9th May
GPT-5.5 vs Opus 4.7: users compare plan mode, frontend output, and 120K-context use

User posts and HN threads compared GPT-5.5 and Opus 4.7 across plan mode, frontend work, and 120K-context sessions. The split results mean token burn and instruction discipline matter as much as raw benchmark scores.

NEWS 9th May
Pi community ships `pi-listens`, `pi-kanban`, and `pi-codex-conversion` in one-day extension burst

Independent Pi builders shipped a voice layer, a kanban and observability dashboard, a Codex-conversion tool with `apply_patch`, and smaller UI extensions in the same window. The burst matters because it turns Pi from a single coding agent into a real local-first extension ecosystem with voice, review, and workflow primitives.

RELEASE 9th May
Codex 0.130.0 adds `codex remote-control` and migration support for Code and Cowork

A day after `/goal` and remote-control preview surfaced, Codex 0.130.0 shipped a simpler headless entrypoint while the app’s migration tool added Code and Cowork support. Users also showed Codex handling bug repro, long-running `/goal` sessions, and plugin-driven expense filing, which broadens its role from chat-first coding to delegated workflows.

NEWS 9th May
Amp Neo reports scaling issues as remote Mac-mini beta reaches airplane Wi-Fi users

Amp paused wider Neo rollout after hitting scaling issues, but beta users still showed remote sessions running from a home Mac mini through the web UI, including over airplane Wi-Fi. That makes Neo notable as a local-hosted coding-agent model, even if the control plane is not yet stable enough for broader access.

NEWS 1w ago
ProgramBench reports 0% on ffmpeg, SQLite, and ripgrep rebuilds without internet

The SWE-Bench team released ProgramBench, which asks models to rebuild real software from executables alone, and the initial complete-pass score is 0% across models. It matters as a harsher long-horizon coding benchmark, though its all-tests-pass metric and simpler harness make it a stress test rather than a direct proxy for production agents.

RELEASE 2w ago
Cursor releases SDK for CI/CD, local or cloud agents, and starter apps

Cursor shipped a TypeScript SDK that exposes its runtime, harness, and models for CI/CD jobs, background automations, and embedded agents. The launch lets teams treat Cursor as programmable agent infrastructure, though it still depends on Cursor API access.

RELEASE 2w ago
Mistral releases Medium 3.5 with 128B weights, 256K context, and Work Mode

Mistral shipped Medium 3.5 as a 128B dense model with 256K context, configurable reasoning, remote agents in Vibe, and Work Mode in Le Chat. The release broadens Mistral’s agent stack, though early comparisons question its price-performance against newer open rivals.

NEWS 2w ago
Tool vendors add GPT-5.5 to Cursor, Databricks, Droid, and ml-intern within 24 hours

Independent tools and platforms shipped GPT-5.5 support within a day of the API rollout, spanning IDEs, hosted research agents, enterprise stacks, and coding agents. That shortens evaluation time because teams can test the model inside existing workflows instead of rebuilding around a single OpenAI surface.

RELEASE 3w ago
Kimi K2.6 launches with 58.6 SWE-Bench Pro and 4,000-tool-call agent runs

Moonshot open-sourced Kimi K2.6, a 1T-parameter MoE with 32B active parameters, 256K context, multimodal input, and larger agent swarms. It now sits near frontier closed models for long-horizon coding and tool use, so teams can try it for agent workflows.

RELEASE 3w ago
Qwen launches Qwen3.6-Max-Preview on Qwen Chat with AA Index 52

Qwen put Qwen3.6-Max-Preview live on Qwen Chat as an early flagship preview with stronger agentic coding and world-knowledge claims. Early testers report strong first-pass results, but the Max line remains closed rather than open-sourced.

WORKFLOW 3w ago
Codex supports hidden-app control on macOS as users report 38-hour computer-use sessions

Fresh hands-on reports show Codex controlling minimized apps via macOS APIs, using a DOM-aware browser comment mode, and running for day-long sessions in the desktop app. That gives OpenAI stronger evidence that computer use is usable for daily development, though the rollout remains macOS-first and brittle around working-state changes.

RELEASE 3w ago
CopilotKit releases A2UI v0.9 with AG-UI support and npx create flow

CopilotKit released A2UI v0.9 for declarative generative UI, where agents emit JSON and frontends render from a component catalog. The update adds AG-UI support, live incremental rendering, and a shared web core across React, Angular, Flutter, and Lit.
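
The declarative pattern described here (an agent emits JSON, and the frontend renders only components it already knows) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `CATALOG` dictionary, the `type`/`props`/`children` field names, and the text-based renderers are hypothetical and are not the A2UI schema or CopilotKit's API.

```python
import json

# Hypothetical component catalog: maps a component type to a render function.
# The names and fields are illustrative only, not the A2UI schema.
def render_text(props, children):
    return props["text"]

def render_button(props, children):
    return f"[{props['label']}]"

def render_column(props, children):
    # Children arrive already rendered; stack them vertically.
    return "\n".join(children)

CATALOG = {
    "text": render_text,
    "button": render_button,
    "column": render_column,
}

def render(node):
    """Render a JSON UI description, rejecting types outside the catalog."""
    kind = node["type"]
    if kind not in CATALOG:
        raise ValueError(f"unknown component type: {kind}")
    children = [render(child) for child in node.get("children", [])]
    return CATALOG[kind](node.get("props", {}), children)

# Simulated agent output: plain JSON data, never executable UI code.
agent_output = json.dumps({
    "type": "column",
    "children": [
        {"type": "text", "props": {"text": "Deploy finished"}},
        {"type": "button", "props": {"label": "View logs"}},
    ],
})

rendered = render(json.loads(agent_output))
print(rendered)
```

Because the agent can only name components that exist in the catalog, the frontend keeps control over what is actually rendered; an unknown type fails loudly instead of being drawn.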

RELEASE 3w ago
Ollama supports Hermes Agent in v0.21 with `ollama launch hermes`

Ollama 0.21 added native Hermes Agent support through the `ollama launch hermes` command. That makes a self-improving local agent loop available without a hosted inference stack, with memory and skills running on top of Ollama’s model serving.

RELEASE 4w ago
Codex adds background computer use on macOS with 90+ plugins and SSH devboxes

OpenAI expanded Codex with background Mac computer use, an in-app browser, image generation, memory preview, automations, and 90+ plugins. The release moves Codex from terminal coding toward long-running UI and ops workflows, though some features remain macOS-first or alpha.

RELEASE 4w ago
Claude Opus 4.7 releases with xhigh effort, `/ultrareview`, and 3x vision resolution

Claude Opus 4.7 is now generally available across Claude, the API, and major clouds with xhigh effort, higher-resolution vision, and Claude Code review upgrades. Prompt behavior, tokenization, and effort defaults changed enough that existing harnesses may need retuning.

RELEASE 4w ago
Qwen3.6-35B-A3B releases Apache 2.0 sparse MoE with 3B active params

Alibaba open-sourced Qwen3.6-35B-A3B, a 35B multimodal sparse MoE with only 3B active parameters under Apache 2.0. Same-day support from vLLM, Ollama, SGLang, and GGUF builders makes it immediately usable for local and production coding workloads.

RELEASE 4w ago
OpenAI Agents SDK adds sandbox execution and memory controls with Vercel, Modal, E2B and Daytona

OpenAI updated the Agents SDK with sandbox execution, memory controls and run snapshotting, and launch partners Vercel, Modal, E2B and Daytona shipped integrations. Long-running agents can now keep files, credentials and execution state in isolated runtimes instead of wiring harness, compute and storage layers together manually.

RELEASE 4w ago
Windsurf 2.0 integrates Devin for cloud agents that keep running after the IDE closes

Windsurf 2.0 launched with Devin embedded into the product, combining local agents with cloud agents that can continue across codebases after you close the laptop. The IDE now acts as a handoff layer between interactive edits and long-running remote execution.

NEWS 4w ago
Claude Code ships Routines in research preview with API and webhook triggers

Anthropic introduced Claude Code Routines, a cloud-run automation layer that can execute on schedules, API calls, and GitHub events. The rollout moves scheduling from local runs to hosted, persistent automation and adds new trigger surfaces for plan-wide use.

RELEASE 4w ago
Claude Code updates desktop app with side-by-side sessions and integrated terminal

Anthropic rebuilt Claude Code on desktop into a drag-and-drop multi-session workspace with file editing, HTML and PDF preview, and sidebar session management. The same rollout also shipped 2.1.108 features, including an optional 1-hour cache TTL, recap, and new built-ins that affect cost and session handoff.

RELEASE 4w ago
Hermes Agent releases v0.9.0 with a local dashboard and monitoring APIs

Nous Research shipped Hermes Agent v0.9.0 with a local web dashboard, new monitoring APIs, and broader platform updates. Teams using multi-agent workflows should test the new controls for profile cloning and long-running dashboard-managed sessions.

RELEASE 4w ago
Open Agents launches a browser-based cloud coding platform with parallel sessions

Open Agents open-sourced a browser-based cloud coding platform that keeps sessions running in parallel after a laptop closes. Use the reference stack if you want sandboxed VMs, model routing, and durable execution for internal coding-agent systems.

RELEASE 4w ago
Cursor updates Cursor 3 with split agents and 87% fewer dropped frames

Cursor 3 adds split-agent panes, tighter cloud-agent controls, voice input fixes, and an 87% reduction in dropped frames during large edits. The update makes the IDE easier to use as a mixed local-cloud agent workspace, while keeping editor navigation and diff review intact.

NEWS 4w ago
Claude Code users report Opus 4.6 quality drop as BridgeBench retest falls to 68.3%

Fresh retests and issue threads point to worse Claude Code behavior, with Opus 4.6 falling to 68.3% on BridgeBench and users surfacing buried reasoning-effort controls. Track quota burn, hidden effort settings, and rollback reports before assigning more coding-agent work.

RELEASE 4w ago
Hermes Agent adds `/debug` log sharing and automatic OpenClaw import

Hermes Agent shipped automatic OpenClaw migration, pastebin log sharing, and a reported 20% improvement in loading the right skill. Use the new import path and debug sharing to simplify setup across the official and community add-ons now covering support, web UI, workspace boards, and chat front ends.

NEWS 4w ago
Hermes Agent ranks #1 on OpenRouter for coding apps

Nous said Hermes became the top coding app on OpenRouter while shipping an OpenClaw migration patch, Telegram agent-to-agent messaging, and new memory controls. If you run long-lived agents, watch the migration path and memory settings before moving chats or skills hubs.

RELEASE 4w ago
MiniMax releases M2.7 open model with 56.22% SWE-Pro and 57.0% Terminal Bench 2

MiniMax open-sourced M2.7 and published coding and agent benchmark claims including 56.22% SWE-Pro and 57.0% Terminal Bench 2. Day-zero support from SGLang, vLLM, Ollama Cloud, Together AI, and NVIDIA NIM makes it easy to try on common serving stacks.

NEWS 4w ago
Meerkat reports harness-level cheating across 28+ submissions on nine agent benchmarks

Meerkat and Berkeley RDI audits said popular agent leaderboards were inflated by harness-level leakage and eval gaming, with one cleaned entry dropping from first to 14th. That makes published coding-agent rankings and benchmark comparisons less reliable, so treat leaderboard results with caution.

RELEASE 4w ago
Codex 0.120 adds per-project memory extensions and Realtime V2 streaming

Codex 0.120 introduced per-project memory extension files and Realtime V2 progress streaming for background agents. Separate app findings also showed an unreleased Scratchpad view that can start parallel Codex chats from a task list, which may change how teams queue work.

NEWS 4w ago
Vercel Sandbox benchmarks sub-500 ms `node -v` cold starts

Vercel said Sandbox is now the fastest microVM-based runtime, with fresh `node -v` cold starts largely under 500 ms after a month of tuning. The update also puts persistent sandboxes into beta and expands plans for a programmable firewall, so teams should re-check runtime and security settings.

NEWS 4w ago
MirrorCode benchmarks Claude Opus 4.6 on a 16,000-line software reimplementation

Epoch AI and METR introduced MirrorCode, a long-horizon benchmark where models reimplement software from execution-only access; Opus 4.6 completed a 16,000-line bioinformatics toolkit. The authors say oracle tests and memorization risks still limit how directly the result maps to everyday software work.

NEWS 4w ago
GLM-5.1 ranks #3 on Code Arena

Arena ranked GLM-5.1 third on Code Arena and first among open models, putting it on par with Claude Sonnet 4.6 and within about 20 points of the overall lead. The ranking gives the open model a frontier-level coding benchmark result following its initial release and hosting wave.

RELEASE 4w ago
Claude Code launches `/ultraplan` preview with web planning and cloud execution

Anthropic launched `/ultraplan`, moving Claude Code planning into a web review flow with cloud execution or terminal handoff. Claude Code 2.1.101 also adds OS certificate-store trust by default, a command-injection fix, and new prompt rules for browser validation and prompt caching.

NEWS 4w ago
ClawShop launches OpenClaw resources with SecretRef and PinchBench

Kilo Code’s ClawShop recap bundled a 30-minute KiloClaw setup workshop, SecretRef credential handling, searchable ClawBytes guides, and PinchBench for agentic performance. The event, OpenClaw 2026.4.10, and PetClaw together added new security, memory, budgeting, and desktop layers around the OpenClaw stack.

RELEASE 4w ago
Qwen Code updates v0.14.2 with Channels, Cron Jobs, and Qwen3.6-Plus

Qwen Code added phone-based control via Telegram, DingTalk, and WeChat, scheduled agent loops, per-subagent model selection, and a planning mode before execution. The release also centers Qwen3.6-Plus, which Alibaba says offers 1M context and 1,000 free daily requests, while Vals ranked the model #17 overall and #11 multimodal.

RELEASE 1mo ago
Anthropic adds beta advisor tool to Messages API for Opus calls

Anthropic added a beta advisor tool to the Messages API so Sonnet or Haiku can call Opus mid-run inside one request. Anthropic says Sonnet plus Opus scored 2.7 points higher on SWE-bench Multilingual while cutting per-task cost 11.9%.

NEWS 1mo ago
OpenAI launches $100 ChatGPT Pro tier with 5x more Codex usage

OpenAI added a $100 ChatGPT Pro tier with 5x more Codex usage than Plus and kept the $200 tier as the highest-capacity option. The launch also resets Codex limits again and temporarily doubles Pro usage through May 31.

RELEASE 1mo ago
Anthropic launches Claude Managed Agents public beta with hosted sandboxes and outcome-based runs

Anthropic put Claude Managed Agents into public beta with hosted sandboxes, vaults, memory filesystems, and long-running sessions. Use the managed setup if you want explicit controls for tools, credentials, and completion criteria instead of custom harness code.

RELEASE 1mo ago
Hermes Agent updates to v0.8.0 with Browser Use, remote backends, and worktree parallelism

Hermes Agent v0.8.0 added remote code-execution backends, Browser Use cloud browsing, prompt caching, shared sessions, and CLI workflow upgrades like `hermes -w`. Try the new browser-backed and parallel execution paths if you need more persistent, multi-provider agent runs.

NEWS 1mo ago
GLM-5.1 lands on Modal, Together AI, Letta Code, and Tembo

Providers and agent platforms added GLM-5.1 endpoints across Modal, Together AI, Letta Code, Tembo, and Tabbit, with free trials, no-key access, and 99.9% SLA options. Use the new hosting options to test the model for coding and long-horizon agent workloads without waiting on self-hosting.

NEWS 1mo ago
OpenAI resets Codex usage limits after 3 million weekly users

OpenAI said Codex reached 3 million weekly users and reset usage limits, with another reset planned for each additional million users up to 10 million. ChatGPT-sign-in Codex will also retire the gpt-5.2 and gpt-5.1-era lineup on April 14, so teams should watch for model-default changes.

RELEASE 1mo ago
Z.ai releases GLM-5.1, a 744B open model with 58.4 SWE-Bench Pro and 8-hour agent runs

Z.ai released GLM-5.1, a 744B open model built for long-horizon agentic coding and ranked first among open systems on SWE-Bench Pro. Day-0 support in OpenRouter, Ollama, SGLang, vLLM, OpenCode, and local quantization paths makes it ready to test in existing stacks.

NEWS 1mo ago
Hermes Agent adds MiniMax M2.7 and MiMo V2 Pro through partner integrations

Nous Research added MiniMax M2.7, Xiaomi’s MiMo V2 Pro, a SuperMemory plugin, and expanded Manim support to Hermes through partner integrations. The additions give users new hosted model options, a shared memory backend, and more complete technical-animation tooling to try in workflows.

RELEASE 1mo ago
OpenClaw 2026.4.7 adds a headless inference hub, memory-wiki, and webhook TaskFlows

OpenClaw 2026.4.7 adds a headless inference hub, memory-wiki, session branch and restore, and webhook-driven TaskFlows. Composio also shipped a CLI for secure app authentication, so users can expand OpenClaw from a local coding harness into a broader agent runtime.

WORKFLOW 1mo ago
Bram Cohen compares vibe coding with AI Level 6 workflows after Claude Code leak

Bram Cohen used the Claude Code leak to argue that prompt-only development produces bad software, while a separate account of a 250-hour syntaqlite build said the durable version arrived only after a Python-to-Rust rewrite. Practitioners say specs, tests, linters, repo skills, and codebase context are the controls that keep coding agents maintainable.

NEWS 1mo ago
OpenClaw adds direct Claude Code and ClawHub listener routes

Builders shipped a direct Claude Code harness and a ClawHub marketplace skill for OpenClaw workflows. Use these routes to wire agent tooling into OpenClaw, but watch Claude API limits and token burn costs.

NEWS 1mo ago
Anthropic cuts Claude subscription access for third-party harnesses in Apr. 4 rollout

Anthropic’s Apr. 4 cutoff for using Claude subscriptions through OpenClaw-class harnesses went live. Users report API-billing fallbacks, ACP workarounds, and restored Claude Code quota, while edge cases around `claude -p` and Agent SDK use remain unsettled. The change pushes heavy agent loops toward metered access.

RELEASE 1mo ago
Hermes Agent adds `/claude-code` orchestration and cron hooks

Hermes Agent added direct `/claude-code` orchestration and cron-time script hooks, and the team also shipped Hermes-focused datasets and agent-tuned model variants. The update turns Hermes into a harness that can steer Claude Code and inject recurring context automatically.


© 2026 AI Primer. All rights reserved.