Agent Readiness
How prepared a codebase and environment are for agents.
Stories
Perceptron launched Mk1, a multimodal model for video and embodied reasoning with native 2 FPS video, 32K context, and structured spatial outputs. OpenRouter access and the low input price make it usable for deployment, not just demos.
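Because the model is reachable through OpenRouter's OpenAI-compatible chat completions endpoint, a video prompt can be sent as sampled frames in standard image content parts. A minimal sketch follows; the model slug "perceptron/mk1" is an assumption, so check OpenRouter's model list for the real identifier.

```python
# Sketch: build a chat-completions payload that sends sampled video frames
# to a multimodal model via OpenRouter's OpenAI-compatible API.
# The model slug below is an assumption, not a confirmed identifier.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_frame_request(frames_b64: list[str], question: str,
                        model: str = "perceptron/mk1") -> dict:
    """Assemble a payload with a text question plus base64 JPEG frames."""
    content = [{"type": "text", "text": question}]
    for frame in frames_b64:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame}"},
        })
    return {"model": model, "messages": [{"role": "user", "content": content}]}

payload = build_frame_request(
    ["<frame0>", "<frame1>"],
    "Where is the red mug relative to the robot arm?",
)
# Send with: requests.post(OPENROUTER_URL,
#   headers={"Authorization": f"Bearer {api_key}"}, json=payload)
```

Sampling frames yourself at the model's native 2 FPS keeps the payload small and aligns with how the model consumes video.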
METR said an early Claude Mythos Preview snapshot reached at least a 16-hour 50% time horizon, though the suite contains only five tasks at that range. The result matters because Mythos is beyond METR's stable measurement band, so cross-model comparisons are less reliable.
DeepSeek began rolling out Vision beta as a new image-understanding mode in Chat, and early testers reported fast OCR and strong object recognition. The rollout appears limited or staggered, so watch for broader access and formal docs before relying on it.
Plurai launched vibe-training to turn natural-language intents into task-specific eval and guardrail APIs backed by small models. That matters because it positions SLM-based checks as a faster, cheaper alternative to frontier LLM judges for production agents.
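The general pattern behind SLM-based checks is a cheap gate between the agent's output and whatever consumes it. A minimal sketch, where `slm_classify` is a toy stand-in for a call to a hosted small-model guardrail API (Plurai's actual endpoint and schema are not shown here):

```python
# Pattern sketch: gate agent outputs through a small, task-specific checker
# before they reach a user or a tool. slm_classify is a stub standing in
# for a real small-model guardrail endpoint.
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    allowed: bool
    reason: str

def slm_classify(text: str) -> GuardrailResult:
    """Stub verdict; a real deployment would call a served small model."""
    banned = ("DROP TABLE", "rm -rf")
    for pattern in banned:
        if pattern in text:
            return GuardrailResult(False, f"matched banned pattern: {pattern}")
    return GuardrailResult(True, "ok")

def guarded(action_output: str) -> str:
    """Raise if the checker blocks the output; otherwise pass it through."""
    verdict = slm_classify(action_output)
    if not verdict.allowed:
        raise ValueError(f"guardrail blocked output: {verdict.reason}")
    return action_output
```

Because the check is a small model rather than a frontier judge, it can run on every agent step without dominating latency or cost.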
Anthropic put memory into public beta for Claude Managed Agents, storing retained context as files developers can export and edit. The change lets agent state persist across sessions without a separate memory service.
Google DeepMind shipped Gemini Robotics-ER 1.6 to the Gemini API and AI Studio with better visual-spatial reasoning, multi-view success detection, and gauge reading. With a 93% instrument-reading score, it targets robots that need to reason over cluttered scenes and physical constraints.

LangChain launched Deep Agents Deploy in beta as a production path for open, model-agnostic agent harnesses configured with AGENTS.md, skills, and mcp.json. Deployments run on LangSmith and can expose MCP, A2A, and the Agent Protocol while teams choose models and sandbox providers.
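The mcp.json piece follows the common MCP server-configuration shape used across MCP clients; a minimal sketch wiring in the reference filesystem server (the server name and workspace path here are illustrative):

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
    }
  }
}
```

Each named entry declares how the harness launches one MCP server, so adding tools is a config change rather than a code change.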
New third-party tests put MiniMax M2.7 at a 34% hallucination rate, roughly 65 tokens per second, and 27.04% on Vibe Code Bench while users pushed it through physics-heavy web demos. It looks increasingly viable for agent workflows, but performance still swings by task and harness.
Google DeepMind and Kaggle opened a global challenge to build cognitive benchmarks across learning, metacognition, attention, executive function, and social cognition. Join if you work on evals and want reusable tasks with human baselines instead of another saturated leaderboard.
Manus moved from a cloud sandbox onto local machines with My Computer, a desktop app that can organize files, run commands, and build apps on macOS and Windows. Use it if you want agent workflows over private local data and hardware instead of a remote browser sandbox.
Factory released an analytics layer for teams deploying coding agents, surfacing usage, tool calls, activity, and productivity from tokens through pull requests. Use it if you need ROI, readiness, and cost visibility as agent adoption scales.
supermemory launched a CLI that exposes platform actions directly to agents and added scoped agent access with tag-level permissions plus audit logs. Use it to wire memory into agent loops without granting a full account.
Third-party MRCR v2 results put Claude Opus 4.6 at a 78.3% match ratio at 1M tokens, ahead of Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. If you are testing long-context agents, measure retrieval quality and task completion, not just advertised context window size.
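A retrieval-quality check in that spirit can be as simple as scoring how closely the model reproduces a planted needle, then counting the fraction of cases that clear a threshold. A simplified sketch, not the benchmark's exact metric:

```python
# Simplified MRCR-style "match ratio" check: score the model's reproduced
# needle against the expected text instead of trusting the advertised
# context window. Not the benchmark's exact scoring.
from difflib import SequenceMatcher

def match_ratio(expected: str, retrieved: str) -> float:
    """Character-level similarity in [0, 1] between expected and retrieved."""
    return SequenceMatcher(None, expected, retrieved).ratio()

def evaluate(cases: list[tuple[str, str]], threshold: float = 0.8) -> float:
    """Fraction of (expected, retrieved) pairs clearing the threshold."""
    passed = sum(match_ratio(e, r) >= threshold for e, r in cases)
    return passed / len(cases)
```

Running this at several context depths (needle at 10%, 50%, 90% of the window) exposes degradation that a single headline context-length number hides.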
Markov AI released Computer Use Large on Hugging Face with 48,478 screen recordings spanning about 12,300 hours across six professional apps. Use it to train and evaluate GUI agents on real software workflows with a large CC-BY dataset.
Tiiny claims its pocket-sized local AI server can run open models up to 120B and expose an OpenAI-compatible local API without token fees. Privacy-sensitive teams should validate throughput and model quality before deploying always-on local agents.
NVIDIA released Nemotron 3 Super, a 120B open model with 1M-token context and a hybrid architecture tuned for agent workloads, then landed it in Perplexity and Baseten. Try it if you need an open-weight long-context option that is already available in hosted stacks.
Meta acquired Moltbook and is bringing its founders into Meta Superintelligence Labs as it bets on agent identity and social coordination layers. Watch how Meta productizes registry, verification, and cross-agent discovery for agent ecosystems.
Nous Research released a self-evolution package for Hermes Agent that uses DSPy and GEPA to optimize skills, prompts, and code, and reported a phase-one score increase from 0.408 to 0.569 on one skill. Agent teams can study the repo for fallback model, memory, and self-improvement loop patterns.
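At its core, this family of optimizers runs a propose/evaluate/keep loop over a candidate artifact (a prompt, skill, or code snippet). A toy sketch of the loop shape; the mutation and scoring functions below are illustrative stand-ins, not the Hermes Agent or GEPA implementation:

```python
# Generic propose/evaluate/keep loop behind prompt- and skill-optimization
# schemes: mutate a candidate, score it, keep it only on improvement.
# score_fn and mutate_fn here are toy stand-ins.
import random

def optimize(candidate: str, score_fn, mutate_fn, steps: int = 20, seed: int = 0):
    rng = random.Random(seed)
    best, best_score = candidate, score_fn(candidate)
    for _ in range(steps):
        proposal = mutate_fn(best, rng)
        proposal_score = score_fn(proposal)
        if proposal_score > best_score:   # greedy hill climb on eval score
            best, best_score = proposal, proposal_score
    return best, best_score

# Toy demo: push a prompt toward containing a target instruction.
score = lambda p: p.count("cite sources")
mutate = lambda p, rng: p + " cite sources" if rng.random() < 0.5 else p
best, best_score = optimize("Answer concisely.", score, mutate)
```

Real systems replace the toy scorer with a task eval suite and the mutator with an LLM proposing edits, which is where the reported 0.408-to-0.569 phase-one gain came from.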