Agent Readiness
How prepared a codebase and environment are for agents.
Stories
Perceptron launched Mk1, a multimodal model for video and embodied reasoning with native 2 FPS video, 32K context, and structured spatial outputs. OpenRouter access and the low input price make it usable for deployment, not just demos.
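Because the model is reachable through OpenRouter's OpenAI-compatible chat completions endpoint, a video prompt can be sent as sampled frames in standard image content parts. A minimal sketch follows; the model slug "perceptron/mk1" is an assumption, so check OpenRouter's model list for the real identifier.

```python
# Sketch: build a chat-completions payload that sends sampled video frames
# to a multimodal model via OpenRouter's OpenAI-compatible API.
# The model slug below is an assumption, not a confirmed identifier.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_frame_request(frames_b64: list[str], question: str,
                        model: str = "perceptron/mk1") -> dict:
    """Assemble a payload with a text question plus base64 JPEG frames."""
    content = [{"type": "text", "text": question}]
    for frame in frames_b64:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame}"},
        })
    return {"model": model, "messages": [{"role": "user", "content": content}]}

payload = build_frame_request(
    ["<frame0>", "<frame1>"],
    "Where is the red mug relative to the robot arm?",
)
# Send with: requests.post(OPENROUTER_URL,
#   headers={"Authorization": f"Bearer {api_key}"}, json=payload)
```

Sampling frames yourself at the model's native 2 FPS keeps the payload small and aligns with how the model consumes video.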
METR said an early Claude Mythos Preview snapshot reached at least a 16-hour 50% time horizon, though the suite contains only five tasks at that range. The result matters because Mythos is beyond METR's stable measurement band, so cross-model comparisons are less reliable.
DeepSeek began rolling out Vision beta as a new image-understanding mode in Chat, and early testers reported fast OCR and strong object recognition. The rollout appears limited or staggered, so watch for broader access and formal docs before relying on it.
Plurai launched vibe-training to turn natural-language intents into task-specific eval and guardrail APIs backed by small models. That matters because it positions SLM-based checks as a faster, cheaper alternative to frontier LLM judges for production agents.
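The general pattern behind SLM-based checks is a cheap gate between the agent's output and whatever consumes it. A minimal sketch, where `slm_classify` is a toy stand-in for a call to a hosted small-model guardrail API (Plurai's actual endpoint and schema are not shown here):

```python
# Pattern sketch: gate agent outputs through a small, task-specific checker
# before they reach a user or a tool. slm_classify is a stub standing in
# for a real small-model guardrail endpoint.
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    allowed: bool
    reason: str

def slm_classify(text: str) -> GuardrailResult:
    """Stub verdict; a real deployment would call a served small model."""
    banned = ("DROP TABLE", "rm -rf")
    for pattern in banned:
        if pattern in text:
            return GuardrailResult(False, f"matched banned pattern: {pattern}")
    return GuardrailResult(True, "ok")

def guarded(action_output: str) -> str:
    """Raise if the checker blocks the output; otherwise pass it through."""
    verdict = slm_classify(action_output)
    if not verdict.allowed:
        raise ValueError(f"guardrail blocked output: {verdict.reason}")
    return action_output
```

Because the check is a small model rather than a frontier judge, it can run on every agent step without dominating latency or cost.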
Anthropic put memory into public beta for Claude Managed Agents, storing retained context as files developers can export and edit. The change lets agent state persist across sessions without a separate memory service.
Google DeepMind shipped Gemini Robotics-ER 1.6 to the Gemini API and AI Studio with better visual-spatial reasoning, multi-view success detection, and gauge reading. With a 93% instrument-reading score, it targets robots that need to reason over cluttered scenes and physical constraints.

LangChain launched Deep Agents Deploy in beta as a production path for open, model-agnostic agent harnesses configured with AGENTS.md, skills, and mcp.json. Deployments run on LangSmith and can expose MCP, A2A, and the Agent Protocol while teams choose models and sandbox providers.
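The mcp.json piece follows the common MCP server-configuration shape used across MCP clients; a minimal sketch wiring in the reference filesystem server (the server name and workspace path here are illustrative):

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
    }
  }
}
```

Each named entry declares how the harness launches one MCP server, so adding tools is a config change rather than a code change.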
New third-party tests put MiniMax M2.7 at a 34% hallucination rate, roughly 65 tokens per second, and 27.04% on Vibe Code Bench while users pushed it through physics-heavy web demos. It looks increasingly viable for agent workflows, but performance still swings by task and harness.
Google DeepMind and Kaggle opened a global challenge to build cognitive benchmarks across learning, metacognition, attention, executive function, and social cognition. Join if you work on evals and want reusable tasks with human baselines instead of another saturated leaderboard.
Manus moved from a cloud sandbox onto local machines with My Computer, a desktop app that can organize files, run commands, and build apps on macOS and Windows. Use it if you want agent workflows over private local data and hardware instead of a remote browser sandbox.
Factory released an analytics layer for teams deploying coding agents, surfacing usage, tool calls, activity, and productivity from tokens through pull requests. Use it if you need ROI, readiness, and cost visibility as agent adoption scales.
supermemory launched a CLI that exposes platform actions directly to agents and added scoped agent access with tag-level permissions plus audit logs. Use it to wire memory into agent loops without granting a full account.
Third-party MRCR v2 results put Claude Opus 4.6 at a 78.3% match ratio at 1M tokens, ahead of Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. If you are testing long-context agents, measure retrieval quality and task completion, not just advertised context window size.
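A retrieval-quality check in that spirit can be as simple as scoring how closely the model reproduces a planted needle, then counting the fraction of cases that clear a threshold. A simplified sketch, not the benchmark's exact metric:

```python
# Simplified MRCR-style "match ratio" check: score the model's reproduced
# needle against the expected text instead of trusting the advertised
# context window. Not the benchmark's exact scoring.
from difflib import SequenceMatcher

def match_ratio(expected: str, retrieved: str) -> float:
    """Character-level similarity in [0, 1] between expected and retrieved."""
    return SequenceMatcher(None, expected, retrieved).ratio()

def evaluate(cases: list[tuple[str, str]], threshold: float = 0.8) -> float:
    """Fraction of (expected, retrieved) pairs clearing the threshold."""
    passed = sum(match_ratio(e, r) >= threshold for e, r in cases)
    return passed / len(cases)
```

Running this at several context depths (needle at 10%, 50%, 90% of the window) exposes degradation that a single headline context-length number hides.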
Markov AI released Computer Use Large on Hugging Face with 48,478 screen recordings spanning about 12,300 hours across six professional apps. Use it to train and evaluate GUI agents on real software workflows with a large CC-BY dataset.
Tiiny claims its pocket-sized local AI server can run open models up to 120B and expose an OpenAI-compatible local API without token fees. Privacy-sensitive teams should validate throughput and model quality before deploying always-on local agents.
NVIDIA released Nemotron 3 Super, a 120B open model with 1M-token context and a hybrid architecture tuned for agent workloads, then landed it in Perplexity and Baseten. Try it if you need an open-weight long-context option that is already available in hosted stacks.
Meta acquired Moltbook and is bringing its founders into Meta Superintelligence Labs as it bets on agent identity and social coordination layers. Watch how Meta productizes registry, verification, and cross-agent discovery for agent ecosystems.
Nous Research released a self-evolution package for Hermes Agent that uses DSPy and GEPA to optimize skills, prompts, and code, and reported a phase-one score increase from 0.408 to 0.569 on one skill. Agent teams can study the repo for fallback model, memory, and self-improvement loop patterns.
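At its core, this family of optimizers runs a propose/evaluate/keep loop over a candidate artifact (a prompt, skill, or code snippet). A toy sketch of the loop shape; the mutation and scoring functions below are illustrative stand-ins, not the Hermes Agent or GEPA implementation:

```python
# Generic propose/evaluate/keep loop behind prompt- and skill-optimization
# schemes: mutate a candidate, score it, keep it only on improvement.
# score_fn and mutate_fn here are toy stand-ins.
import random

def optimize(candidate: str, score_fn, mutate_fn, steps: int = 20, seed: int = 0):
    rng = random.Random(seed)
    best, best_score = candidate, score_fn(candidate)
    for _ in range(steps):
        proposal = mutate_fn(best, rng)
        proposal_score = score_fn(proposal)
        if proposal_score > best_score:   # greedy hill climb on eval score
            best, best_score = proposal, proposal_score
    return best, best_score

# Toy demo: push a prompt toward containing a target instruction.
score = lambda p: p.count("cite sources")
mutate = lambda p, rng: p + " cite sources" if rng.random() < 0.5 else p
best, best_score = optimize("Answer concisely.", score, mutate)
```

Real systems replace the toy scorer with a task eval suite and the mutator with an LLM proposing edits, which is where the reported 0.408-to-0.569 phase-one gain came from.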