Agent readiness: how prepared a codebase and its environment are for agents.
New third-party tests put MiniMax M2.7 at a 34% hallucination rate, roughly 65 tokens per second, and 27.04% on Vibe Code Bench while users pushed it through physics-heavy web demos. It looks increasingly viable for agent workflows, but performance still swings by task and harness.
Google DeepMind and Kaggle opened a global challenge to build cognitive benchmarks across learning, metacognition, attention, executive function, and social cognition. Join if you work on evals and want reusable tasks with human baselines instead of another saturated leaderboard.
Manus moved from a cloud sandbox onto local machines with My Computer, a desktop app that can organize files, run commands, and build apps on macOS and Windows. Use it if you want agent workflows over private local data and hardware instead of a remote browser sandbox.
Factory released an analytics layer for teams deploying coding agents, surfacing usage, tool calls, activity, and productivity from tokens through pull requests. Use it if you need ROI, readiness, and cost visibility as agent adoption scales.
supermemory launched a CLI that exposes platform actions directly to agents and added scoped agent access with tag-level permissions plus audit logs. Use it to wire memory into agent loops without granting full account access.
Third-party MRCR v2 results put Claude Opus 4.6 at a 78.3% match ratio at 1M tokens, ahead of Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. If you are testing long-context agents, measure retrieval quality and task completion, not just advertised context window size.
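Measuring retrieval quality yourself is cheap: compare what the model reproduces against the reference needle with a string match ratio, the style of metric MRCR reports. A minimal sketch using Python's standard-library difflib; the example pairs are placeholders, not actual benchmark data.

```python
from difflib import SequenceMatcher


def match_ratio(expected: str, actual: str) -> float:
    """Similarity in [0, 1] between the reference text and the model output."""
    return SequenceMatcher(None, expected, actual).ratio()


# Placeholder pairs: (reference answer, model output) collected from
# long-context retrieval prompts against the model under test.
cases = [
    ("write a poem about tapirs", "write a poem about tapirs"),
    ("write a story about otters", "write a story about owls"),
]
scores = [match_ratio(expected, actual) for expected, actual in cases]
print(sum(scores) / len(scores))
```

Averaging this ratio over many needle placements at a given context length gives a task-level number you can compare across models, independent of what the spec sheet claims.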
Markov AI released Computer Use Large on Hugging Face with 48,478 screen recordings spanning about 12,300 hours across six professional apps. Use it to train and evaluate GUI agents on real software workflows with a large CC-BY dataset.
Tiiny claims its pocket-sized local AI server can run open models up to 120B parameters and expose an OpenAI-compatible local API without token fees. Privacy-sensitive teams should validate throughput and model quality before deploying always-on local agents.
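If the API is OpenAI-compatible as claimed, existing tooling should need only a base-URL change. A hedged stdlib-only sketch of the request shape such a server would accept; the URL, port, and model id are assumptions, not Tiiny's documented values.

```python
import json
from urllib import request

# Hypothetical local endpoint; substitute whatever the device actually exposes.
API_URL = "http://localhost:8080/v1/chat/completions"


def build_payload(prompt: str, model: str = "local-model") -> bytes:
    """OpenAI-style chat request body; any OpenAI client could send the same."""
    return json.dumps({
        "model": model,  # placeholder id; use the model name the server reports
        "messages": [{"role": "user", "content": prompt}],
    }).encode()


def ask(prompt: str) -> str:
    req = request.Request(
        API_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the wire format matches OpenAI's chat completions, swapping a cloud agent loop onto local hardware should mostly be a configuration change rather than a rewrite, which is worth verifying during the throughput tests above.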
NVIDIA released Nemotron 3 Super, a 120B open model with 1M-token context and a hybrid architecture tuned for agent workloads, then landed it in Perplexity and Baseten. Try it if you need an open-weight long-context option that is already available in hosted stacks.
Meta acquired Moltbook and is bringing its founders into Meta Superintelligence Labs as it bets on agent identity and social coordination layers. Watch how Meta productizes registry, verification, and cross-agent discovery for agent ecosystems.
Nous Research released a self-evolution package for Hermes Agent that uses DSPy and GEPA to optimize skills, prompts, and code, and reported a phase-one score increase from 0.408 to 0.569 on one skill. Agent teams can study the repo for patterns around model fallback, memory, and self-improvement loops.