Observability
Tracing, logging, monitoring, and diagnosis for AI systems.
Stories
Claude Code 2.1.193 routes all shell commands through auto-mode classification, adds live file path autocomplete in bash mode, and can emit assistant-response OpenTelemetry events. It also changes denial logging and response-logging defaults for teams instrumenting the CLI.
Latitude released an open-source platform for monitoring AI agents in production, with plain-English trace search, repeated-failure clustering, and MCP access from coding agents. That gives teams a self-hostable way to inspect token burn, surface recurring failures, and turn production traces into evals and fixes.
Nous shipped Hermes Agent v0.16.0 with a desktop GUI, a rebuilt browser dashboard, remote auth options, and full Simplified Chinese UI coverage. The release moves Hermes beyond a terminal-only workflow and into a broader admin and desktop control surface.
LangSmith added sandboxed execution, spend-aware gateway routing, and Engine to surface recurring agent failures from traces. The bundle gives teams one place to run agents, control token spend, and turn production issues into debugging and eval loops.
Prime Intellect launched Hosted Evaluations to manage harnesses, sandboxes, and rollout inspection for model testing. The service packages eval infrastructure while still supporting local runs against arbitrary engines, so teams can centralize testing without losing flexibility.
Weights & Biases released an MCP server that exposes experiment data to Claude Code, Cursor, Codex, Gemini CLI, and Le Chat. The schema-first design helps agents inspect available metrics before pulling rows, which can prevent preview runs from overflowing context windows.
Lovable described a production loop where an is_stuck classifier detects repeated failures, Overflow injects past solution pairs, and send_feedback escalates real tool failures. The system lowered the stuck rate by 5% and raised the publish rate by 2%, so teams can use the same signal to debug both outages and agent frustration.
Anthropic released Claude Code 2.1.145 with JSON session listing for scripting, Bash execution inside Tool, and richer OTEL span metadata. Update if you rely on automation, and review the fix for the environment-variable approval bypass plus the UI bug fixes.
Cognition launched Devin Auto-Triage to watch issues across Slack, Linear, GitHub, schedules, webhooks, and observability tools. Teams can use it as an always-on investigation flow that returns context, next steps, or a PR.
Claude Console now shows which message, system prompt, tool, or model change caused a cache miss and how many tokens it cost. That matters because teams can trace prompt-cost regressions to specific edits instead of debugging cache churn blind.
Kilo Code posted two cloud-agent automations: a webhook-driven CVE patch flow that opens PRs in parallel and a post-deploy smoke test that checks health, 2xx responses, and latency under 2 seconds. This matters because the examples show coding agents moving into CI-style remediation and production verification loops.
Workshop open-sourced a local agent debugging tool that exposes traces for humans, Codex, and Claude Code with a one-line install and GitHub repo. It turns agent runs into something teams can inspect and reuse for evals instead of treating terminal sessions as black boxes.
LangChain unveiled SmithDB, LangSmith Engine, Managed Deep Agents, and GA sandboxes at Interrupt. The stack gives agent teams a purpose-built trace database, autonomous failure triage, and managed execution environments for production workflows.
Independent Pi builders shipped a voice layer, a kanban and observability dashboard, a Codex-conversion tool with `apply_patch`, and smaller UI extensions in the same window. The burst matters because it turns Pi from a single coding agent into a real local-first extension ecosystem with voice, review, and workflow primitives.
Claude Code 2.1.132 added env vars to keep native terminal scrollback and to pass session IDs into Bash subprocesses, plus graceful shutdown fixes. It also moved risky-action confirmation earlier in the system prompt and changed tracing behavior for hooks.
Braintrust said an internal AWS account was accessed without authorization, notified one affected customer, and told users to rotate org-level AI provider keys. The incident matters because teams storing shared model credentials in Braintrust may need immediate secret rotation while the investigation continues.
Raindrop launched Triage, a Slack-based agent that finds traces, summarizes recurring failures, runs recurring briefs, and opens experiments from production conversations. Teams using Claude Code, Cursor, or Devin can plug it into agent ops to shorten debugging loops.
OpenRouter added response caching across chat, responses, messages, and embeddings with per-key isolation, TTL controls, and cached stream replay. The beta matters because identical retries and test runs can return in milliseconds without provider charges or rate-limit hits.
ml-intern now lets an agent run long post-training tasks like parallel ablations in YOLO mode and automatically pushes session traces to a Hub account for later inspection. That gives RL and fine-tuning workflows both unattended execution and a built-in audit trail.
Mistral Studio added a Workflows orchestration layer that tracks state, retries, branches, and human approvals in public preview. That lets long-running agent flows resume after failures instead of restarting from scratch.
OpenRouter introduced Workspaces to separate API keys, BYOK, routing, plugins, and observability by environment or team. Billing stays unified at the account level while staging and production settings split cleanly.
PlayerZero launched an AI production engineer and claims its world model can simulate failures before release, trace incidents to exact PRs, and beat existing tools on real production test cases. If those claims hold, the interesting shift is from code generation to debugging, testing, and observability after code ships.
LangChain published a free course on taking agents from first run to production-ready systems with LangSmith loops for observability and evals. The timing lines up with new NVIDIA integration messaging, so teams can study process and stack choices together.
LangSmith Fleet introduces shared agents with edit and run permissions, agent identity, human approvals, and tracing. That matters because enterprise agent rollout is shifting from single-user demos to governed, auditable deployment surfaces.
OpenAI described an internal system that uses its strongest models to review almost all coding-agent traffic for misalignment and suspicious behavior. It is a sign that powerful internal agents may need continuous oversight, not just pre-deployment policy checks.
LangChain rebranded Agent Builder to Fleet and added agent identity, memory, sharing controls, and LangSmith tracing for multi-user agent operations. It gives teams a governed way to deploy Slack- and GitHub-connected agents without stitching auth and auditing together by hand.
Intercom detailed an internal Claude Code platform with plugin hooks, production-safe MCP tools, telemetry, and automated feedback loops that turn sessions into new skills and GitHub issues. The patterns are useful if you are standardizing coding agents across engineering, support, and product teams.
W&B shipped robotics-focused evaluation views including synchronized video playback, pinned run baselines, semantic coloring, and side-by-side media comparisons. These tools matter if your model outputs are videos or trajectories and loss curves alone hide failure modes.
Weights & Biases shipped an iOS app that lets teams watch live metrics and receive crash alerts without staying at a laptop. Install it if you need training and eval failures to surface on the phone that already handles your paging flow.
Together GPU Clusters added autoscaling, RBAC, observability, and self-healing controls to its managed cluster product. Use it if your team is moving from ad hoc GPU pools to production training or inference and needs more platform controls out of the box.