Anthropic launches Claude Managed Agents public beta with hosted sandboxes and outcome-based runs
Anthropic put Claude Managed Agents into public beta with hosted sandboxes, vaults, memory filesystems, and long-running sessions. Use the managed setup if you want explicit controls for tools, credentials, and completion criteria instead of custom harness code.

TL;DR
- Anthropic put Claude Managed Agents into public beta as a hosted harness for long-running agent work, and the Claude API release notes frame it as secure sandboxing, built-in tools, and session streaming exposed through new API endpoints.
- Anthropic's architecture diagram and the engineering post split the system into a harness plus four surrounding layers: tools and MCP resources, session state, sandboxed execution, and orchestration.
- The first docs pass is unusually concrete: trq212's thread points to environment templates with package and network controls, while the environment docs say each session gets its own isolated container instance.
- Credentials and completion are first-class objects, not app glue. The vaults docs cover per-user credential registration, while the outcomes preview adds rubric-based grading so the agent can keep iterating until the artifact passes.
- Anthropic also wired onboarding into Claude Code. According to Lance Martin's walkthrough, the new flow starts with `claude update` and a `/claude-api managed-agents-onboarding` subcommand.
You can jump straight to the engineering writeup, skim the managed agents overview, and then fan out into the docs for environments, vaults, and outcomes. The weirdly useful bit is that Anthropic already documents a grader running in a separate context window for outcomes, and Box's launch-day demo shows partners treating this as background automation infrastructure, not a toy agent wrapper.
Harness
Anthropic's launch post sells Managed Agents as the way to stop rebuilding the same scaffolding around every serious agent. The announcement promises a path from prototype to launch, while the engineering blog tweet points to a deeper argument: harnesses encode assumptions that go stale as models improve.
The engineering post says Anthropic tried to design around stable interfaces rather than a fixed harness implementation. That is the interesting claim here. Managed Agents is less a new model feature than Anthropic productizing the control plane around long-horizon runs.
The diagram gives the four nouns that matter:
- Tools and resources / MCP: what the agent can call into.
- Session: the running work state.
- Sandbox: isolated execution.
- Orchestration: the loop coordinating turns, tools, and runtime behavior.
The overview docs make the product split explicit. Messages API is still the direct model interface. Managed Agents is the pre-built harness for long-running and asynchronous work.
Environments
Anthropic exposed the runtime as a reusable environment object instead of burying it inside each request. trq212 called out package selection and networking controls, and the environment docs say you create an environment once, then reference its ID when you start sessions.
That same page adds a few practical details that will matter to anyone comparing this with self-hosted agent loops:
- Multiple sessions can share one environment definition.
- Each session still gets its own isolated container instance.
- All requests currently require the `managed-agents-2026-04-01` beta header.
- Anthropic ships both raw API examples and `ant beta:environments create` CLI examples.
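The define-once, run-many shape of environments is easy to picture with a small model. The sketch below is a local illustration only, not Anthropic's SDK or request schema: `Environment`, `Session`, `start_session`, and every field name are hypothetical, and the header dictionary assumes the beta value is sent through the usual `anthropic-beta` header.

```python
from dataclasses import dataclass, field
from uuid import uuid4

# Beta value quoted in the environment docs. Sending it via the
# "anthropic-beta" header is an assumption based on Anthropic's usual convention.
BETA_HEADERS = {"anthropic-beta": "managed-agents-2026-04-01"}


@dataclass(frozen=True)
class Environment:
    """Hypothetical stand-in for a reusable environment definition."""
    id: str
    packages: tuple[str, ...] = ()
    allowed_hosts: tuple[str, ...] = ()


@dataclass
class Session:
    """Hypothetical session: points at an environment by ID, owns its container."""
    environment_id: str
    container_id: str = field(default_factory=lambda: f"ctr_{uuid4().hex}")


def start_session(env: Environment) -> Session:
    # Only the environment ID is reused; the container is created fresh each time.
    return Session(environment_id=env.id)


env = Environment(id="env_demo", packages=("pandas",), allowed_hosts=("api.example.com",))
a, b = start_session(env), start_session(env)
print(a.environment_id == b.environment_id)  # shared definition
print(a.container_id != b.container_id)      # separate container per session
```

The point of the split is that package and network policy live in one reviewed object, while anything a run writes to disk stays inside that run's container.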
The release notes also mention server-sent event streaming, which suggests Anthropic expects developers to watch long jobs as streams rather than poll a thin completion endpoint.
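For readers who have not consumed server-sent events directly, here is a minimal sketch of parsing the `text/event-stream` wire format with the standard library. The `status` event name and JSON payloads in the sample are made up for illustration; the source does not list the actual event types Managed Agents emits.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Yield one dict per server-sent event from an iterable of text lines."""
    fields: dict[str, str] = {}
    data: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                     # a blank line terminates the event
            if data:                       # events without data are dropped
                yield {**fields, "data": "\n".join(data)}
            fields, data = {}, []
        elif line.startswith(":"):         # comment line, often a keep-alive
            continue
        else:
            name, _, value = line.partition(":")
            if value.startswith(" "):      # one optional leading space is stripped
                value = value[1:]
            if name == "data":
                data.append(value)
            else:
                fields[name] = value


# Hypothetical stream contents, for illustration only.
sample = [
    "event: status\n",
    'data: {"state": "running"}\n',
    "\n",
    ": keep-alive\n",
    "event: status\n",
    'data: {"state": "done"}\n',
    "\n",
]
events = list(parse_sse(sample))
for event in events:
    print(event["event"], event["data"])
```

In a real client the `lines` argument would come from a streaming HTTP response rather than a list, but the framing rules stay the same.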
Vaults and memory
The docs push two state primitives alongside the sandbox: credentials and files. Vaults are Anthropic's answer to user-scoped secrets, and trq212's thread also points readers to filesystem-backed memory for persistence across sessions.
The vaults docs describe vaults as a place to register third-party credentials once, then attach them to sessions by ID. Anthropic explicitly says this avoids sending tokens on every call and helps track which end user an agent is acting for.
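The register-once, attach-by-ID pattern can be sketched with a toy in-memory registry. Everything here is hypothetical (`VaultRegistry`, `vault_ids`, the ID format); it only illustrates why a session request that carries an ID instead of a token keeps the secret out of each call while still tying the run to an end user.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class VaultEntry:
    end_user: str
    service: str
    token: str


class VaultRegistry:
    """Hypothetical in-memory stand-in for a hosted vault service."""

    def __init__(self) -> None:
        self._entries: dict[str, VaultEntry] = {}
        self._ids = count(1)

    def register(self, end_user: str, service: str, token: str) -> str:
        # Store the credential once and hand back an opaque ID.
        vault_id = f"vault_{next(self._ids)}"
        self._entries[vault_id] = VaultEntry(end_user, service, token)
        return vault_id

    def acting_for(self, vault_id: str) -> str:
        # The ID alone is enough to say which end user a session acts for.
        return self._entries[vault_id].end_user


registry = VaultRegistry()
vault_id = registry.register("user_42", "github", "example-token")

# The session request carries only the ID, never the token itself.
session_request = {"task": "triage issues", "vault_ids": [vault_id]}
print(session_request)
print(registry.acting_for(vault_id))
```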
The separate memory tool docs describe a persistent directory where Claude can create, read, update, and delete files across sessions. That gives Managed Agents a native way to carry forward learned context without stuffing everything back into the prompt.
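As a rough mental model, file-backed memory is just a directory that outlives any single run. The sketch below uses a temporary local folder and a hypothetical `MemoryDir` class; it is not the memory tool's actual command set, only the create, read, update, and delete behavior the docs describe, plus a guard against paths that escape the memory root.

```python
import tempfile
from pathlib import Path


class MemoryDir:
    """Hypothetical file-backed memory: plain files under one root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root.resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        path = (self.root / name).resolve()
        # Refuse names that escape the memory root, such as "../secrets".
        if not path.is_relative_to(self.root):
            raise ValueError(f"path escapes memory root: {name}")
        return path

    def write(self, name: str, text: str) -> None:
        # Covers both create and update.
        path = self._path(name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")

    def read(self, name: str) -> str:
        return self._path(name).read_text(encoding="utf-8")

    def delete(self, name: str) -> None:
        self._path(name).unlink()


root = Path(tempfile.mkdtemp())

# "Session 1" records something it learned.
MemoryDir(root).write("notes/build.md", "Tests need NODE_ENV=test")

# "Session 2" opens the same directory and picks the note up again.
later = MemoryDir(root)
print(later.read("notes/build.md"))
```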
Outcomes
The most opinionated part of the launch is outcomes. The docs page linked from trq212's thread describes an outcome as a target artifact plus a rubric, not another chat turn.
The outcomes page says the harness provisions a grader automatically, and that grader runs in a separate context window from the main agent. It returns a criterion-by-criterion breakdown of what passed and what is still missing.
That moves the product a step closer to workflow execution than chat orchestration. Anthropic is not just hosting an agent loop; it is also hosting the judge that decides when the loop can stop.
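The iterate-until-the-rubric-passes idea can be sketched in a few lines. This is a toy under stated assumptions, not the outcomes API: `grade`, `run_until_pass`, and `toy_agent` are hypothetical names, the rubric checks are simple string tests standing in for a model-based grader, and the separate context window is approximated by the grader seeing only the artifact and the rubric.

```python
from typing import Callable

Rubric = dict[str, Callable[[str], bool]]


def grade(artifact: str, rubric: Rubric) -> dict[str, bool]:
    """Return a criterion-by-criterion pass/fail breakdown."""
    # The grader sees only the artifact and rubric, never the agent's history.
    return {name: check(artifact) for name, check in rubric.items()}


def run_until_pass(
    revise: Callable[[str, list[str]], str],
    rubric: Rubric,
    max_rounds: int = 5,
) -> tuple[str, dict[str, bool]]:
    """Revise the artifact until every criterion passes or rounds run out."""
    artifact = ""
    report = grade(artifact, rubric)
    for _ in range(max_rounds):
        failing = [name for name, ok in report.items() if not ok]
        if not failing:
            break
        artifact = revise(artifact, failing)
        report = grade(artifact, rubric)
    return artifact, report


rubric: Rubric = {
    "has_title": lambda text: text.startswith("# "),
    "mentions_risk": lambda text: "risk" in text.lower(),
}


def toy_agent(draft: str, failing: list[str]) -> str:
    # Stand-in for the model: fix one failing criterion per round.
    if "has_title" in failing:
        return "# Report\n" + draft
    return draft + "Risk: none identified.\n"


artifact, report = run_until_pass(toy_agent, rubric)
print(report)
```

Feeding the list of failing criteria back into the next revision is what separates this from a plain retry loop, and the `max_rounds` cap is there so a rubric that can never be satisfied does not run forever.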
Claude Code onboarding
Anthropic also used Claude Code as the front door. According to Lance Martin's launch thread, the easiest starting path is updating Claude Code and running a managed-agents onboarding command from inside the CLI.
The quickstart mirrors that product shape with three core objects:
- Agent: model, system prompt, tools, MCP servers, and skills.
- Environment: container template with packages and network access.
- Session: a running agent instance for a specific task.
That is a neat packaging move. Anthropic is turning the same agent ergonomics it taught in Claude Code into an API product with hosted runtime underneath.
Box
The first partner demo already points at the workload Anthropic wants. Aaron Levie's post shows Box wiring Managed Agents into document review, data extraction, and content workflows through the Box API or MCP.
That matters because it is more specific than the launch post's generic "agents at scale" line. The earliest public example is background knowledge work over enterprise content, with Box pitching setup in minutes rather than months of infrastructure buildout.