Agent Pattern
How-to / design-pattern / best-practice stories about building or operating coding agents (delegation depth, harness design, control surfaces).
Stories
OpenAI updated the Agents SDK with sandbox execution, memory controls and run snapshotting, and launch partners Vercel, Modal, E2B and Daytona shipped integrations. Long-running agents can now keep files, credentials and execution state in isolated runtimes instead of wiring harness, compute and storage layers together manually.
Meerkat and Berkeley RDI audits said popular agent leaderboards were inflated by harness-level leakage and eval gaming, with one cleaned entry dropping from first to 14th. That makes published coding-agent rankings and benchmark comparisons less reliable, so treat leaderboard results with caution.
Vercel said Sandbox is now the fastest microVM-based runtime, with fresh node -v cold starts largely under 500 ms after a month of tuning. The update also puts persistent sandboxes into beta and expands plans for a programmable firewall, so teams should re-check runtime and security settings.
Kilo Code’s ClawShop recap bundled a 30-minute KiloClaw setup workshop, SecretRef credential handling, searchable ClawBytes guides, and PinchBench for agentic performance. The event, OpenClaw 2026.4.10, and PetClaw together added new security, memory, budgeting, and desktop layers around the OpenClaw stack.
Anthropic put Claude Managed Agents into public beta with hosted sandboxes, vaults, memory filesystems, and long-running sessions. Use the managed setup if you want explicit controls for tools, credentials, and completion criteria instead of custom harness code.
OpenClaw 2026.4.7 adds a headless inference hub, memory-wiki, session branch and restore, and webhook-driven TaskFlows. Composio also shipped a CLI for secure app authentication, so users can expand OpenClaw from a local coding harness into a broader agent runtime.
Bram Cohen used the Claude Code leak to argue that prompt-only development produces bad software, while a separate 250-hour syntaqlite build said the durable version arrived only after a Python-to-Rust rewrite. Practitioners say specs, tests, linters, repo skills, and codebase context are the controls that keep coding agents maintainable.
Builders shipped a direct Claude Code harness and a ClawHub marketplace skill for OpenClaw workflows. Use these routes to wire agent tooling into OpenClaw, but watch Claude API limits and token burn costs.
Anthropic’s Apr. 4 cutoff for using Claude subscriptions through OpenClaw-class harnesses went live. Users report API-billing fallbacks, ACP workarounds, and restored Claude Code quota, while edge cases around claude -p and Agent SDK use remain unsettled. The change pushes heavy agent loops toward metered access.
Hermes Agent added direct /claude-code orchestration and cron-time script hooks, and the team also shipped Hermes-focused datasets and agent-tuned model variants. The update turns Hermes into a harness that can steer Claude Code and inject recurring context automatically.
Imbue published a walkthrough for mngr showing how it turns tutorial scripts into pytest cases, runs many agents in parallel, and merges fixes back into one branch. The case study offers a repeatable pattern for evaluating agent tools, so teams can borrow the tmux capture, artifact dashboards, and local-to-Modal handoff.
A Boris Cherny guide maps Claude Code mobile sessions, /teleport, /loop, hooks, worktrees, /batch, and custom agents into one workflow set. Use it to turn scattered commands into repeatable patterns for long-running coding sessions across terminal, desktop, and cloud.
OpenClaw 2026.3.28 exposes messaging and event handling as nine MCP tools, adds Responses API support, and lets plugins request permission during browser use. Use it to separate transport from agent logic so Claude Code, Codex, Cursor, and local harnesses can share the same account with less glue.
The ATLAS harness says a frozen Qwen3-14B Q4 model on one RTX 5060 Ti reached 74.6% pass@1-v(k=3) on LiveCodeBench v5 through multi-pass repair and selection. The result shifts comparison toward harness design, though HN commenters note it is not a one-shot head-to-head with hosted frontier models.
Cline launched Kanban, a local multi-agent board that runs Claude, Codex, and Cline CLI tasks in isolated worktrees with dependency chains and diffs. Teams can use it as a visual control layer for parallel coding agents on repo chores that split cleanly.
OpenAI rolled out Codex plugins across the app, CLI, and IDE extensions, with app auth, reusable skills, and optional MCP servers. Teams should test plugin-backed workflows and permission models before broad rollout.
Imbue released Latchkey, a library that is prepended to ordinary curl calls so local agents can use SaaS and internal APIs while credentials stay on the developer machine. Try it where agents need many HTTP integrations but should not see raw secrets.
OpenCode is adding remote sandboxes, synced state across laptop, server, and cloud, and more product surface inside its plugin system. That makes long-running off-laptop workflows more practical, but operators should still review telemetry, sandbox, and exposure defaults.
Claude Code 2.1.84 adds an opt-in PowerShell tool, new task and worktree hooks, safer MCP limits, and better startup and prompt-cache behavior. Anthropic also documented auto mode’s action classifier and added iMessage as a channel, so teams should review permissions and remote-control workflows.
Expect wraps browser QA for Claude Code, Codex, or Cursor into a CLI that records bug videos and feeds failures back into a fix loop. It gives coding agents a tighter UI validation cycle without requiring a custom browser harness.
OpenClaw 2026.3.24 adds native Microsoft Teams, OpenWebUI sub-agent access, Slack reply buttons, and a control surface for skills and tools. The release expands where the runtime can plug into enterprise workflows, while also increasing the surface area teams need to secure.
Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
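As a rough illustration of the n-gram and inverted-index idea in the Instant Grep item, the sketch below builds a toy trigram index and uses it to narrow the set of files before running the real regex. It is not Cursor's implementation: the class name `TrigramIndex`, its methods, and the choice to have the caller pass a required literal are all assumptions made for this example, and Bloom filters are left out entirely.

```python
import re
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return every 3-character substring of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy inverted index (hypothetical): trigram -> ids of documents containing it."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}
        self.postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        """Store the document and record each of its trigrams in the postings."""
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def candidates(self, literal: str) -> set[str]:
        """Documents containing every trigram of a literal the match must include."""
        grams = trigrams(literal)
        if not grams:
            # Literal shorter than 3 characters: the index cannot narrow anything.
            return set(self.docs)
        return set.intersection(*(self.postings.get(g, set()) for g in grams))

    def search(self, pattern: str, literal: str) -> list[str]:
        """Run the regex only on candidate documents; return matching ids, sorted."""
        regex = re.compile(pattern)
        return sorted(d for d in self.candidates(literal) if regex.search(self.docs[d]))


index = TrigramIndex()
index.add("a.py", "def parse_config(path): return load(path)")
index.add("b.py", "def render(template): return template.format()")
hits = index.search(r"parse_\w+\(", literal="parse_")
```

Here the index only produces candidates; the regex still decides the final matches. The saving comes from skipping files that cannot contain the literal at all.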
OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
OpenHands introduced EvoClaw, a benchmark that reconstructs milestone DAGs from repo history to test continuous software evolution instead of isolated tasks. The first results show agents can clear single tasks yet still collapse under regressions and technical debt over longer runs.
Agent Computer launched cloud desktops that boot in under half a second and expose persistent disks, shared credentials, SSH access, and ACP control for agents. It gives coding agents a faster place to run tools and reuse auth, but teams still need to design safe session and credential boundaries.
Vals AI switched SWE-Bench Verified from SWE-Agent to the bash-only mini-swe-agent harness, aligning results more closely with the official benchmark setup. Top score dipped slightly to 78.8%, but the change reduces harness-specific confounds when comparing models.
LangChain published a free course on taking agents from first run to production-ready systems with LangSmith loops for observability and evals. The timing lines up with new NVIDIA integration messaging, so teams can study process and stack choices together.
Agent Flywheel lays out a planning-first workflow built on beads, agent mail, swarms, and TUI inspection for very large coding runs. It is useful because the guide exposes coordination primitives and review loops, not just benchmark screenshots.
A developer says an autoresearch loop hill-climbed a vibecoded Rust engine to 2718 Elo after running more than 70 experiments under a 500 ms move budget. The real takeaway is the workflow: automated experiment loops can optimize code against a measurable target.
Conductor now bundles plan mode, fast mode, skills, repo quick start, and an experimental merge-conflict UI around Codex sessions. Try it if you want a higher-level harness for long-running code agents, but watch the foreground chat UX on larger tasks.
ACE open-sources a platform that turns AGENTS.md instructions into evolving playbooks backed by execution history, with hosted and self-hosted options. It is a notable response to prompt drift and prompt extraction, because procedures become revisable operating docs instead of static prompts.
OpenHands published a skill-eval recipe with bounded tasks, deterministic verifiers, and no-skill baselines, then showed some skills speed agents up while others make them brittle. Teams shipping skill libraries should measure them per task and model before rollout.
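The shape of such a skill evaluation can be sketched in a few lines. This is a minimal sketch, not the OpenHands recipe: `Task`, `pass_rate`, and `skill_delta` are hypothetical names, the agent is assumed to be a callable that takes a prompt and a flag for whether the skill is enabled, and only pass rate is measured (not speed).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    """A bounded task paired with a deterministic verifier (hypothetical type)."""
    name: str
    prompt: str
    verify: Callable[[str], bool]


# Assumed agent interface: (prompt, skill_enabled) -> output text.
Agent = Callable[[str, bool], str]


def pass_rate(agent: Agent, tasks: Sequence[Task], skill_enabled: bool) -> float:
    """Fraction of tasks whose output passes the task's verifier."""
    if not tasks:
        return 0.0
    passed = sum(task.verify(agent(task.prompt, skill_enabled)) for task in tasks)
    return passed / len(tasks)


def skill_delta(agent: Agent, tasks: Sequence[Task]) -> dict[str, float]:
    """Compare the same tasks with the skill against a no-skill baseline."""
    baseline = pass_rate(agent, tasks, skill_enabled=False)
    with_skill = pass_rate(agent, tasks, skill_enabled=True)
    return {"baseline": baseline, "with_skill": with_skill, "delta": with_skill - baseline}
```

A negative `delta` on a task set is the signal that a skill makes the agent more brittle there, which is why the comparison is run per task set and per model and not reported once globally.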
Imbue open-sourced Offload, a Rust CLI that spreads test suites across local or Modal sandboxes from one TOML config. It is useful when agent-heavy teams are bottlenecked on verification instead of generation, especially in browser or CI-heavy stacks.
Cognition updated Devin so one session can break down large work and delegate subtasks to worker Devins running in separate VMs. It matters for audits, migrations, and QA runs where one long-context agent is slower than explicit parallelism.
Morph released FlashCompact, a specialized compaction model and SDK for coding agents, claiming 33k tokens per second and near-invisible long-context compression. Use it or copy the approach if compaction latency and noisy tool output are blocking longer agent runs.
OpenAI rolled out native subagents in Codex so a main agent can spawn specialized parallel threads and return results to one session. Try it for larger code reviews and feature builds where you want to split work without polluting the main context.
Factory released an analytics layer for teams deploying coding agents, surfacing usage, tool calls, activity, and productivity from tokens through pull requests. Use it if you need ROI, readiness, and cost visibility as agent adoption scales.
Hyperbrowser open-sourced HyperSkill, which reads live documentation and emits a structured SKILL.md file or graph an agent can navigate. Try it to replace hand-written tool instructions with generated skill trees you can drop into an agent project.
OpenClaw-RL released a fully asynchronous online training stack that turns live interaction feedback into ongoing agent updates with binary rewards and token-level OPD corrections. Use it as a starting point for online agent improvement only if you can score rollouts reliably and manage privacy risk.
OpenAI published runtime details for the Responses API computer environment, including shell loops, capped output, automatic compaction, proxied outbound traffic, and reusable skills folders. Use it as a reference architecture for hosted agents that need state, safety controls, and tool execution patterns.
Terminal-Bench maintainers said they independently verified cheating claims and removed OpenBlocks from the 2.0 leaderboard. Audit submission artifacts and harness details before relying on public coding-agent rankings.
The OpenClaw-RL paper proposes training agents continuously from normal interactions by turning user corrections, logs, and next-state feedback into rewards and token-level supervision. Watch it if you build persistent agents and want adaptation to come from live deployment traces instead of offline labeling.
Cursor published its internal benchmarking approach and reported wider separation between coding models than SWE-bench-style leaderboards show. Use it as a reference for production routing decisions, but validate results against your own online traffic and task mix.
Andrej Karpathy open-sourced autoresearch, a minimal agent loop for automated ML research, and reported roughly 20 additive changes that reduced nanochat’s Time to GPT-2 from 2.02 hours to 1.80 hours. Research teams can use it as a concrete recipe for closed-loop experimentation on any metric with cheap proxy evaluations.
OpenAI documented a new response field that separates in-progress commentary from terminal answers in GPT-5.4 turns, with guidance for replaying those messages in follow-up calls. Agent builders can stream status updates without mixing them into final model output.