Claude Code supports /goal refactor loops with e2e tests and autoreview
Practitioner threads showed Claude Code /goal refactors running until an evaluator marked them done, with live testing and autoreview checkpoints in the loop. The pattern turns long repo cleanup into trackable agent runs, though today’s evidence is user-led rather than a fresh Anthropic release.

TL;DR
- Anthropic's own Best practices for Claude Code says /goal works by re-checking a verifiable condition after every turn, and aakashgupta's mechanism thread describes the same split between a worker model and a separate evaluator.
- In steipete's refactor prompt, the finish line is not "clean this up" but a tracked refactor loop with live tests, autoreview, commits, and progress written to /tmp/refactor-{projectname}.md.
- steipete's follow-up adds the part that makes the loop production-shaped: e2e checks can include computer use, browser use, keys, and crabbox so the agent verifies the real workflow before landing code.
- The surrounding workflow is increasingly multi-agent. steipete's orchestrator thread describes a maintainer loop that wakes up, routes work to threads, and combines orchestration with triage, autoreview, and computer-use skills, while steipete's release reply says three agents were already working one release in parallel.
- This story is about practitioner technique, not a fresh Anthropic launch: aakashgupta's guide post and steipete's crabbox update show people turning /goal into long-running maintenance and refactor runs right now.
You can read Anthropic's verification-first guidance, skim Aakash Gupta's public /goal breakdown, browse steipete's maintainer-orchestrator skill, and trace the cloud execution layer in crabbox. Linear is already packaging a similar pattern in coding sessions for Linear Agent.
The /goal contract
Anthropic frames the core idea plainly in its best-practices post: give Claude a check it can run, then let the loop continue until the check passes. The useful detail is that /goal is only one layer in that stack. Anthropic also points to harder gates like Stop hooks, which can block a turn from ending until a script passes.
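A minimal sketch of what such a gate script could look like. It assumes the hook contract as described in Claude Code's hooks documentation (a JSON payload on stdin, a stop_hook_active field, and exit code 2 meaning "block and show stderr to the agent"); check the current docs before relying on any of it. CHECK_CMD, run_check, and decide are hypothetical names chosen for illustration.

```python
import json
import subprocess
import sys
from typing import Callable, Tuple

# Hypothetical check command; swap in the project's real test runner.
CHECK_CMD = ["pytest", "-q"]


def run_check() -> Tuple[bool, str]:
    """Run the check command and return (passed, combined output)."""
    proc = subprocess.run(CHECK_CMD, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def decide(payload: dict, check: Callable[[], Tuple[bool, str]]) -> Tuple[int, str]:
    """Return (exit_code, message) for the hook without touching stdin or exiting."""
    # Assumed field: true when the agent is already continuing because of an
    # earlier block, so the gate lets it stop instead of looping forever.
    if payload.get("stop_hook_active"):
        return 0, ""
    passed, output = check()
    if passed:
        return 0, ""
    # Assumed convention: exit code 2 blocks the stop and stderr goes to the agent.
    return 2, "Checks failed; keep working:\n" + output[-2000:]


def main() -> None:
    """Entry point when the file is installed as a hook command."""
    payload = json.load(sys.stdin)
    code, message = decide(payload, run_check)
    if message:
        print(message, file=sys.stderr)
    sys.exit(code)

# When installed as a hook script, end the file with a call to main().
```

Keeping the decision in decide, separate from stdin handling, makes the gate easy to unit-test with a fake check before wiring it into a real session.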
Aakash Gupta's public guide puts the architecture in the terms people actually use in the terminal: the worker keeps taking turns, and a cheaper evaluator model checks whether the completion condition was met. His other key point is narrower, and more important than the feature itself: the bottleneck has moved from prompting to writing acceptance criteria.
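The worker/evaluator split can be sketched as a plain loop. This is an illustration of the pattern only, not Claude Code's implementation: run_goal, GoalRun, and the toy worker and evaluator are hypothetical stand-ins for model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GoalRun:
    """Tracks one goal: a single completion condition and the turns taken."""
    condition: str
    turns: List[str] = field(default_factory=list)
    done: bool = False


def run_goal(
    condition: str,
    worker: Callable[[str, List[str]], str],
    evaluator: Callable[[str, List[str]], bool],
    max_turns: int = 20,
) -> GoalRun:
    """Let the worker take turns until the evaluator accepts or turns run out."""
    run = GoalRun(condition=condition)
    for _ in range(max_turns):
        # The worker does one turn of work toward the condition.
        run.turns.append(worker(condition, run.turns))
        # The evaluator re-checks the same condition after every turn.
        if evaluator(condition, run.turns):
            run.done = True
            break
    return run


# Toy stand-ins: the "worker" fixes one failing test per turn.
failing = ["test_a", "test_b", "test_c"]


def toy_worker(condition: str, turns: List[str]) -> str:
    return f"fixed {failing.pop()}"


def toy_evaluator(condition: str, turns: List[str]) -> bool:
    return not failing


result = run_goal("all tests pass", toy_worker, toy_evaluator)
print(result.done, len(result.turns))  # True 3
```

The evaluator only ever sees the condition string it was given, which is why the quality of that condition matters more than the wording of the worker's prompt.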
That also explains why /goal locks onto one finish line. In aakashgupta's reply, he says multi-path work is better handled as sequential goals because the checker only grades the completion condition you wrote.
The refactor loop
steipete's prompt is a good snapshot of what people mean by loop engineering when the task is code cleanup instead of net-new code.
The loop has four concrete parts:
- Refactor until the architecture is satisfactory.
- Live test after each significant step.
- Run autoreview and commit as you go.
- Keep a running log in /tmp/refactor-{projectname}.md.
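The four parts above can be sketched as a small driver. This is a hedged illustration, not steipete's tooling: refactor_loop, note, and the callables passed in are hypothetical, the log line format is invented, and only the /tmp/refactor-{projectname}.md naming comes from the prompt.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

# A step is a name plus a callable that applies the change (hypothetical shape).
Step = Tuple[str, Callable[[], None]]


def log_path(project: str, base: Path = Path("/tmp")) -> Path:
    """Progress file named after the pattern in the prompt."""
    return base / f"refactor-{project}.md"


def note(path: Path, step: str, status: str) -> None:
    """Append one progress line so the run stays trackable."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"- {step}: {status}\n")


def refactor_loop(
    project: str,
    steps: List[Step],
    live_test: Callable[[], bool],
    autoreview: Callable[[], bool],
    commit: Callable[[str], None],
    base: Path = Path("/tmp"),
) -> bool:
    """Apply each step, then test, review, commit, and log; stop on first failure."""
    path = log_path(project, base)
    for name, apply_step in steps:
        apply_step()
        if not live_test():
            note(path, name, "live test failed")
            return False
        if not autoreview():
            note(path, name, "autoreview rejected")
            return False
        commit(name)
        note(path, name, "committed")
    return True


# Toy run in a temporary directory instead of the real /tmp.
commits: List[str] = []
with tempfile.TemporaryDirectory() as tmp:
    ok = refactor_loop(
        "demo",
        steps=[("extract parser", lambda: None), ("split cli module", lambda: None)],
        live_test=lambda: True,
        autoreview=lambda: True,
        commit=commits.append,
        base=Path(tmp),
    )
    log_text = log_path("demo", Path(tmp)).read_text(encoding="utf-8")

print(ok, commits)
print(log_text)
```

Committing only after both the live test and the review pass means every commit in the history is a state that cleared the gates, which keeps a long run easy to bisect.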
The follow-up replies make the verification bar sharper. steipete's e2e testing reply says live testing should cover computer use, browser use, keys, and crabbox when needed for full end-to-end verification, while steipete's test coverage reply reduces the requirement to a blunt rule: you need e2e test coverage.
Review still happens in old-school form. steipete's diff review reply says he checks diffs for structure, repetition, and file breakup, and steipete's reading the code reply answers the obvious follow-up with "Reading the code!"
The orchestrator stack
The more interesting reveal is that /goal is not being used as a solo command. It sits inside a wider control plane of skills, threads, and delegated workers.
The linked maintainer-orchestrator skill describes that role as a control plane that delegates repository work, keeps one thread per repo, monitors worker progress, and brings back decision-ready PRs. The tweet version is shorter: wake up every five minutes, direct work to threads, and let some work land autonomously.
A few details in the replies fill out the operating model:
- steipete on wake-up cost says the periodic wake-up loop is "fairly cheap," and can run hourly or daily if needed.
- steipete on PRs as tracking says he sometimes keeps the PR around as a kind of issue tracker.
- steipete on thread creation says the agent is making the threads because it needs the organization.
- steipete on per-project folders says each project lives in its own folder, and agents create the threads inside that structure.
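The operating model described above (periodic wake-up, one thread per repo, work routed into those threads) can be sketched roughly as follows. Every name here is hypothetical, including the repo names in the toy data; the actual maintainer-orchestrator skill is a prompt-level skill, not this code.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

WAKE_SECONDS = 5 * 60  # "every five minutes" from the tweet; hourly or daily also works


@dataclass(frozen=True)
class WorkItem:
    repo: str
    title: str


@dataclass
class Orchestrator:
    # One conversation thread per repo (not an OS thread), keyed by repo name.
    threads: Dict[str, List[WorkItem]] = field(default_factory=dict)

    def route(self, item: WorkItem) -> None:
        """Send the item to its repo's thread, creating the thread on first use."""
        self.threads.setdefault(item.repo, []).append(item)

    def wake(self, inbox: Iterable[WorkItem]) -> Dict[str, int]:
        """One wake-up: route everything new and report queue sizes per thread."""
        for item in inbox:
            self.route(item)
        return {repo: len(items) for repo, items in self.threads.items()}


def run(
    orch: Orchestrator,
    fetch_inbox: Callable[[], Iterable[WorkItem]],
    wakes: int,
    sleep: Callable[[float], None] = time.sleep,
    interval: float = WAKE_SECONDS,
) -> Dict[str, int]:
    """Wake a fixed number of times, sleeping between wake-ups."""
    summary: Dict[str, int] = {}
    for i in range(wakes):
        summary = orch.wake(fetch_inbox())
        if i < wakes - 1:
            sleep(interval)
    return summary


# Toy run with two batches of made-up work and no real sleeping.
batches = [
    [WorkItem("crabbox", "flaky e2e test"), WorkItem("docs", "broken link")],
    [WorkItem("crabbox", "review PR")],
]
orch = Orchestrator()
summary = run(orch, fetch_inbox=lambda: batches.pop(0), wakes=2, sleep=lambda s: None)
print(summary)  # {'crabbox': 2, 'docs': 1}
```

Passing sleep in as a parameter keeps the wake interval a policy choice, matching the point that the loop can run every few minutes, hourly, or daily.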
The workflow is already bleeding into team products. In linear's model disclosure, Linear said its new coding sessions start with Claude Code and Codex underneath the product layer.
Cloud runners
The final piece is where these loops run when they stop fitting on a laptop.
steipete says Codex had been looping for four days across multiple trees while building crabbox, and that the work was feasible because it was end-to-end verifiable. He also says the agent signs up for services automatically through browser and computer use, leaving him to handle credit cards and close the tasks he does not want.
That matches crabbox's own positioning in the repo docs: a remote testing and execution control plane for maintainers and AI agents that syncs a dirty checkout, runs commands remotely, streams output, and collects evidence. In steipete's cloud reply, he gives the human version of the value prop: fewer people walking around with MacBooks open just to keep agents alive in the cloud.