Claude Code users report auto mode, dynamic workflows, and critique loops finding 144 bugs
Practitioners shared repeatable setups for multi-hour Claude runs using auto approvals, dynamic workflows, cloud sessions, and critique loops. One large-codebase sweep reported 144 bugs fixed in about four hours with fewer false positives under model critique.

TL;DR
- bcherny's tip thread boiled long-running Claude Code sessions down to five levers: auto approvals, dynamic workflows, /goal or /loop, cloud sessions, and end-to-end self-verification.
- MParakhin's bug sweep said Claude Workflows found and fixed 144 bugs across a large codebase, and MParakhin's timing reply put the run at about four hours.
- False positives dropped sharply when the run used model critique, according to MParakhin's critique-loop numbers, which claimed about 20 false positives without critique versus 1 out of 174 with it.
- The official docs back most of the emerging playbook: auto mode, dynamic workflows, goal mode, and scheduled tasks are all first-class Claude Code features.
- The rough edges are still visible. bridgemindai's ultracode complaint warned that 100-plus-agent runs can hallucinate and burn limits, while theo's SSH complaint said the remote terminal experience is still painful.
You can read Anthropic's own auto mode write-up, browse the dynamic workflows docs, and check the web session docs, which explicitly say sessions persist after you close the browser and can be monitored from the mobile app. The cost screen is also more granular than the thread made it sound: the cost docs say /usage breaks recent burn down by skills, subagents, and bundled commands. Even the small-print release notes matter here, because the v2.1.166 release added fallback models, hardened cross-session messaging, and a way to disable default-model thinking.
Auto mode
Anthropic's permission modes doc defines auto as the mode where "everything" can run with background safety checks, and positions it for long tasks and reduced prompt fatigue. Anthropic's engineering post frames it as a middle ground between manual approvals and --dangerously-skip-permissions, with one layer scanning tool outputs for prompt injection and another classifying tool calls before they execute.
That matches the field advice in bcherny's tip thread, which put auto mode first for multi-hour runs because permission popups break flow. bcherny's skills reply adds a useful nuance: the point is not to micromanage every skill invocation, but to tell the model the outcome you want and let it pick the right skills.
Dynamic workflows
Anthropic's workflows doc describes dynamic workflows as JavaScript scripts that Claude writes, then runs in the background to orchestrate subagents at scale while the main session stays responsive. The official examples line up almost exactly with what users are doing in public: codebase-wide bug sweeps, large migrations, and cross-checking work across multiple agents.
The public playbook is already pretty concrete:
- bcherny's workflow reply says the user prompt can be as simple as telling Claude to "use a workflow."
- bcherny's use-case list names six strong fits: complex feature builds, language migrations, framework migrations, repeated profiling against memory or CPU targets, flaky-test cleanup, and CI profiling.
- aibuilderclub_'s subagent note says subagents can now spawn their own subagents up to five levels deep, turning a run into an agent tree.
The showpiece number came from MParakhin's bug sweep, which said Claude Workflows found and fixed 144 bugs across a whole codebase. MParakhin's timing reply said that took about four hours.
Critique loops
The strongest repeated idea across the evidence is not raw agent count. It is verification.
bcherny said the most important ingredient in very long runs is self-verification, ideally with a workflow that tests the result end to end in a browser and looks for edge cases and UI issues. bcherny's main thread extends that beyond web apps, naming Chrome MCP for web checks, mobile simulators for iOS and Android, and the ability to start the full server stack for backend work.
MParakhin's critique-loop numbers quantify the pattern:
- 174 total findings after additional bugs were counted
- about 20 false positives without a critique loop
- 1 false positive out of 174 with critique across several models
A follow-up in MParakhin's critique-loop reply says the critique loop was still necessary even with the strongest model. Another in MParakhin's critique-loop confirmation says it was being run on everything.
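The shape of a critique loop can be sketched in a few lines. This is a minimal illustration, not MParakhin's actual setup: the `Finding` type, the `critique_filter` function, the voting threshold, and the two stand-in critics are all hypothetical. In a real run each critic would be a model call rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    """One candidate bug reported by a sweep (hypothetical structure)."""
    file: str
    description: str


# A critic is any callable that returns True when it believes the finding is real.
Critic = Callable[[Finding], bool]


def critique_filter(
    findings: Iterable[Finding], critics: list[Critic], min_votes: int
) -> list[Finding]:
    """Keep only findings that at least `min_votes` critics confirm."""
    kept = []
    for finding in findings:
        votes = sum(1 for critic in critics if critic(finding))
        if votes >= min_votes:
            kept.append(finding)
    return kept


# Stand-in critics; real ones would ask one or more models to review the finding.
def mentions_reproduction(finding: Finding) -> bool:
    return "repro" in finding.description.lower()


def not_style_only(finding: Finding) -> bool:
    return "style" not in finding.description.lower()


findings = [
    Finding("auth.py", "Token refresh race, repro attached"),
    Finding("ui.py", "Style nit: inconsistent quotes"),
]
confirmed = critique_filter(
    findings, [mentions_reproduction, not_style_only], min_votes=2
)
print([f.file for f in confirmed])
```

For scale, the reported 1 false positive in 174 findings works out to roughly 0.6 percent.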
Keeping runs alive
The thread's third and fourth tips map neatly onto official product surfaces that are easy to miss if you only think of Claude Code as a local terminal app.
Anthropic's goal docs say /goal keeps Claude working across turns until a model confirms a completion condition. Anthropic's scheduled-task docs say /loop reruns a prompt on an interval, or self-paces if no interval is provided, and explicitly pitch it for polling deploys and babysitting long-running builds.
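The /goal behavior described above amounts to "keep taking turns until a checker confirms the condition." The sketch below illustrates only that control flow and is not how Claude Code implements it. The names `run_until_goal`, `step`, and `is_done` are hypothetical, and the `max_turns` cap is an assumption added so the loop always terminates.

```python
from typing import Callable


def run_until_goal(
    step: Callable[[], str],
    is_done: Callable[[str], bool],
    max_turns: int,
) -> tuple[bool, int]:
    """Call step() repeatedly until is_done(result) is True or max_turns is hit.

    Returns (goal_reached, turns_used).
    """
    for turn in range(1, max_turns + 1):
        result = step()
        if is_done(result):
            return True, turn
    return False, max_turns


# Toy "work": each turn fixes one failing test (stand-in for a model turn).
failing = [3]


def fix_one_test() -> str:
    failing[0] = max(0, failing[0] - 1)
    return f"{failing[0]} failing"


# Toy completion check (stand-in for a model confirming the condition).
done, turns = run_until_goal(
    fix_one_test, lambda report: report.startswith("0 "), max_turns=10
)
print(done, turns)
```

A /loop-style run differs mainly in what triggers the next turn: an interval or self-pacing rather than an unmet completion condition.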
For compute that outlives your laptop, Anthropic's web docs say cloud sessions run on Anthropic-managed infrastructure, persist after you close the browser or laptop, and can be monitored from the mobile app. Anthropic's routines docs take the same idea further with scheduled, API-triggered, and GitHub-triggered runs in the cloud.
One small operational detail from bcherny's /usage reply is easy to overlook: /usage can break recent burn down by skills, MCPs, and plugins. The official cost docs say the same screen attributes plan usage to skills, subagents, and bundled commands.
What breaks first
The failure mode in the public feedback is not subtle. More agents can mean more bugs, more burn, or both.
bridgemindai's ultracode complaint said runs with more than 100 Opus 4.8 agents hallucinated, introduced bugs, and mostly one-shotted usage limits. Two follow-ups from bridgemindai reinforced that the useful cases felt limited.
SSH is the other obvious sore spot. theo's SSH complaint called Claude Code "insane" over SSH, and theo's SSH details listed recurring auth problems, awkward image pasting, broken scrolling in fullscreen mode, noisy effort coloring, and version-number-based pane naming.
The same cost tradeoff shows up from the other direction. aibuilderclub_'s subagent note warned users to watch token burn on deep agent trees, and the cost docs note that /usage exists precisely because these sessions can consume a lot of model work very quickly.
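A quick calculation shows why deep agent trees deserve that caution. It rests on a simplifying assumption that is not from the source: every agent spawns the same number of subagents at every level, which makes this an upper bound rather than a typical case. The function name `max_agents` is hypothetical.

```python
def max_agents(branching: int, depth: int) -> int:
    """Total agents in a full tree: the top-level session plus `depth` levels
    of subagents, with each agent spawning `branching` children."""
    return sum(branching ** level for level in range(depth + 1))


# Five levels of subagents below the main session:
print(max_agents(2, 5))  # 1 + 2 + 4 + 8 + 16 + 32 = 63
print(max_agents(3, 5))  # 1 + 3 + 9 + 27 + 81 + 243 = 364
```

Going from two to three children per agent multiplies the worst-case agent count by almost six at the same depth.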
Reliability patches
The final tell is the shipping cadence. While people were pushing Claude Code into four-hour bug sweeps and 100-agent runs, Anthropic was also landing stability fixes almost daily.
According to the v2.1.166 release, the CLI added up to three fallbackModel options when the primary model is overloaded, hardened cross-session messaging so relayed sessions cannot carry user authority for permission requests, and let users disable default-model thinking with MAX_THINKING_TOKENS=0 or --thinking disabled. The same release also fixed stuck remote sessions, JetBrains terminal flicker, a recurring image-processing error, and several background-agent bugs, all issues that matter more once sessions run for hours.
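The fallback behavior can be pictured as an ordered retry chain. The sketch below is an illustration under assumptions, not the CLI's implementation: the release notes only say that up to three fallbackModel options are supported, so the `ModelOverloaded` exception, the `send` callable, and the model names here are all hypothetical.

```python
from typing import Callable, Sequence


class ModelOverloaded(Exception):
    """Raised by the (hypothetical) transport when a model cannot take the request."""


def call_with_fallbacks(
    prompt: str,
    primary: str,
    fallbacks: Sequence[str],
    send: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try the primary model, then each fallback in order.

    Returns (model_used, response). Raises ModelOverloaded if every model fails.
    """
    if len(fallbacks) > 3:
        raise ValueError("at most three fallback models are allowed")
    for model in [primary, *fallbacks]:
        try:
            return model, send(model, prompt)
        except ModelOverloaded:
            continue  # this model is overloaded, move to the next one
    raise ModelOverloaded("primary and all fallback models are overloaded")


# Fake transport for the demo: only the primary is overloaded.
def fake_send(model: str, prompt: str) -> str:
    if model == "primary-model":
        raise ModelOverloaded(model)
    return f"{model}: ok"


result = call_with_fallbacks(
    "run the sweep", "primary-model", ["fallback-a", "fallback-b"], fake_send
)
print(result)
```

For a multi-hour run, the point is that a transient overload on one model degrades the session instead of ending it.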
That patch train kept moving. ClaudeCodeLog's 2.1.167 release and ClaudeCodeLog's 2.1.168 release were both summarized as bug-fix and reliability releases, landing within about a day of each other, while claudeai's usage-limit increase said Anthropic had doubled Claude Cowork usage limits for a month to support bigger delegated tasks.