workflow · May 30, 2026

Codex supports 56-hour tasks as builders report passkey and browser failures

Codex users shared 56-hour task runs, PM-to-PR workflows, and a new black-box session recorder for tracking drift, token use, and incomplete responses. The longer autonomous sessions matter because browser auth gaps, passkey failures, and tool-selection bugs become real blockers once Codex is used beyond quick code generation.

TL;DR

Dan Shipper's usage snapshot shows Codex handling a 56-hour task, while Aakash Gupta's interview thread reports a separate 60-hour refactor that burned 350 million tokens.
According to Aakash Gupta's interview thread, one OpenAI team banned human-written code from a repo, absorbed a month of slowdown, and ended up with PM-written PRDs turning into shipped PRs by Friday.
Reliability gaps show up fast once sessions get long: jjpcodes' passkey complaint says the built-in browser breaks on sites that require passkeys, and thekitze's SVG mishap post says Codex skipped a native OpenAI skill and improvised the wrong tool.
Power users are already building extra scaffolding around it: the codex-blackbox Reddit post logs token use, incomplete responses, and regressions, while steipete's workflow note pairs Codex with autoreview and crabbox for multi-hour runs.

A lot of the interesting detail sits outside the splashy task-length screenshots. You can read Every's knowledge-work guide, inspect the codex-blackbox repo, and see the surrounding skill layer in the autoreview skill doc and crabbox.

56-hour runs

The 56-hour figure is not an isolated data point. steipete's workflow note says his own runs grew from roughly 30 to 60 minutes into 4-to-10-hour jobs after he layered in /goal, autoreview, and crabbox.

That lines up with Aakash Gupta's interview thread, which describes a 60-hour refactor where Ryan Lopopolo gave only two extra prompts across the run. The operating model is less chat assistant, more background worker that keeps going until the repo or the harness stops it.

Guardrails in the repo

The most useful detail in Gupta's reporting is the sequence of fixes that made long runs viable. The team reportedly spent months making the repository legible enough for the agent, then encoded house style into automated checks instead of relying on engineers to patch mistakes by hand.

The three phases in Gupta's thread are concrete:

Make the repo legible with docs, architecture decisions, and an agents.md file.
Encode team taste into CI lints and AI reviewer personas.
Expand who can ship, including PMs writing PRDs and designers running painted-door experiments.

That explains why the first month was slower. Gupta's follow-up thread says engineers were forced to turn each recurring failure into a permanent guardrail, even when typing the fix manually would have been much faster.
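The "turn each recurring failure into a permanent guardrail" step can be pictured as a small custom lint that CI runs on every change. The sketch below is illustrative only: the rule itself (no bare print() calls) and the function name find_print_calls are hypothetical, and nothing here is taken from Gupta's thread or from the OpenAI team's actual checks.

```python
import ast


def find_print_calls(source: str, filename: str = "<memory>") -> list[str]:
    """Return one lint message per bare print() call, ordered by line number.

    Hypothetical house-style rule: code should use a logger, not print().
    """
    tree = ast.parse(source, filename=filename)
    hits: list[tuple[int, str]] = []
    for node in ast.walk(tree):
        # Match calls whose callee is the plain name `print`.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            message = f"{filename}:{node.lineno}: use the project logger instead of print()"
            hits.append((node.lineno, message))
    # ast.walk does not guarantee source order, so sort by line number.
    return [message for _, message in sorted(hits)]
```

A CI job could run a check like this over changed files and fail the build when the returned list is non-empty, so the agent sees the same feedback on every run instead of waiting for a human to patch the mistake.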

Browser and tool failures

Long autonomous sessions make boring product gaps feel huge. jjpcodes' passkey complaint says the built-in browser falls over on sites that require passkeys, which turns ordinary auth flows into hard blockers.

Tool selection looks shaky too. In thekitze's SVG mishap post, Codex ignored an OpenAI-native skill and improvised with SVG instead. steipete's debugging note adds a second pattern: the same model may happily declare code bug-free until you explicitly tell it a bug exists, at which point it keeps digging and starts surfacing issues.

That cluster of complaints is why thekitze's browser reply bluntly says to avoid the Codex browser altogether. The raw task length is impressive, but the fragile bits are concentrated around navigation, auth, and choosing the right built-in tool.

The sidecar tools around Codex

A small tooling ecosystem is forming around the model's weak spots. The codex-blackbox Reddit post pitches a live session recorder for Codex CLI, Codex UI, and OpenClaw sessions, specifically to track model changes, incomplete responses, token use, and regressions after updates.
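To make the idea of a session recorder concrete, here is a minimal sketch of the kind of tally such a tool might keep. It is not codex-blackbox's implementation or log format: the event fields (model, tokens, complete) and the names SessionSummary and summarize are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SessionSummary:
    total_tokens: int = 0
    incomplete_responses: int = 0
    models_seen: list[str] = field(default_factory=list)


def summarize(events: list[dict]) -> SessionSummary:
    """Aggregate hypothetical per-response events into one session summary."""
    summary = SessionSummary()
    for event in events:
        # Missing fields fall back to "no tokens" and "response completed".
        summary.total_tokens += int(event.get("tokens", 0))
        if not event.get("complete", True):
            summary.incomplete_responses += 1
        model = event.get("model")
        # Keep first-seen order so a mid-session model change is visible.
        if model and model not in summary.models_seen:
            summary.models_seen.append(model)
    return summary
```

With a summary like this, more than one entry in models_seen would flag a model change during the session, and comparing incomplete_responses across sessions is one simple way to spot a regression after an update.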

Other users are building structure above the agent rather than inside it. Dan Shipper's thread setup describes separate pulse, log, inbox, and router threads for recurring knowledge-work jobs, and Amir Mushich's branded video demo shows Codex driving a product video through custom BrandSkill.md and MotionSkill.md files.

Those add-ons all point the same way: Codex is already being used as a long-running production system, but the workflows getting shared most often are wrappers, recorders, and skills that make its behavior easier to steer and inspect.
