AI Primer

Codex users report one-shot bug fixes, 10-hour runs, and lower token burn a day after GPT-5.5 launch

A day after GPT-5.5 and the new Codex workflows launched, developers reported one-shot bug fixes, longer unattended runs, and lower token use in real coding tasks. The early hands-on comparisons matter because they are already shifting some teams' default agent workflow away from Claude Code.


TL;DR

You can read the official GPT-5.5 launch post, skim the Codex docs, check Simon Willison's notes on the GPT-5.5 prompting guide, and compare that launch energy with Anthropic's Claude Code postmortem. There is also a live community thread on Hacker News, plus a small but telling detail in Codex's model list diff where GPT-5.4 stopped being labeled the latest frontier model a day before launch.

What shipped in Codex

OpenAI spent the week turning Codex into more of a desktop workbench than a coding-only shell. reach_vb's shipped-this-week list condensed the launch into six items, while gdb's launch-week post described the same package as more computer work, more memory, and more ongoing independent work.

The concrete pieces surfaced across the docs and launch posts.

That packaging matters because most of the hands-on praise was about model plus harness, not raw eval numbers. daniel_mac8's model-plus-harness post said it plainly: model plus harness beats either one alone.

Long unattended runs

The weirdest early signal was not one-shot demos. It was how many people said GPT-5.5 just kept going.

The reports came in different shapes, but they rhymed: long unattended runs, in some cases around ten hours, with little or no babysitting.

That lines up with OpenAI's own launch framing. gdb's launch-week post said Codex could now run more ongoing work independently, and Simon Willison highlighted OpenAI's recommendation that long tasks should emit short visible progress updates in the prompting guide. The product behavior and the prompting advice appear to be moving together.

Bug fixing and verification loops

The most concrete early wins were in debugging, especially when the bug spanned multiple files or required repeated verification.

A few examples stood out:

  • bridgemindai's side-by-side bug fix showed GPT-5.5 identifying a specific missing branch in StripeService.hasActiveProSubscription, while Claude Opus 4.7 produced three incorrect hypotheses.
  • haider1's backend test said GPT-5.5 traced webhook handling, order status updates, retry logic, and database writes across a messy production-style backend without missing side effects.
  • Hangsiin's optimization bug post said GPT-5.5 eventually solved a month-old game optimization issue by running a verification loop after GPT-5.4 and GPT-5.3-codex had failed.
  • bridgemindai's workflow screenshot showed Codex verifying fixes with npm run typecheck, narrowing search scope, and exposing /skills in the prompt surface.

This is where the launch starts to feel more like a harness story than a benchmark story. OpenAI's Codex docs emphasize tools, skills, and computer use, and OpenAIDevs' Ramp auto-review example pitched auto-review as a way to keep tests and builds moving with fewer approvals.
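The verify-then-retry pattern in those screenshots is easy to sketch: apply a candidate fix, run the project's check (e.g. `npm run typecheck`), and feed the errors back into the next attempt. This is an illustration of the pattern under stated assumptions, not Codex's actual implementation; `propose_fix` and `run_check` are hypothetical stand-ins for the model call and the project check.

```python
# Hedged sketch of an agent-style verification loop. Nothing here is
# Codex internals; it only illustrates the propose/verify/retry shape.
import subprocess


def npm_typecheck(cwd="."):
    """One possible run_check: invoke the repo's typecheck script."""
    result = subprocess.run(
        ["npm", "run", "typecheck"],
        cwd=cwd, capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


def fix_until_green(propose_fix, run_check, max_attempts=5):
    """Alternate propose/verify until the check passes; return attempt count."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        propose_fix(feedback)       # the model edits files, seeing prior errors
        ok, feedback = run_check()  # e.g. npm_typecheck
        if ok:
            return attempt
    raise RuntimeError(f"check still failing after {max_attempts} attempts")
```

The loop also explains the "narrowing search scope" behavior: each retry sees the previous check's error output, so later attempts can target only the files that still fail.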

Token burn and the price trade

GPT-5.5 did not launch as the cheaper model. The pricing screenshots circulating before and during launch put it at $5 per million input tokens and $30 per million output tokens, versus $2.50 and $15 for GPT-5.4, as shown in rohanpaul_ai's pricing screenshot.

The counterargument was efficiency: early users reported lower token burn per completed task, which can offset the higher list price.

The official launch post says GPT-5.5 is coming to the API, and Simon Willison's llm 0.31 note shows third-party tooling already adding gpt-5.5 model support and new verbosity controls. So the migration question is already shifting from list price to effective task cost.
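The list-price-versus-effective-cost argument is simple arithmetic. A minimal sketch, using the prices from the circulating screenshots ($5/$30 per million tokens for GPT-5.5 versus $2.50/$15 for GPT-5.4); the token counts are illustrative assumptions, not measured numbers:

```python
# Effective task cost: a pricier model can still be cheaper per task
# if it finishes in fewer tokens and fewer retries.

def task_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars for one task; prices are dollars per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Illustrative only: a one-shot fix versus a run that needed retries.
gpt55 = task_cost(400_000, 60_000, 5.00, 30.00)   # one-shot
gpt54 = task_cost(900_000, 180_000, 2.50, 15.00)  # with retries

print(f"GPT-5.5: ${gpt55:.2f}  GPT-5.4: ${gpt54:.2f}")
# GPT-5.5: $3.80  GPT-5.4: $4.95
```

Under these assumed token counts the model with double the list price comes out cheaper per completed task, which is exactly the trade the efficiency argument is making.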

Why the switches away from Claude Code are happening now

A lot of the most striking GPT-5.5 praise came paired with an explicit comparison target: Claude Code.

The reasons were not identical, but they clustered around reliability, quota limits, and token waste.

The timing helps explain the tone. Anthropic's postmortem, amplified in Anthropic postmortem discussion on HN, said recent quality complaints were real and traced to harness issues, including a bug that repeatedly cleared older thinking from stale sessions. Separate HN threads like HN quota thread and HN pricing-test thread collected complaints about quota exhaustion, token waste, and pricing uncertainty. GPT-5.5 landed straight into that opening.

The harness is spreading beyond code

The last useful reveal is how quickly users started treating Codex as a general desktop agent instead of a code editor with extras.

Examples from the first day already ranged well beyond code.

That is probably the most useful frame for the launch day reaction. Developers were not mostly posting benchmark charts. They were posting workflow collapse: fewer approvals, longer runs, lower token burn, and more tasks that now fit inside one agent surface.
