AI Primer

Codex users report one-shot bug fixes, 10-hour runs, and lower token burn a day after GPT-5.5 launch

A day after GPT-5.5 and the new Codex workflows launched, developers reported one-shot bug fixes, longer unattended runs, and lower token use in real coding tasks. The early hands-on comparisons matter because they are already shifting some teams' default agent workflow away from Claude Code.


TL;DR

You can read the official GPT-5.5 launch post, skim the Codex docs, check Simon Willison's notes on the GPT-5.5 prompting guide, and compare that launch energy with Anthropic's Claude Code postmortem. There is also a live community thread on Hacker News, plus a small but telling detail in Codex's model list diff where GPT-5.4 stopped being labeled the latest frontier model a day before launch.

What shipped in Codex

OpenAI spent the week turning Codex into more of a desktop workbench than a coding-only shell. reach_vb's shipped-this-week list condensed the launch into six items, while gdb's launch-week post described the same package as more computer work, more memory, and more ongoing independent work.

The concrete pieces surfaced across the docs and launch posts.

That packaging matters because most of the hands-on praise was about model plus harness, not raw eval numbers. daniel_mac8's model-plus-harness post said it plainly: model plus harness beats either one alone.

Long unattended runs

The weirdest early signal was not one-shot demos. It was how many people said GPT-5.5 just kept going.

The reports came in different shapes, but they rhymed: long unattended runs, in some cases around ten hours, with little or no babysitting.

That lines up with OpenAI's own launch framing. gdb's launch-week post said Codex could now run more ongoing work independently, and Simon Willison highlighted OpenAI's recommendation that long tasks should emit short visible progress updates in the prompting guide. The product behavior and the prompting advice appear to be moving together.

Bug fixing and verification loops

The most concrete early wins were in debugging, especially when the bug spanned multiple files or required repeated verification.

A few examples stood out:

  • bridgemindai's side-by-side bug fix showed GPT-5.5 identifying a specific missing branch in StripeService.hasActiveProSubscription, while Claude Opus 4.7 produced three incorrect hypotheses.
  • haider1's backend test said GPT-5.5 traced webhook handling, order status updates, retry logic, and database writes across a messy production-style backend without missing side effects.
  • Hangsiin's optimization bug post said GPT-5.5 eventually solved a month-old game optimization issue by running a verification loop after GPT-5.4 and GPT-5.3-codex had failed.
  • bridgemindai's workflow screenshot showed Codex verifying fixes with npm run typecheck, narrowing search scope, and exposing /skills in the prompt surface.

This is where the launch starts to feel more like a harness story than a benchmark story. OpenAI's Codex docs emphasize tools, skills, and computer use, and OpenAIDevs' Ramp auto-review example pitched auto-review as a way to keep tests and builds moving with fewer approvals.
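The verify-then-retry pattern in those screenshots is easy to sketch: apply a candidate fix, run the project's check (e.g. `npm run typecheck`), and feed the errors back into the next attempt. This is an illustration of the pattern under stated assumptions, not Codex's actual implementation; `propose_fix` and `run_check` are hypothetical stand-ins for the model call and the project check.

```python
# Hedged sketch of an agent-style verification loop. Nothing here is
# Codex internals; it only illustrates the propose/verify/retry shape.
import subprocess


def npm_typecheck(cwd="."):
    """One possible run_check: invoke the repo's typecheck script."""
    result = subprocess.run(
        ["npm", "run", "typecheck"],
        cwd=cwd, capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


def fix_until_green(propose_fix, run_check, max_attempts=5):
    """Alternate propose/verify until the check passes; return attempt count."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        propose_fix(feedback)       # the model edits files, seeing prior errors
        ok, feedback = run_check()  # e.g. npm_typecheck
        if ok:
            return attempt
    raise RuntimeError(f"check still failing after {max_attempts} attempts")
```

The loop also explains the "narrowing search scope" behavior: each retry sees the previous check's error output, so later attempts can target only the files that still fail.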

Token burn and the price trade

GPT-5.5 did not launch as the cheaper model. The pricing screenshots circulating before and during launch put it at $5 per million input tokens and $30 per million output tokens, versus $2.50 and $15 for GPT-5.4, as shown in rohanpaul_ai's pricing screenshot.

The counterargument was efficiency: early users reported lower token burn per completed task, which can offset the higher list price.

The official launch post says GPT-5.5 is coming to the API, and Simon Willison's llm 0.31 note shows third-party tooling already adding gpt-5.5 model support and new verbosity controls. So the migration question is already shifting from list price to effective task cost.
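The list-price-versus-effective-cost argument is simple arithmetic. A minimal sketch, using the prices from the circulating screenshots ($5/$30 per million tokens for GPT-5.5 versus $2.50/$15 for GPT-5.4); the token counts are illustrative assumptions, not measured numbers:

```python
# Effective task cost: a pricier model can still be cheaper per task
# if it finishes in fewer tokens and fewer retries.

def task_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars for one task; prices are dollars per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Illustrative only: a one-shot fix versus a run that needed retries.
gpt55 = task_cost(400_000, 60_000, 5.00, 30.00)   # one-shot
gpt54 = task_cost(900_000, 180_000, 2.50, 15.00)  # with retries

print(f"GPT-5.5: ${gpt55:.2f}  GPT-5.4: ${gpt54:.2f}")
# GPT-5.5: $3.80  GPT-5.4: $4.95
```

Under these assumed token counts the model with double the list price comes out cheaper per completed task, which is exactly the trade the efficiency argument is making.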

Why the switches away from Claude Code are happening now

A lot of the most striking GPT-5.5 praise came paired with an explicit comparison target: Claude Code.

The reasons were not identical, but they clustered around reliability, quota limits, and token waste.

The timing helps explain the tone. Anthropic's postmortem, amplified in Anthropic postmortem discussion on HN, said recent quality complaints were real and traced to harness issues, including a bug that repeatedly cleared older thinking from stale sessions. Separate HN threads like HN quota thread and HN pricing-test thread collected complaints about quota exhaustion, token waste, and pricing uncertainty. GPT-5.5 landed straight into that opening.

The harness is spreading beyond code

The last useful reveal is how quickly users started treating Codex as a general desktop agent instead of a code editor with extras.

Examples from the first day already ranged well beyond code.

That is probably the most useful frame for the launch day reaction. Developers were not mostly posting benchmark charts. They were posting workflow collapse: fewer approvals, longer runs, lower token burn, and more tasks that now fit inside one agent surface.
