AI Primer

OpenAI releases GPT-5.5 in ChatGPT and Codex for tool use

OpenAI launched GPT-5.5 in ChatGPT and Codex for coding, computer use, docs, sheets, and longer tool-driven tasks. Early tests showed stronger results on games and frontend builds, while pricing jumped again and Opus 4.7 comparisons started immediately.


TL;DR

OpenAI's launch post came with a same-day Codex 0.124.0 release, and the interesting bits are spread across both. You can read Every's long day-zero vibe check, skim the Codex release notes, and watch creators immediately stress-test the model with games, landing pages, and canvas art.

What shipped

OpenAI framed GPT-5.5 as a model for "real work" that can hold complex goals, use tools, check its work, and keep going across longer runs, according to OpenAI's thread. The launch surface was not just ChatGPT. Codex got it the same day, and Codex users immediately noticed the pairing.

The ship list, pulled from OpenAI's launch post, LLMJunky's Codex update, and the Codex app feature list, looks like this:

  • GPT-5.5 in ChatGPT.
  • GPT-5.5 in Codex.
  • Reasoning-level keyboard shortcuts in Codex.
  • Stable hooks, including apply_patch monitoring.
  • Fast service on by default for business and enterprise plans.
  • Browser control in the Codex app.
  • Sheets and Slides support.
  • Docs and PDFs support.
  • OS-wide dictation.
  • Auto-review mode.
  • Remote custom marketplaces surfaced in /plugins.

Rollout was not perfectly clean. While OpenAI's announcement said GPT-5.5 was live in Codex, AIandDesign's post showed at least some users not seeing it immediately.

Benchmarks

OpenAI's own chart, relayed in Allie K. Miller's post, emphasizes the agent stack: Terminal-Bench 2.0 at 82.7 percent versus 75.1 percent for GPT-5.4, GDPval at 84.9 percent versus 83.0 percent, OSWorld-Verified at 78.7 percent versus 75.0 percent, and ARC-AGI-2 at 85.0 percent versus 73.3 percent.

That still does not make this a clean sweep. thekitze's comparison chart counted 12 published head-to-head wins for GPT-5.5 against Opus 4.7, but also 8 losses. The losses are the ones coding power users will stare at first:

  • SWE-Bench Pro: Opus 4.7, 64.3 percent; GPT-5.5, 58.6 percent.
  • FinanceAgent v1.1: Opus 4.7, 64.4 percent; GPT-5.5, 60.0 percent.
  • MCP Atlas: Opus 4.7, 79.1 percent; GPT-5.5, 75.3 percent.
  • GPQA Diamond: Opus 4.7, 94.2 percent; GPT-5.5, 93.6 percent.
  • Humanity's Last Exam, no tools: Opus 4.7, 46.9 percent; GPT-5.5, 41.4 percent.
  • Humanity's Last Exam, with tools: Opus 4.7, 54.7 percent; GPT-5.5, 52.2 percent.
  • Graphwalks BFS 256k f1: Opus 4.7, 76.9 percent; GPT-5.5, 73.7 percent.
  • Graphwalks parents 256k f1: Opus 4.7, 93.6 percent; GPT-5.5, 90.1 percent.

There was also a quick third-party-style comparison in which GPT-5.5 appears above Opus 4.7 on the Intelligence Index curve.

Vibe Check

The most concrete hands-on claim came from Dan Shipper, cofounder and CEO of Every. In Shipper's main thread, he said GPT-5.5 scored 62 out of 100 on Every's senior engineer benchmark, versus 33 out of 100 for Opus 4.7, and said the model held multi-hour plans together well enough to refactor large codebases without getting lost.

The same thread also drew a sharper line around where it still falls short:

  • Opus 4.7 still won on plan quality.
  • Opus 4.7 was still better on some front-end and full-stack product work.
  • GPT-5.5 was weaker on Ruby.
  • Underspecified vibe-coding tasks still leaned toward Opus.

Those tradeoffs show up elsewhere in the evidence. thekitze's refactor post is the funniest summary of day-one agentic coding: 67 tool invocations, 4,200 new lines, 96 new files, gorgeous modularization, and nothing worked. At the same time, Dan Shipper's repost of Kelli Like's switch notes that one of Every's writers moved her workflow from Claude to Codex because GPT-5.5 felt better for writing.

The short read is that GPT-5.5 seems to have improved the "does the thing" factor that Shipper's live demo focused on, but the evidence pool still shows a gap between a spectacular run and a reliable one.

Creator outputs

UFO shooter generated with ChatGPT-5.5

Peter Yang's F-Zero-style test

Creative users went straight for interactive artifacts. In chrisfirst's post, GPT-5.5 produced a playable three.js UFO shooter from a single prompt. In Peter Yang's test, he said GPT-5.5 plus Codex was the first combo that passed his recurring F-Zero game challenge, including generating rival bots.

The front-end demos were just as aggressive. thekitze's landing-page post shows a full Tinkerer Club page built from one prompt, while CharaspowerAI's Codex clip says a mini-game arrived in under two minutes.

Images and canvas work also got immediate stress tests:

  • One post shows GPT-5.5 producing a polished two-sided brand card from a detailed prompt.
  • stevibe's tree-growing canvas test compares GPT-5.4 and 5.5, with 5.5 rendering a denser, more animated canopy.

That mix matters because it pushes GPT-5.5 past "better coding model" framing. The day-one use cases already span games, web design, graphic design, and code-assisted visual experiments.

Codex surfaces

A lot of the launch energy came from Codex itself. The Codex app feature list bundled GPT-5.5 with browser control, office-document support, dictation, and auto-review, which makes the model feel more like a desktop worker than a chatbot upgrade.

The rollout also hints at where OpenAI wants GPT-5.5 to show up next.

One reaction thread, Aakash Gupta's pricing post, argued that the price increases tell a bigger business story than the model number does. That claim is commentary, not official pricing documentation, but it landed fast because this is now the third GPT-5-series step where people immediately asked what the extra capability will cost in production.

Auto-review

The most novel Codex detail is auto-review. Aakash Gupta's thread describes it as a second agent that watches the main coding agent's risky actions and silently approves routine ones, only escalating when the run crosses a threshold.

One shared run gives the concrete version: Codex attempted a system-level Discord upgrade, the auto-reviewer classified the action as medium risk with high authorization, approved reopening the step in a terminal, and then still stopped for the human to type the sudo password. That is a more specific design than a generic "trust the agent more" toggle.

The flow visible in the evidence has three layers:

  • The main agent proposes an action.
  • The reviewer agent scores the risk and decides whether to auto-approve.
  • The human only steps in for the bounded high-risk part, in this case an interactive password prompt.
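
That layered flow can be sketched in a few lines of Python. This is a hypothetical illustration of the reported design, not OpenAI's implementation; the `Action` fields, risk labels, and `review` function are all assumptions made for the sketch.

```python
# Hypothetical sketch of the three-layer auto-review loop: a main agent
# proposes actions, a reviewer agent scores risk and auto-approves routine
# steps, and only bounded high-risk steps escalate to the human.
from dataclasses import dataclass

@dataclass
class Action:
    description: str
    risk: str            # reviewer-assigned: "low", "medium", or "high"
    interactive: bool    # needs human input at the terminal (e.g. a sudo password)

def review(action: Action) -> str:
    """Reviewer agent: silently approve routine actions, escalate the rest."""
    if action.interactive:
        return "escalate_to_human"   # the human keeps the interactive step
    if action.risk in ("low", "medium"):
        return "auto_approve"        # routine steps don't interrupt the run
    return "escalate_to_human"

# A run shaped like the reported Discord upgrade: most steps sail through,
# the interactive password prompt is handed back to the human.
run = [
    Action("download package", "low", False),
    Action("apply system upgrade", "medium", False),
    Action("enter sudo password", "high", True),
]
decisions = [review(a) for a in run]
print(decisions)
```

The point of the sketch is the escalation boundary: the reviewer only interrupts the human for the one step it cannot safely approve, which is what would let a long agent session run without constant confirmation prompts.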

If GPT-5.5's real job is longer tool runs, this guardian layer may end up mattering as much as the model itself. It is the part of the launch that tries to make 40-step agent sessions feel usable instead of interrupt-driven.
