OpenAI releases GPT-5.5 in ChatGPT and Codex for tool use
OpenAI launched GPT-5.5 in ChatGPT and Codex for coding, computer use, docs, sheets, and longer tool-driven tasks. Early tests showed stronger games and frontend builds, while pricing jumped again and Opus 4.7 comparisons started immediately.

TL;DR
- OpenAI's launch post introduced GPT-5.5 as a model for longer tool-using runs, and said it is available in ChatGPT and Codex from day one.
- In Allie K. Miller's benchmark rundown, OpenAI's launch table shows gains in terminal use, browsing, knowledge work, math, and OSWorld-style computer tasks, while thekitze's scorecard notes Claude Opus 4.7 still leads on several coding and reasoning evals.
- Early hands-on reports split into two buckets: Dan Shipper's day-zero test called GPT-5.5 a step change for coding and writing, while thekitze's refactor post said a huge one-shot rewrite looked impressive but did not actually work.
- Creator demos moved fast: chrisfirst's game clip, Peter Yang's F-Zero test, and thekitze's landing-page build all show GPT-5.5 and Codex producing shippable-looking interactive work from short prompts.
- The Codex side is part of the story, not a footnote. LLMJunky's Codex update says GPT-5.5 landed alongside stable hooks and faster enterprise defaults, while Aakash Gupta's auto-review thread highlights a new guardian-agent layer meant to cut approval spam.
OpenAI's launch post came with a same-day Codex 0.124.0 release, and the interesting bits are spread across both. You can read Every's long day-zero vibe check, skim the Codex release notes, and watch creators immediately stress-test the model with games, landing pages, and canvas art.
What shipped
OpenAI framed GPT-5.5 as a model for "real work" that can hold complex goals, use tools, check its work, and keep going across longer runs, according to OpenAI's thread. The launch surface was not just ChatGPT. Codex got it the same day, and Codex users immediately noticed the pairing.
The ship list, pulled from OpenAI's launch post, LLMJunky's Codex update, and the Codex app feature list, looks like this:
- GPT-5.5 in ChatGPT.
- GPT-5.5 in Codex.
- Reasoning-level keyboard shortcuts in Codex.
- Stable hooks, including APPLY PATCH monitoring.
- Fast service on by default for business and enterprise plans.
- Browser control in the Codex app.
- Sheets and Slides support.
- Docs and PDFs support.
- OS-wide dictation.
- Auto-review mode.
- Remote custom marketplaces surfaced in /plugins.
Rollout was not perfectly clean. While OpenAI's announcement said GPT-5.5 was live in Codex, AIandDesign's post showed at least some users not seeing it immediately.
Benchmarks
OpenAI's own chart, relayed in Allie K. Miller's post, emphasizes the agent stack: Terminal-Bench 2.0 at 82.7 percent versus 75.1 percent for GPT-5.4, GDPval at 84.9 percent versus 83.0 percent, OSWorld-Verified at 78.7 percent versus 75.0 percent, and ARC-AGI-2 at 85.0 percent versus 73.3 percent.
That still does not make this a clean sweep. thekitze's comparison chart counted 12 published head-to-head wins for GPT-5.5 against Opus 4.7, but also 8 losses. The losses are the ones coding power users will stare at first:
- SWE-Bench Pro: Opus 4.7, 64.3 percent; GPT-5.5, 58.6 percent.
- FinanceAgent v1.1: Opus 4.7, 64.4 percent; GPT-5.5, 60.0 percent.
- MCP Atlas: Opus 4.7, 79.1 percent; GPT-5.5, 75.3 percent.
- GPQA Diamond: Opus 4.7, 94.2 percent; GPT-5.5, 93.6 percent.
- Humanity's Last Exam, no tools: Opus 4.7, 46.9 percent; GPT-5.5, 41.4 percent.
- Humanity's Last Exam, with tools: Opus 4.7, 54.7 percent; GPT-5.5, 52.2 percent.
- Graphwalks BFS 256k f1: Opus 4.7, 76.9 percent; GPT-5.5, 73.7 percent.
- Graphwalks parents 256k f1: Opus 4.7, 93.6 percent; GPT-5.5, 90.1 percent.
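The win-loss tally above is simple arithmetic over published score pairs. A minimal sketch, using only the eight losses listed here as sample data (the full 20-eval set would be counted the same way):

```python
# Tally head-to-head benchmark results between two models.
# Sample data: the eight published evals above where Opus 4.7 leads,
# as (GPT-5.5 score, Opus 4.7 score) pairs in percent.
HEAD_TO_HEAD = {
    "SWE-Bench Pro": (58.6, 64.3),
    "FinanceAgent v1.1": (60.0, 64.4),
    "MCP Atlas": (75.3, 79.1),
    "GPQA Diamond": (93.6, 94.2),
    "HLE (no tools)": (41.4, 46.9),
    "HLE (with tools)": (52.2, 54.7),
    "Graphwalks BFS 256k f1": (73.7, 76.9),
    "Graphwalks parents 256k f1": (90.1, 93.6),
}

def tally(results):
    """Count wins per side and the average margin on the losses."""
    wins = sum(1 for a, b in results.values() if a > b)
    losses = sum(1 for a, b in results.values() if a < b)
    margins = [b - a for a, b in results.values() if a < b]
    avg_margin = sum(margins) / len(margins) if margins else 0.0
    return wins, losses, avg_margin

wins, losses, avg = tally(HEAD_TO_HEAD)
print(f"GPT-5.5 on this subset: {wins} wins, {losses} losses, "
      f"avg losing margin {avg:.1f} pts")
```

Run over this subset, the losses average well under four percentage points, which is why power users argue over individual evals rather than the aggregate.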
There was also a quick third-party-style comparison chart circulating, on which GPT-5.5 appears above Opus 4.7 on the Intelligence Index curve.
Vibe check
The most concrete hands-on claim came from Dan Shipper, cofounder and CEO of Every. In Shipper's main thread, he said GPT-5.5 scored 62 out of 100 on Every's senior engineer benchmark, versus 33 out of 100 for Opus 4.7, and said the model held multi-hour plans together well enough to refactor large codebases without getting lost.
The same thread also drew a sharper line around where it still falls short:
- Opus 4.7 still won on plan quality.
- Opus 4.7 was still better on some front-end and full-stack product work.
- GPT-5.5 was weaker on Ruby.
- Underspecified vibe-coding tasks still leaned toward Opus.
Those tradeoffs show up elsewhere in the evidence. thekitze's refactor post is the funniest summary of day-one agentic coding: 67 tool invocations, 4,200 new lines, 96 new files, gorgeous modularization, and nothing worked. At the same time, Dan Shipper's repost of Kelli Like's switch says one of Every's writers moved her workflow from Claude to Codex because GPT-5.5 felt better for writing.
The short read is that GPT-5.5 seems to have improved the "does the thing" factor that Shipper's live demo focused on, but the evidence pool still shows a gap between a spectacular run and a reliable one.
Creator outputs
UFO shooter generated with GPT-5.5
Peter Yang's F-Zero-style test
Creative users went straight for interactive artifacts. In chrisfirst's post, GPT-5.5 produced a playable three.js UFO shooter from a single prompt. In Peter Yang's test, he said GPT-5.5 plus Codex was the first combo that passed his recurring F-Zero game challenge, including generating rival bots.
The front-end demos were just as aggressive. thekitze's landing-page post shows a full Tinkerer Club page built from one prompt, while CharaspowerAI's Codex clip says a mini-game arrived in under two minutes.
Images and canvas work also got immediate stress tests:
- One demo shows GPT-5.5 producing a polished two-sided brand card from a detailed prompt.
- stevibe's tree-growing canvas test compares GPT-5.4 and 5.5, with 5.5 rendering a denser, more animated canopy.
That mix matters because it pushes GPT-5.5 past "better coding model" framing. The day-one use cases already span games, web design, graphic design, and code-assisted visual experiments.
Codex surfaces
A lot of the launch energy came from Codex itself. The Codex app feature list bundled GPT-5.5 with browser control, office-document support, dictation, and auto-review, which makes the model feel more like a desktop worker than a chatbot upgrade.
The rollout also hints at where OpenAI wants GPT-5.5 to show up next:
- ChatGPT and Codex were the official day-one surfaces, per OpenAI's announcement.
- OpenRouter listings for gpt-5.5-20260423 and gpt-5.5-pro-20260423 were spotted in the OpenRouter post.
- Pre-launch sightings in LLMJunky's Codex spotting thread suggest the app build was telegraphing the model before the announcement.
One reaction thread, Aakash Gupta's pricing post, argued the price increases tell the bigger business story than the model number does. That claim is commentary, not official pricing documentation, but it landed fast because this is now the third GPT-5-series step where people immediately asked what the extra capability will cost in production.
Auto-review
The most novel Codex detail is auto-review. Aakash Gupta's thread describes it as a second agent that watches the main coding agent's risky actions and silently approves routine ones, only escalating when the run crosses a threshold.
The thread's Discord example gives the concrete version: Codex attempted a system-level Discord upgrade; the auto-reviewer classified the action as medium risk with high authorization, approved reopening the step in a terminal, and then still stopped for the human to type the sudo password. That is a more specific design than a generic "trust the agent more" toggle.
The flow visible in the evidence has three layers:
- The main agent proposes an action.
- The reviewer agent scores the risk and decides whether to auto-approve.
- The human only steps in for the bounded high-risk part, in this case an interactive password prompt.
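The three layers above can be sketched as a risk-gated loop. Everything in this sketch, including the risk levels, the keyword-based scoring, and the threshold, is a hypothetical reconstruction from the thread's description, not Codex's actual implementation:

```python
from enum import IntEnum

class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def score_risk(action: str) -> Risk:
    """Hypothetical reviewer-agent scoring: system-level commands rank higher."""
    if "sudo" in action or "rm -rf" in action:
        return Risk.HIGH
    if any(word in action for word in ("install", "upgrade", "chmod")):
        return Risk.MEDIUM
    return Risk.LOW

def run_step(action: str, auto_approve_up_to: Risk = Risk.MEDIUM) -> str:
    """Layers 2 and 3: auto-approve routine actions, escalate past the threshold."""
    risk = score_risk(action)
    if risk <= auto_approve_up_to:
        return f"auto-approved ({risk.name}): {action}"
    return f"escalated to human ({risk.name}): {action}"

# Layer 1: the main agent proposes actions; the reviewer gates each one.
for action in ["read package.json", "upgrade discord", "sudo apt install discord"]:
    print(run_step(action))
```

In this toy version the medium-risk upgrade sails through while the sudo step stops for a human, which mirrors the Discord example: routine approvals disappear, and the interrupt budget is spent only on the one genuinely privileged action.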
If GPT-5.5's real job is longer tool runs, this guardian layer may end up mattering as much as the model itself. It is the part of the launch that tries to make 40-step agent sessions feel usable instead of interrupt-driven.