OpenAI releases GPT-5.5 with 82.7% Terminal-Bench and Codex browser control
OpenAI rolled out GPT-5.5 and GPT-5.5 Pro in ChatGPT and Codex, with higher scores on terminal, OS, cyber, and math evals than GPT-5.4. Codex also gained browser, document, and computer-use features for longer agent workflows.

TL;DR
- OpenAI shipped GPT-5.5 into ChatGPT and Codex on April 23, with OpenAI's rollout post adding GPT-5.5 Pro for Pro, Business, and Enterprise users in ChatGPT.
- According to OpenAIDevs, GPT-5.5's headline first-party scores were 82.7% on Terminal-Bench 2.0, 78.7% on OSWorld-Verified, 55.6% on Toolathlon, 35.4% on FrontierMath Tier 4, and 81.8% on CyberGym.
- OpenAI's efficiency note said GPT-5.5 matches GPT-5.4's per-token latency while using fewer tokens on the same Codex tasks, but Sam Altman's pricing post also set API pricing at $5 per 1M input tokens and $30 per 1M output tokens, double GPT-5.4's list price.
- OpenAIDevs' Codex browser post, OpenAIDevs' docs post, and OpenAIDevs' computer-use post turned the launch into more than a model swap: Codex now has browser use, document and spreadsheet generation, stronger computer use, and a new file viewer.
- The cleanest caveat came from deedydas, whose read of the coding chart pointed out that GPT-5.5 trails Claude Opus 4.7 on SWE-Bench Pro; Thomas Sottiaux's reply argued that OpenAI no longer treats SWE-Bench as representative of real coding work.
You can read the official launch post, the GPT-5.5 system card, OpenAI's note on why it no longer uses SWE-Bench Verified, and the Codex v0.124.0 release. There is also an NVIDIA partner post about the company-wide Codex rollout, plus the public Terminal-Bench 2.0 leaderboard.
What shipped
OpenAI's top-line framing was simple: GPT-5.5 is a new frontier model for "real work," rolling out first in ChatGPT and Codex, with API access delayed pending additional safety and safeguards work. The launch also split the product into two user-facing variants, standard GPT-5.5 and GPT-5.5 Pro (OpenAI's product rollout note).
Day-one availability broke down like this:
- ChatGPT Plus, Pro, Business, Enterprise: GPT-5.5 rollout started April 23, per OpenAI's rollout post.
- ChatGPT Pro, Business, Enterprise: GPT-5.5 Pro also shipped, per OpenAI's product rollout note.
- Codex: GPT-5.5 shipped immediately, with the app and CLI update called out by Thomas Sottiaux.
- API: "coming soon," according to OpenAIDevs and Sam Altman.
Two surface details mattered for engineers. ValsAI said its early-access setup used a 1M context window, while btibor91's launch summary noted a 400K context window inside Codex. And reach_vb showed Codex exposing a 1.5x fast mode during rollout.
Benchmarks
OpenAI's chart is strong on the agent harnesses that matter for daily tool use, not just on math or static QA. Compared with GPT-5.4, the same chart shows:
- Terminal-Bench 2.0: 75.1% to 82.7%, +7.6 points, per OpenAIDevs.
- OSWorld-Verified: 75.0% to 78.7%, +3.7 points, per OpenAIDevs.
- BrowseComp: 82.7% to 84.4%, +1.7 points, per OpenAIDevs.
- FrontierMath Tier 4: 27.1% to 35.4%, +8.3 points, per OpenAIDevs' research workflow post.
- CyberGym: 79.0% to 81.8%, +2.8 points, per OpenAIDevs' research workflow post.
- GeneBench: 19.0% to 25.0%, +6.0 points, per OpenAIDevs' research workflow post.
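The point gains above are simple before/after subtractions; a quick sketch that recomputes them from the reported chart values:

```python
# GPT-5.4 -> GPT-5.5 first-party scores (percent), as reported in the launch chart.
scores = {
    "Terminal-Bench 2.0":  (75.1, 82.7),
    "OSWorld-Verified":    (75.0, 78.7),
    "BrowseComp":          (82.7, 84.4),
    "FrontierMath Tier 4": (27.1, 35.4),
    "CyberGym":            (79.0, 81.8),
    "GeneBench":           (19.0, 25.0),
}

for name, (old, new) in scores.items():
    # Print each eval with its point delta, e.g. "Terminal-Bench 2.0 ... (+7.6 pts)"
    print(f"{name:20s} {old:5.1f} -> {new:5.1f}  ({new - old:+.1f} pts)")
```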
Third-party numbers broadly reinforced the launch. Artificial Analysis gave GPT-5.5 xhigh a 60 on its Intelligence Index, ahead of Claude Opus 4.7 and Gemini 3.1 Pro at 57, and said the model led GDPval-AA with 1785 Elo while using about 40% fewer output tokens than its predecessor (ArtificialAnlys' thread roundup).
ARC Prize added another external point: arcprize put GPT-5.5 at 85.0% on ARC-AGI-2 and 95.0% on ARC-AGI-1, both verified. ValsAI's early-access run put GPT-5.5 at #2 overall on its Vals Index and #1 on SWE-Bench in its own harness, while also calling out a 1M context window in testing (ValsAI; ValsAI's setup note).
Coding caveat
The buried caveat is SWE-Bench Pro. OpenAI's own coding table put GPT-5.5 at 58.6% on SWE-Bench Pro, ahead of GPT-5.4 at 57.7% but behind Claude Opus 4.7 at 64.3% (deedydas' screenshot of the coding table).
That matters because OpenAI's launch copy leaned much harder on Terminal-Bench 2.0 and an internal Expert-SWE eval than on public SWE-style comparisons. OpenAIDevs' coding post said GPT-5.5 is its strongest agentic coding model to date, which is true on the terminal harness it chose to headline, but not on every public coding benchmark in circulation.
OpenAI's response was unusually explicit. Replying to the critique, Thomas Sottiaux linked the company's earlier post on why it no longer evaluates SWE-Bench Verified and said engineers would be "missing out" if they treated SWE-Bench as representative of real work. So the dispute is not over the score, it is over which harness should count as the coding headline.
Token efficiency and pricing
OpenAI paired a price increase with a token-efficiency argument. OpenAI's efficiency note and Sam Altman both said GPT-5.5 matches GPT-5.4 on per-token latency and uses significantly fewer tokens per task, especially in Codex.
The list price still moved sharply:
- GPT-5.4 API: $2.50 input, $15 output per 1M tokens, per rohanpaul_ai's pricing screenshot.
- GPT-5.5 API: $5 input, $30 output per 1M tokens, per Sam Altman's pricing post.
- GPT-5.5 Pro API: $30 input, $180 output per 1M tokens, per btibor91's launch summary.
Artificial Analysis' early-access testing is the cleanest outside check on the token story. ArtificialAnlys said per-token pricing doubled from GPT-5.4, but a roughly 40% token-use reduction cut the net increase to about 20% on its Intelligence Index workload. That is still a price hike, just not the full sticker shock implied by the rate card.
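Artificial Analysis' arithmetic is easy to reproduce: doubled per-token prices times roughly 60% of the original token volume nets out to about a 20% cost increase. A minimal sketch (the 40% reduction is their workload estimate, not a general guarantee):

```python
# Net cost change when per-token price doubles but token use drops ~40%.
# Numbers are Artificial Analysis' Intelligence Index workload estimates.

def effective_cost_ratio(price_multiplier: float, token_reduction: float) -> float:
    """Ratio of new workload cost to old: price factor x remaining token fraction."""
    return price_multiplier * (1.0 - token_reduction)

ratio = effective_cost_ratio(price_multiplier=2.0, token_reduction=0.40)
print(f"net cost change: {ratio - 1.0:+.0%}")  # -> +20%
```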
Codex browser, docs, and auto-review
The bigger product story is that Codex picked up enough surface area to act more like a desktop agent harness than a coding tab.
OpenAI's own rollout posts listed the new pieces:
- Browser use in Codex, including clicking through web apps, testing flows, taking screenshots, and iterating on what it sees, per OpenAIDevs' browser post.
- Higher-quality spreadsheets, slide decks, and documents in Microsoft Office and Google Drive, plus a new file viewer, per OpenAIDevs' docs post.
- Stronger computer use across desktop apps, including seeing what's on screen and moving context across tools, per OpenAIDevs' computer-use post.
- Auto-review mode, where a separate agent checks higher-risk steps so Codex can keep moving with fewer approvals, per OpenAIDevs' auto-review demo and gdb.
- Global dictation, docs and PDF support, and non-dev mode in the updated app, according to Thomas Sottiaux's feature list.
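OpenAI has not published auto-review's internals; as a rough mental model, it behaves like a second reviewer agent gating only higher-risk actions while routine steps skip the approval queue. A hypothetical sketch (the risk taxonomy and all names here are invented for illustration):

```python
# Hypothetical sketch of an auto-review gate: low-risk steps execute
# directly, higher-risk steps pass through a separate reviewer check.
# Nothing here reflects Codex's actual implementation.

HIGH_RISK = {"delete", "deploy", "push", "network_write"}

def reviewer_approves(action: str) -> bool:
    # Stand-in for the reviewing agent; here, a trivial deny-list.
    return action != "delete"

def run_step(action: str) -> str:
    if action in HIGH_RISK:
        return "executed" if reviewer_approves(action) else "blocked"
    return "executed"  # low-risk steps proceed without review

for action in ["read_file", "push", "delete"]:
    print(action, "->", run_step(action))
```

The design point is that the agent keeps moving: only the subset of actions the reviewer rejects ever stops the run.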
The app update looked big enough that even OpenAI people started describing Codex less as a coder tool and more as a general computer-work product. embirico called out browser, sheets, slides, docs, PDFs, and auto-review in one breath, while gdb said Codex plus 5.5 now spans the "full spectrum of computer use."
Vibe check
The day-one hands-on reports clustered around one repeated point: duration. People kept describing GPT-5.5 as the first OpenAI model they trusted to stay on task for hours instead of breaking after a few loops.
A few concrete patterns showed up fast:
- petergostev said he had migrations and queued prompts running for 7 to 8 hours, far beyond the 30-minute to 3-hour ceiling he saw before.
- aidan_mclau said GPT-5.5 babysat an RL run for 31 hours, including periods of deliberate waiting instead of over-intervening.
- skirano described GPT-5.5 merging a messy, conflict-heavy branch against main in one shot in about 20 minutes, then later pushing Flipper Zero apps over USB.
- MatthewBerman said Codex could take a PRD and keep building for hours, while GPT-5.5 Pro in ChatGPT could work for 30, 60, or 90 minutes with docs and plugins.
The other notable cluster was personality and intent tracking. MatthewBerman argued OpenAI had clearly tuned the model's style, and omarsar0 described GPT-5.5 in Codex as sharper, better at intent, and less pause-prone. The pushback was that the model can still wander if the brief is loose: theo said GPT-5.5 writes better code than any model he has used, but can be hard to wrangle and more likely to explore if context is underspecified.
Where it shows up
The ecosystem rollout landed almost immediately, which is half the story for a model pitched as infrastructure for agents.
- Box: Aaron Levie, Box CEO, said GPT-5.5 improved the company's complex knowledge-work evals by 10 points overall versus GPT-5.4, including 83% vs 64% in financial services and 78% vs 61% in healthcare.
- Lovable: Lovable said early-access evals showed 23.1% fewer tool calls per request, 10% better roadblock breaking, and 12.5% higher scores on its hardest benchmarks at the same cost.
- Ramp: OpenAIDevs' Ramp example showed engineers using GPT-5.5 in Codex to test full-stack changes end to end in QA.
- NVIDIA: Sam Altman's NVIDIA rollout post said OpenAI rolled Codex out to the whole company, while the companion NVIDIA blog post said more than 10,000 employees had early access.
- GitHub: scaling01's screenshot quoted GitHub VP of Product Joe Binder saying GPT-5.5 resolves 6-plus percentage points more SWE-Bench tasks in its evaluations and often reaches solutions in 50% to 60% fewer steps on more complex workflows.
- PolyScope: marcelpociot showed GPT-5.5 becoming available in the app once Codex had been opened.
This is where the launch starts to feel less like a benchmark drop and more like OpenAI trying to turn Codex into a default work surface for companies, not just a coding assistant.
Safety, infrastructure, and the bug bounty
OpenAI used the launch to surface two unusual under-the-hood details. First, scaling01's screenshot from the launch materials says GPT-5.5 was co-designed for, trained with, and served on NVIDIA GB200 and GB300 NVL72 systems. The same note says Codex analyzed weeks of production traffic and wrote custom load-balancing and partitioning heuristics that lifted token generation speeds by more than 20% (OpenAIDevs' infrastructure post).
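OpenAI has not described those heuristics, but the general shape of a traffic-derived balancer is routing each request to the replica with the lowest estimated completion time given its observed speed and backlog. A generic sketch under that assumption (all names and numbers invented):

```python
# Hypothetical least-estimated-cost routing across serving replicas.
# This is a generic illustration, not OpenAI's actual heuristic.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    tokens_per_sec: float   # observed serving speed from traffic logs
    queued_tokens: int = 0  # current backlog

    def eta(self, request_tokens: int) -> float:
        # Seconds until this request would finish on this replica.
        return (self.queued_tokens + request_tokens) / self.tokens_per_sec

def route(replicas: list[Replica], request_tokens: int) -> str:
    # Pick the replica with the lowest ETA, then account for the new load.
    best = min(replicas, key=lambda r: r.eta(request_tokens))
    best.queued_tokens += request_tokens
    return best.name

replicas = [Replica("gb200-a", 900.0), Replica("gb300-b", 1200.0)]
print([route(replicas, 500) for _ in range(3)])
```

The faster replica absorbs requests first, but backlog accounting spreads subsequent load rather than hot-spotting one machine.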
Second, the safety framing stayed at "High," not "Critical," for cyber capabilities, with stronger safeguards for higher-risk cyber activity and trusted access for verified defensive work (OpenAIDevs' cyber post; GPT-5.5 system card).
The most concrete post-launch safety action was a new biology red-team incentive. jxnlco flagged OpenAI's GPT-5.5 Bio Bug Bounty, which offers $25,000 for the first universal jailbreak that clears all five bio safety questions, with testing set to run from April 28 through July 27, 2026. That is new information the launch post did not make central, and it says a lot about which failure modes OpenAI expects outsiders to pressure-test first.