GPT-5.5 users report 4-10x shorter runs and smoother tool calls one day after launch
Users and third-party evals reported shorter runs, stronger long-context scores, and faster rollout into Cursor and other tools a day after GPT-5.5 hit the API. Higher per-token pricing may be partly offset by lower loop time and fewer tool-call stalls, so watch early bench data before changing defaults.

TL;DR
- Early hands-on reports from rishdotblog's agent test, omarsar0's Codex thread, and thdxr's latency note all converged on the same change: GPT-5.5 starts acting faster inside tool loops, with less visible stall before it begins work.
- Hangsiin's KSST run reported a flat top-line score versus GPT-5.4 Pro but much shorter generations, while Dillon Uzar's Context Arena results measured an 11.1-point jump in 128k AUC for GPT-5.5 xhigh over GPT-5.4 xhigh.
- Third-party leaderboards moved quickly: Wes Roth's Terminal-Bench post said GPT-5.5 took the top Terminal-Bench spot at 82.7, and Lech Mazur's Extended NYT Connections update showed GPT-5.5 passing Opus 4.6 on that benchmark too.
- Day-one rollout was fast: Cursor's availability post put GPT-5.5 into Cursor with a 72.8% CursorBench score, while Lewis Tunstall's ml-intern update added the model to Hugging Face's research agent stack.
- OpenAI's own migration guidance, summarized in Simon Willison's prompting-guide writeup, framed GPT-5.5 as a model family that needs retuning rather than a drop-in swap, even as Context Arena's pricing note flagged a 2x per-token price increase over GPT-5.4.
OpenAI also published a GPT-5.5 prompting guide that tells developers to start from a fresh baseline, not drag old prompt stacks forward. Simon Willison's note on Romain Huet says Codex and the main model line are now unified, so there is no separate coding variant to wait for. You can also inspect the ml-intern pull request that wired GPT-5.5 into Hugging Face's agent harness a few hours after launch.
Tool loops
The most consistent day-one story was not a benchmark chart. It was that GPT-5.5 felt less sticky inside agent runs.
According to rishdotblog's early agent report, GPT-5.5 matched Claude-class tool calling while staying fast enough for long-running agents. thdxr's latency note described the same shift more simply, saying the model now behaves more like Claude because it no longer pauses for long before it starts doing things.
omarsar0's Codex migration thread added a second useful detail: the smoother feel survived a real harness migration. He said MCP tools worked out of the box, Chrome CDP skills worked well, and the model struck a more usable effort balance than GPT-5.4 inside coding workflows. badlogicgames' bash computer-use post reported a similar result from another angle, saying GPT-5.5 with minimal thinking worked especially well for bash-based computer use.
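The "stall" these reports keep describing can be made concrete as time-to-first-action: wall time from request start until the model emits its first tool call or output token. A minimal, harness-agnostic sketch (the event names here are illustrative, not any specific SDK's schema):

```python
import time

def time_to_first_action(events):
    """Measure the pre-action stall over a streamed event feed.

    `events` is any iterable of (kind, payload) pairs. Returns seconds
    until the first tool call or output token, or None if the stream
    ends without either.
    """
    start = time.monotonic()
    for kind, _payload in events:
        if kind in ("tool_call", "output_text"):
            return time.monotonic() - start
    return None

# Demo with a stubbed event stream standing in for a real model response.
demo = [("thinking", None), ("thinking", None), ("tool_call", {"name": "read_file"})]
print(time_to_first_action(demo))
```

Running the same probe against two models over identical prompts is roughly the comparison thdxr's latency note is making informally.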
Benchmarks
The third-party numbers point in the same direction as the early usage reports, but with two different stories depending on what you care about: speed-efficient runs versus long-context retrieval.
Hangsiin's KSST result said GPT-5.5 Pro tied GPT-5.4 Pro on score but cut per-sample generation time from roughly 20 to 50 minutes down to about 5 to 10 minutes. That is not a headline score gain, but it is a very large runtime change if the same quality holds outside a saturated test.
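A quick back-of-envelope on those rounded figures shows where a headline speedup range can come from: pairing the low ends and the high ends gives roughly 4-5x, while the extreme pairing stretches from 2x to 10x.

```python
# Rounded per-sample generation times from Hangsiin's KSST report (minutes).
old_min, old_max = 20, 50   # GPT-5.4 Pro
new_min, new_max = 5, 10    # GPT-5.5 Pro

# Like-for-like pairing: low with low, high with high.
paired = (old_min / new_min, old_max / new_max)   # (4.0, 5.0)

# Extreme pairing bounds how much any single sample could have sped up.
bounds = (old_min / new_max, old_max / new_min)   # (2.0, 10.0)

print(paired, bounds)
```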
Dillon Uzar's Context Arena results measured the biggest clean delta in the evidence set:
- GDM-MRCRv2 AUC at 128k, xhigh tier: 80.6% to 91.7%, +11.1 points.
- Best 128k average: 86.8% to 94.5%, +7.7 points.
- 128k gap versus Claude Opus 4.6 medium: 91.7% versus 81.0%, +10.7 points for GPT-5.5 xhigh.
- 512k pointwise, best tier: GPT-5.5 medium at 57.6% versus GPT-5.4 xhigh at 41.5%, +16.1 points.
That same Context Arena post also surfaced an important caveat: GPT-5.5's reasoning tiers nearly collapsed together at 512k, which Dillon Uzar read as a sign that base long-context ability, not extra reasoning effort, becomes the bottleneck at that length.
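The point deltas above are plain subtractions; a short check against the scores as reported (all numbers copied from Dillon Uzar's post) confirms they are internally consistent:

```python
# (metric, older score, GPT-5.5 score, reported delta) from the Context Arena post.
rows = [
    ("GDM-MRCRv2 AUC @128k, xhigh", 80.6, 91.7, 11.1),
    ("Best 128k average",           86.8, 94.5,  7.7),
    ("128k vs Opus 4.6 medium",     81.0, 91.7, 10.7),
    ("512k pointwise, best tier",   41.5, 57.6, 16.1),
]
for name, old, new, delta in rows:
    assert round(new - old, 1) == delta, name
print("all deltas check out")
```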
Two more external boards moved quickly. Wes Roth's Terminal-Bench post said GPT-5.5 hit a state-of-the-art 82.7 on Terminal-Bench, while Lech Mazur's Extended NYT Connections update showed GPT-5.5 overtaking Opus 4.6 and improving over GPT-5.4 across every listed reasoning tier on that benchmark.
Ecosystem rollouts
The rollout story was unusually immediate. GPT-5.5 showed up in both commercial IDEs and open research harnesses on launch day.
Cursor's GPT-5.5 availability post announced same-day support in Cursor and attached a 72.8% CursorBench score. Lewis Tunstall's ml-intern update said Hugging Face added GPT-5.5 to ml-intern, giving it access to internal buckets, jobs, and repos for agentic research work at scale.
The linked ml-intern PR matters because it shows where some of these first impressions are coming from. Engineers were not just chatting with the model in a playground; they were slotting it into existing tool stacks and multi-agent workflows within hours.
Prompt migration
OpenAI's own docs suggest the jump from GPT-5.4 to GPT-5.5 is not supposed to be invisible.
Simon Willison's GPT-5.5 prompting-guide post
According to Simon Willison's GPT-5.5 prompting-guide post, OpenAI tells developers to treat GPT-5.5 as a new family to tune for, start with the smallest prompt that preserves the product contract, and adjust reasoning effort, verbosity, tool descriptions, and output format from there. Willison also highlighted one concrete UX trick from the guide: before a multi-step task makes tool calls, send a short user-visible progress update so the model does not look frozen.
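As summarized, that advice maps onto request knobs the GPT-5 API line already exposes. A minimal fresh-baseline sketch, assuming the same Responses-API request shape carries over; the model id and knob values here are illustrative, not verified against GPT-5.5's docs:

```python
def baseline_request(task: str) -> dict:
    """Build a fresh-baseline request per the guide's advice: the smallest
    prompt that preserves the product contract, with effort and verbosity
    retuned rather than copied forward from a GPT-5.4 prompt stack."""
    return {
        "model": "gpt-5.5",                 # assumed model id
        "input": task,                      # minimal prompt, no legacy scaffolding
        "reasoning": {"effort": "medium"},  # retune per task instead of inheriting
        "text": {"verbosity": "low"},       # make output length an explicit choice
    }

req = baseline_request("Fix the failing unit test and explain the change.")
print(req["model"], req["reasoning"]["effort"])
```

The point of starting this small is diagnostic: if behavior regresses versus the old GPT-5.4 stack, each knob you add back identifies what the old prompt was actually doing.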
Simon Willison's note on Romain Huet
A second OpenAI detail surfaced through Simon Willison's note on Romain Huet: since GPT-5.4, Codex and the main model have been unified into one system, and GPT-5.5 extends that line rather than reviving a separate coding model. That helps explain why so many day-one reports were really harness reports, not just model reports. The coding stack and the general model stack are now the same release surface.
Willison's Huet quote and his prompting-guide summary also sharpen the pricing question raised in Context Arena's pricing note. GPT-5.5 may cost 2x more per token than GPT-5.4, but OpenAI's own docs and the early user reports both frame the release around shorter visible runs, less dead air before action, and better behavior inside tool-heavy harnesses.