GPT-5.5 users report 4-10x shorter runs and smoother tool calls one day after launch
Users and third-party evals reported shorter runs, stronger long-context scores, and faster rollout into Cursor and other tools a day after GPT-5.5 hit the API. Higher per-token pricing may be partly offset by lower loop time and fewer tool-call stalls, so watch early bench data before changing defaults.

TL;DR
- Early hands-on reports from rishdotblog's agent test, omarsar0's Codex thread, and thdxr's latency note all converged on the same change: GPT-5.5 starts acting faster inside tool loops, with less visible stall before it begins work.
- Hangsiin's KSST run reported a flat top-line score versus GPT-5.4 Pro but much shorter generations, while Dillon Uzar's Context Arena results measured an 11.1-point jump in 128k AUC for GPT-5.5 xhigh over GPT-5.4 xhigh.
- Third-party leaderboards moved quickly: Wes Roth's Terminal-Bench post said GPT-5.5 took the top Terminal-Bench spot at 82.7, and Lech Mazur's Extended NYT Connections update showed GPT-5.5 passing Opus 4.6 on that benchmark too.
- Day-one rollout was fast: Cursor's availability post put GPT-5.5 into Cursor with a 72.8% CursorBench score, while Lewis Tunstall's ml-intern update added the model to Hugging Face's research agent stack.
- OpenAI's own migration guidance, summarized in Simon Willison's prompting-guide writeup, framed GPT-5.5 as a model family that needs retuning rather than a drop-in swap, even as Context Arena's pricing note flagged a 2x per-token price increase over GPT-5.4.
OpenAI also published a GPT-5.5 prompting guide that tells developers to start from a fresh baseline, not drag old prompt stacks forward. Simon Willison's note on Romain Huet says Codex and the main model line are now unified, so there is no separate coding variant to wait for. You can also inspect the ml-intern pull request that wired GPT-5.5 into Hugging Face's agent harness a few hours after launch.
Tool loops
The most consistent day-one story was not a benchmark chart. It was that GPT-5.5 felt less sticky inside agent runs.
According to rishdotblog's early agent report, GPT-5.5 matched Claude-class tool calling while staying fast enough for long-running agents. thdxr's latency note described the same shift more simply, saying the model now behaves more like Claude because it no longer pauses for long before it starts doing things.
omarsar0's Codex migration thread added a second useful detail: the smoother feel survived a real harness migration. He said MCP tools worked out of the box, Chrome CDP skills worked well, and the model struck a more usable effort balance than GPT-5.4 inside coding workflows. badlogicgames' bash computer-use post reported a similar result from another angle, saying GPT-5.5 with minimal thinking worked especially well for bash-based computer use.
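The "stall" these reports keep describing can be made concrete as time-to-first-action: wall time from request start until the model emits its first tool call or output token. A minimal, harness-agnostic sketch (the event names here are illustrative, not any specific SDK's schema):

```python
import time

def time_to_first_action(events):
    """Measure the pre-action stall over a streamed event feed.

    `events` is any iterable of (kind, payload) pairs. Returns seconds
    until the first tool call or output token, or None if the stream
    ends without either.
    """
    start = time.monotonic()
    for kind, _payload in events:
        if kind in ("tool_call", "output_text"):
            return time.monotonic() - start
    return None

# Demo with a stubbed event stream standing in for a real model response.
demo = [("thinking", None), ("thinking", None), ("tool_call", {"name": "read_file"})]
print(time_to_first_action(demo))
```

Running the same probe against two models over identical prompts is roughly the comparison thdxr's latency note is making informally.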
Benchmarks
The third-party numbers point in the same direction as the early usage reports, but with two different stories depending on what you care about: speed-efficient runs versus long-context retrieval.
Hangsiin's KSST result said GPT-5.5 Pro tied GPT-5.4 Pro on score but cut per-sample generation time from roughly 20 to 50 minutes down to about 5 to 10 minutes. That is not a headline score gain, but it is a very large runtime change if the same quality holds outside a saturated test.
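A quick back-of-envelope on those rounded figures shows where a headline speedup range can come from: pairing the low ends and the high ends gives roughly 4-5x, while the extreme pairing stretches from 2x to 10x.

```python
# Rounded per-sample generation times from Hangsiin's KSST report (minutes).
old_min, old_max = 20, 50   # GPT-5.4 Pro
new_min, new_max = 5, 10    # GPT-5.5 Pro

# Like-for-like pairing: low with low, high with high.
paired = (old_min / new_min, old_max / new_max)   # (4.0, 5.0)

# Extreme pairing bounds how much any single sample could have sped up.
bounds = (old_min / new_max, old_max / new_min)   # (2.0, 10.0)

print(paired, bounds)
```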
Dillon Uzar's Context Arena results measured the biggest clean delta in the evidence set:
- GDM-MRCRv2 AUC at 128k, xhigh tier: 80.6% to 91.7%, +11.1 points.
- Best 128k average: 86.8% to 94.5%, +7.7 points.
- 128k gap versus Claude Opus 4.6 medium: 91.7% versus 81.0%, +10.7 points for GPT-5.5 xhigh.
- 512k pointwise, best tier: GPT-5.5 medium at 57.6% versus GPT-5.4 xhigh at 41.5%, +16.1 points.
That same Context Arena post also surfaced an important caveat: GPT-5.5's reasoning tiers nearly collapsed together at 512k, which Dillon Uzar read as a sign that base long-context ability, not extra reasoning effort, becomes the bottleneck at that length.
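The point deltas above are plain subtractions; a short check against the scores as reported (all numbers copied from Dillon Uzar's post) confirms they are internally consistent:

```python
# (metric, older score, GPT-5.5 score, reported delta) from the Context Arena post.
rows = [
    ("GDM-MRCRv2 AUC @128k, xhigh", 80.6, 91.7, 11.1),
    ("Best 128k average",           86.8, 94.5,  7.7),
    ("128k vs Opus 4.6 medium",     81.0, 91.7, 10.7),
    ("512k pointwise, best tier",   41.5, 57.6, 16.1),
]
for name, old, new, delta in rows:
    assert round(new - old, 1) == delta, name
print("all deltas check out")
```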
Two more external boards moved quickly. Wes Roth's Terminal-Bench post said GPT-5.5 hit a state-of-the-art 82.7 on Terminal-Bench, while Lech Mazur's Extended NYT Connections update showed GPT-5.5 overtaking Opus 4.6 and improving over GPT-5.4 across every listed reasoning tier on that benchmark.
Ecosystem rollouts
The rollout story was unusually immediate. GPT-5.5 showed up in both commercial IDEs and open research harnesses on launch day.
Cursor's GPT-5.5 availability post announced same-day support in Cursor and attached a 72.8% CursorBench score. Lewis Tunstall's ml-intern update said Hugging Face added GPT-5.5 to ml-intern, giving it access to internal buckets, jobs, and repos for agentic research work at scale.
The linked ml-intern PR matters because it shows where some of these first impressions are coming from. Engineers were not just chatting with the model in a playground; they were slotting it into existing tool stacks and multi-agent workflows within hours.
Prompt migration
OpenAI's own docs suggest the jump from GPT-5.4 to GPT-5.5 is not supposed to be invisible.
Simon Willison's GPT-5.5 prompting-guide post
According to Simon Willison's GPT-5.5 prompting-guide post, OpenAI tells developers to treat GPT-5.5 as a new family to tune for, start with the smallest prompt that preserves the product contract, and adjust reasoning effort, verbosity, tool descriptions, and output format from there. Willison also highlighted one concrete UX trick from the guide: before a multi-step task makes tool calls, send a short user-visible progress update so the model does not look frozen.
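As summarized, that advice maps onto request knobs the GPT-5 API line already exposes. A minimal fresh-baseline sketch, assuming the same Responses-API request shape carries over; the model id and knob values here are illustrative, not verified against GPT-5.5's docs:

```python
def baseline_request(task: str) -> dict:
    """Build a fresh-baseline request per the guide's advice: the smallest
    prompt that preserves the product contract, with effort and verbosity
    retuned rather than copied forward from a GPT-5.4 prompt stack."""
    return {
        "model": "gpt-5.5",                 # assumed model id
        "input": task,                      # minimal prompt, no legacy scaffolding
        "reasoning": {"effort": "medium"},  # retune per task instead of inheriting
        "text": {"verbosity": "low"},       # make output length an explicit choice
    }

req = baseline_request("Fix the failing unit test and explain the change.")
print(req["model"], req["reasoning"]["effort"])
```

The point of starting this small is diagnostic: if behavior regresses versus the old GPT-5.4 stack, each knob you add back identifies what the old prompt was actually doing.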
Simon Willison's note on Romain Huet
A second OpenAI detail surfaced through Simon Willison's note on Romain Huet: since GPT-5.4, Codex and the main model have been unified into one system, and GPT-5.5 extends that line rather than reviving a separate coding model. That helps explain why so many day-one reports were really harness reports, not just model reports. The coding stack and the general model stack are now the same release surface.
Willison's Huet quote and his prompting-guide summary also sharpen the pricing question raised in Context Arena's pricing note. GPT-5.5 may cost 2x more per token than GPT-5.4, but OpenAI's own docs and the early user reports both frame the release around shorter visible runs, less dead air before action, and better behavior inside tool-heavy harnesses.