GPT-5.5 vs Opus 4.7: users compare plan mode, frontend output, and 120K-context use
User posts and HN threads compared GPT-5.5 and Opus 4.7 across plan mode, frontend work, and 120K-context sessions. The split results mean token burn and instruction discipline matter as much as raw benchmark scores.

TL;DR
- bridgemindai's Codex screenshot and steipete's /goal post are the clearest bull case for GPT-5.5 in coding: plan-first runs that generate assumptions, test plans, and long refactors before touching code.
- arena's new frontend category rollout, arena's image-to-webdev leaderboard, and Yuchenj_UW's frontend gripe all point the same way on UI work: Opus 4.7 still owns the prettier frontend outputs, while GPT-5.5 is catching up but not leading.
- On bigger codebase tasks, testingcatalog's SWE Atlas summary and TheTuringPost's benchmark recap put Claude Code with Opus 4.7 on top for refactoring, but nrehiew_'s LongCodeEdit run shows GPT-5.5, Opus 4.7, and Opus 4.6 clustering together by 512K context.
- The sharpest complaints about GPT-5.5 were not about raw intelligence but harness behavior: badlogicgames and badlogicgames' follow-up said it skimmed only the first chunk of long files, while badlogicgames' 120K-token thread said review comments still missed lifecycle breakage.
- Opus 4.7 shipped its own gotchas. The main HN thread surfaced xhigh effort defaults, tokenizer overhead, and changed reasoning-display behavior, while aibuilderclub_'s settings thread claimed Claude Code users could recover reliability by disabling adaptive thinking and restoring visible summaries.
You can read OpenAI's GPT-5.5 launch post and Anthropic's Opus 4.7 announcement, browse Scale's SWE Atlas Refactoring leaderboard, and dig through Arena's new frontend categories. The weird bits are in the margins: nrehiew_ said Opus 4.7 cost about 20% more than 4.6 on long-context tests, a Reddit user in r/ClaudeAI said German prompts could burn an entire Opus 4.7 session in seconds, and ClaudeCodeLog's changelog thread showed Anthropic shipping a fix because plan mode was not always blocking file writes.
Plan mode
The strongest GPT-5.5 praise was not benchmark talk. It was workflow talk.
In bridgemindai's example, Codex with GPT-5.5 xhigh writes a test plan, lists assumptions, then waits for the explicit "implement the plan" handoff. That lines up with OpenAI's launch post, which framed GPT-5.5 around complex tool-using work in ChatGPT and Codex before API rollout.
The /goal posts make the same point from a different angle. pvncher's RepoPrompt screenshot shows Codex acting like a coordinator that decomposes a feature, dispatches a narrow implementation run, and then waits on validation. steipete went further, saying /goal + GPT-5.5 made extensive refactors with end-to-end tests "just work," complete with a 17-hour-59-minute "Goal achieved" screenshot.
The catch is cost and latency. An Andrew Chen repost via TheRealAdamG called /goal likely to increase token use by "10000x," and the GPT-5.5 HN thread highlighted the same issue from the other side, with commenters arguing that per-token price is a poor proxy for per-task cost once the harness starts planning, retrying, and using tools heavily.
Frontend
If the task is code that has to look good, the crowd was much less split.
Arena's new domain-specific frontend views, built from more than 250,000 prompts and broken into seven categories, put Anthropic models in the top four across all seven buckets, according to arena's rollout thread. The attached leaderboard shows claude-opus-4-7-thinking at #1 and claude-opus-4-7 at #2 in the aggregate frontend view, ahead of GPT entries.
On visual-input web tasks, arena's image-to-webdev update said half the top 10 turned over in a month, but the headline still favored Anthropic. Opus 4.7 Thinking landed at #1, standard Opus 4.7 at #3, while GPT-5.5 showed up at #6 and #8.
The qualitative posts matched the leaderboard shape. Yuchenj_UW said Opus 4.7 has a recognizable "Anthropic flavor" in HTML design, while GPT-5.5 still feels weak at frontend taste. willdepue's longer comparison made a similar argument in prose, saying Claude remains better at explanations and stylistic judgment even after GPT-5.5 closed a large gap.
That does not make GPT-5.5 absent from frontend-adjacent workflows. ComfyUI's integration post positioned GPT-5.5 around structured output and reliable one-shot node behavior, and arena's GPT-5.5 Instant rankings showed the cheaper instant variant landing mid-pack across text, vision, occupational, and document leaderboards.
Refactoring
The benchmark split is where the story gets more interesting.
Scale's new SWE Atlas Refactoring benchmark is explicitly about restructuring code, not just fixing a bug. According to TheTuringPost's breakdown, the 70 tasks span decomposition, interface evolution, extraction, and relocation, with solutions requiring about 2x more changed lines and 1.7x more edited files than SWE-Bench Pro. On that board, testingcatalog said Claude Code with Opus 4.7 led Codex with GPT-5.5, followed by GPT-5.4 and GPT-5.3.
Long-context editing looked flatter. nrehiew_ ran LongCodeEdit out to 512K tokens and reported that Opus 4.6, Opus 4.7, and GPT-5.5 had similar overall performance, with Opus 4.6 slightly ahead. The attached caveat in nrehiew_'s follow-up matters: each context bin had only 20 problems, and difficulty was not normalized across bins.
A few other eval fragments fill in the edges:
- haider1's Blueprint-Bench post said GPT-5.5 led a spatial reasoning benchmark for turning apartment photos into 2D floor plans, ahead of GPT-5.4, Gemini 3.1 Pro, and Opus 4.7.
- WesRoth's IFBench chart showed Grok 4.3 ahead on instruction following, with GPT-5.5 above Opus 4.7 in that specific leaderboard.
- rohanpaul_ai's ARC-AGI-3 repost said both GPT-5.5 and Opus 4.7 were below 1% on ARC-AGI-3, which is a useful reminder that neither model is reliably strong on every long-horizon reasoning frame.
Harness failures
The loudest anti-GPT-5.5 posts were about agent behavior inside real harnesses, not first-party launch claims.
OpenAI's announcement sold GPT-5.5 as faster and more capable for coding, research, and data analysis. The complaints from heavy users were narrower and uglier. badlogicgames said GPT-5.5 was trained to refuse reading full files, and their follow-up described a tool-call pattern that read only the first 100 lines of a 700-line file.
The same author said in a later post that they were going back to GPT-5.3 Codex because "5.5 doesn't read stuff anymore," both in pi and in the Codex CLI. Their 120K-token thread adds a second failure mode: the agent could absorb a large context window and still leave important review comments unresolved, enough that tests passed while lifecycle behavior stayed broken.
Other hands-on reports were more mixed than outright negative. BEBischof called GPT-5.5 a regression from 5.4 for real tasks and worse than Opus 4.7 as a thought partner, mainly because of weak context use and repeated mistakes across harnesses. But a repost from rezoundous via TheRealAdamG said they trusted GPT-5.5 more than Opus 4.7, a repost from Mahmoud Ashraf via steipete said the difference after switching OpenClaw to GPT-5.5 was visible, and willdepue argued the bull case for 5.5 is that Opus models sometimes feel like they barely think at all.
That leaves the cleanest reading as a harness story. GPT-5.5 looked strongest when the scaffold forced planning, decomposition, and explicit review, and weakest when the agent decided to be clever about what not to read.
Claude Code knobs
Opus 4.7 brought its own bundle of behavior changes, and the most concrete ones were hidden in threads, comments, and changelogs rather than the headline launch copy.
Anthropic's official announcement said Opus 4.7 kept Opus 4.6 pricing at $5 per million input tokens and $25 per million output tokens, added 2576 px image support, exposed a 1M token context window, and introduced a new xhigh effort level. The HN discussion in the main thread focused on the buried operational details instead: xhigh becoming the Claude Code default, tokenizer overhead in the 1.0 to 1.35x range, and reasoning summaries no longer showing up by default unless the output display is configured explicitly.
The user reports around limits and spend fit that framing. pvncher estimated the Pro plan delivered about 10 to 12 minutes of Opus 4.7 at xhigh per five hours, while a Reddit user in r/ClaudeAI said a German stock-analysis prompt that used 37% of an English session could consume 100% of a German session almost immediately.
Then there is the adaptive-thinking thread. aibuilderclub_ claimed Claude Code had been hallucinating API versions, package names, and commit SHAs on turns where it emitted zero reasoning tokens, and their follow-up attributed that pattern to adaptive thinking choosing not to think on familiar-looking prompts. The proposed fixes were specific:
- disable adaptive thinking with CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
- set effort to high with CLAUDE_CODE_EFFORT_LEVEL=high
- restore visible summaries with showThinkingSummaries: true
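For anyone who wants to try that recipe, here is a minimal sketch. The environment variable names and the settings key come from aibuilderclub_'s thread rather than Anthropic's documentation, and the settings.json location is an assumption, so treat all of it as unverified.

```sh
# Sketch of the workaround described in aibuilderclub_'s thread; the variable
# names and the settings key come from that post, not from official docs.
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1   # stop adaptive thinking from skipping reasoning turns
export CLAUDE_CODE_EFFORT_LEVEL=high             # pin the effort level explicitly
# In ~/.claude/settings.json (location assumed), also set:
#   "showThinkingSummaries": true
claude   # launch Claude Code with the overrides in place
```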
The changelog in ClaudeCodeLog's thread adds one more concrete detail to that picture: Claude Code 2.1.136 fixed plan mode not blocking file writes when a matching Edit(...) allow rule existed, alongside OAuth refresh bugs and new settings.autoMode.hard_deny behavior. For a story dominated by model comparisons, that is a useful reminder that some of the weirdest "model" behavior is still harness behavior leaking through the UI.
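If you want to know whether that fix matters for your setup, a crude check is sketched below. The settings paths assume Claude Code's usual project and user locations, and the grep pattern is a rough stand-in, since the changelog only mentions "a matching Edit(...) allow rule" without naming a specific pattern.

```sh
# Rough check, sketch only: look for Edit allow rules in the usual Claude Code
# settings locations, then confirm the installed version includes the 2.1.136
# plan-mode fix. The exact rule pattern from the changelog is not specified,
# so this just greps for any Edit( entry.
grep -n '"Edit(' .claude/settings.json ~/.claude/settings.json 2>/dev/null
claude --version   # expect 2.1.136 or later if you rely on plan mode blocking writes
```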