ARC Prize reports GPT-5.5 at 0.43% and Opus 4.7 at 0.18% on ARC-AGI-3
ARC Prize published frontier-model results on ARC-AGI-3 and said GPT-5.5 and Opus 4.7 both stayed below 1%, with failures in world modeling, abstraction, and reward reinforcement. The result suggests that models with strong coding and benchmark records still break on novel interactive reasoning tasks, and follow-up comparisons even had Opus 4.6 slightly ahead of 4.7.

TL;DR
- In arcprize's main results thread, ARC Prize put GPT-5.5 at 0.43% and Opus 4.7 at 0.18% on ARC-AGI-3's semi-private set, which keeps both frontier models below 1% on a benchmark built around novel interactive environments.
- According to arcprize's blog link post and the official analysis, the team's main finding was not the raw score but three recurring failures: weak world models, bad abstraction transfer from training data, and poor reward reinforcement across levels.
- arcprize's model comparison post argued the two models broke differently: Opus 4.7 compressed observations into confident but wrong theories, while GPT-5.5 generated broader hypotheses and often failed to compress them into a plan at all.
- In AiBattle_'s comparison chart, Opus 4.6 scored 0.5% and Opus 4.7 scored 0.2%, which means the newer Opus result landed below its predecessor on this benchmark.
You can read the full writeup, browse the ARC-AGI-3 overview, and inspect public replays from the evidence thread. arcprize's public demo link post points to a playable task browser, and fchollet's follow-up explicitly framed the blog post as the part worth reading, not just the sub-1% score screenshot.
Scores
ARC Prize says ARC-AGI-3 is built to test exploration, world-model formation, hypothesis testing, recovery from wrong assumptions, and transfer across levels in 135 hand-crafted environments, rather than static puzzle solving on pre-described tasks. The benchmark overview defines 100% as beating every game as efficiently as humans.
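None of the released harness code is reproduced in this writeup, but the benchmark's interactive shape is easy to sketch. A minimal loop, assuming invented interfaces (env.reset, env.step, agent.choose, and agent.update_theory are hypothetical names, not the real ARC-AGI-3 agent API):

```python
# Hypothetical sketch of an interactive-evaluation loop. Every interface name
# here is invented; the actual ARC-AGI-3 agent API is not reproduced.

def play_level(env, agent, max_steps: int = 200) -> bool:
    """No task description is given up front: the agent must act, observe,
    and revise a working theory of the game online."""
    theory: dict = {}                  # the agent's current world-model guesses
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.choose(obs, theory)                 # hypothesis-driven act
        next_obs, level_complete = env.step(action)
        agent.update_theory(theory, obs, action, next_obs)  # revise beliefs
        obs = next_obs
        if level_complete:
            return True                # scoring also weighs efficiency vs humans
    return False
```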
The numbers here are tiny, but the ranking is still awkward. AiBattle_'s chart put Opus 4.6 at 0.5%, GPT-5.5 at 0.4%, and Opus 4.7 at 0.2%, while fchollet's reaction post noted that the latest crop remains below 1% for now.
World models
The dominant failure mode was simple: the models noticed local cause and effect, then failed to turn it into a global rule. In the official analysis, the example is a model learning that ACTION3 rotates an object without inferring that orientation controls which side receives a new value.
That distinction matters because ARC-AGI-3 is interactive. The model has to build a working theory while acting, not just pattern-match to a known answer format.
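A toy contrast makes the gap concrete. Everything below is invented for illustration (the orientation and sides fields are not the actual game state), but it shows why holding only the local fact blocks planning:

```python
# Toy contrast of the gap the analysis describes; all state fields and exact
# mechanics are invented for illustration.

def local_observation(state: dict, action: str) -> dict:
    """What the models reportedly learned: ACTION3 rotates the object."""
    if action == "ACTION3":
        return {**state, "orientation": (state["orientation"] + 90) % 360}
    return state

def global_rule(state: dict, incoming_value: int) -> dict:
    """What they failed to infer: orientation controls which side of the
    object receives the new value."""
    side = {0: "top", 90: "right", 180: "bottom", 270: "left"}[state["orientation"]]
    return {**state, "sides": {**state["sides"], side: incoming_value}}
```

An agent holding only the first function can predict single transitions, but it cannot plan toward states where the value lands on a chosen side; that planning gap is the world-model failure the writeup describes.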
Abstraction traps
ARC Prize says the models repeatedly mapped unfamiliar games onto familiar ones from training data, including Tetris, Frogger, Sokoban, Flood-It, Breakout, and Pong. In the blog post, GPT-5.5 reportedly treated one environment as Breakout instead of a key-combination task, which turned a superficial visual analogy into the wrong action policy.
In fchollet's RL comment, François Chollet, creator of the ARC benchmark family, summarized the pattern more bluntly: reinforcement learning boosts performance in known territory, but in unknown territory the model can hallucinate that it is doing a different task it saw during training.
Compression
The sharpest line in ARC Prize's writeup is the split between wrong compression and failure to compress. In the official analysis, Opus 4.7 is described as better at short-horizon mechanic discovery, but more likely to lock onto a false invariant and execute it aggressively.
GPT-5.5 showed the opposite shape. ARC Prize says it often named the right ingredients, then kept reopening the search space instead of committing to a workable theory, which is how you end up recognizing the mirror effect in a game and still wandering through Tetris, Frogger, Pong, and Tower of Hanoi analogies.
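Read as search behavior, the two shapes are easy to caricature. The sketch below is purely illustrative (the Hypothesis type and confidence values are invented, not drawn from the released traces):

```python
# Caricature of the two failure shapes; nothing here comes from the traces.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    confidence: float

def wrong_compression(hypotheses: list[Hypothesis]) -> Hypothesis:
    """The Opus 4.7 shape per the writeup: lock onto the most confident
    theory early and execute it aggressively, never revisiting it."""
    return max(hypotheses, key=lambda h: h.confidence)

def failure_to_compress(hypotheses: list[Hypothesis]) -> None:
    """The GPT-5.5 shape: keep widening the candidate set with training-set
    analogies instead of collapsing it into one actionable plan."""
    hypotheses += [Hypothesis(a, 0.1) for a in ("Tetris", "Frogger", "Pong")]
    return None  # no plan is ever selected
```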
Analysis package
The most useful part of this release is the instrumentation. ARC Prize says it reviewed 160 replays and reasoning traces, wrote a ground-truth strategy for each game, used Codex and Claude Code to compare traces against those strategies, then validated findings by hand before open-sourcing the analysis package in the official post.
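The released package isn't reproduced here, but the trace-audit step is straightforward to sketch. A minimal version, assuming made-up file schemas (the steps, reasoning, and mechanics fields are placeholders, not the published format):

```python
# Minimal sketch of a trace audit; file schemas below are assumptions, not
# the released analysis package's format.
import json

def audit_trace(trace_path: str, strategy_path: str) -> dict[str, bool]:
    """Flag which ground-truth mechanics a reasoning trace ever mentions."""
    with open(trace_path) as f:
        trace_text = " ".join(step["reasoning"] for step in json.load(f)["steps"])
    with open(strategy_path) as f:
        strategy = json.load(f)   # e.g. {"mechanics": ["rotation", "mirror"]}
    return {m: m.lower() in trace_text.lower() for m in strategy["mechanics"]}
```

Keyword matching this crude would only be a first pass; per the post, the team used Codex and Claude Code for the comparison and then validated the findings by hand.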
There is one caveat buried in the analysis notes: ARC Prize says GPT-5.5 did not return reasoning traces in its default test setup, so the qualitative writeup used a separate analysis-mode run while keeping the official score on the standard harness.
Separately, arcprize's one million scorecards post said the preview launch has already generated more than 1 million ARC-AGI-3 scorecards, which explains why the team is treating replay inspection as a product surface rather than a one-off research artifact.