
ARC-AGI-3 launches interactive benchmark for world-model reasoning

ARC-AGI-3 introduces an interactive reasoning benchmark that measures world-model building and skill acquisition without natural-language instructions. Early discussion focuses on harness results achieved with generic tools and on whether the scoring rewards generalization or benchmark-specific optimization.

TL;DR

  • ARC Prize's ARC-AGI-3 page introduces an interactive benchmark for agents that must explore environments, infer goals, and build world models without natural-language instructions.
  • The benchmark defines a perfect score as matching human play efficiency, and its launch materials frame the task around "skill-acquisition efficiency," sparse-feedback planning, and continual adaptation.
  • Early Hacker News discussion centers on whether ARC-AGI-3 is measuring agent reasoning or harness design, with one highlighted comment describing a run that used only READ, GREP, Bash, and Python to solve the preview games.
  • The other immediate fault line is scoring: ARC Prize says its metric discounts brute force and rewards harder solves, while commenters in the HN thread argue the formula may still conflate generalization with benchmark-specific optimization.

What exactly shipped?

Hacker News · ARC-AGI-3 · 495 upvotes · 360 comments

ARC-AGI-3 is a new interactive reasoning benchmark, not another static puzzle set. Its spec page says agents have to "explore novel environments," "acquire goals on the fly," and learn from interaction instead of following text instructions, with replayable runs, a developer toolkit, and a testing UI available through the project's benchmark page.
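
The spec frames this as an observe-act loop rather than a prompt-response exchange. The sketch below only illustrates that loop under assumed names; Observation, env.reset, env.step, and NUM_ACTIONS are hypothetical stand-ins, not the real developer toolkit's API:

```python
import random
from dataclasses import dataclass

NUM_ACTIONS = 4  # assumption: a small discrete action space

@dataclass
class Observation:
    grid: list[list[int]]  # the visual state the agent can inspect
    done: bool             # whether the current game has ended

class RandomAgent:
    """Baseline with no instructions: the goal has to be inferred
    entirely from how the environment reacts to actions."""
    def act(self, obs: Observation) -> int:
        return random.randrange(NUM_ACTIONS)

def run_episode(env, agent, max_actions: int = 1000) -> int:
    """Play one game to completion; the action count is the raw
    material for any efficiency-based score."""
    obs = env.reset()
    taken = 0
    while not obs.done and taken < max_actions:
        obs = env.step(agent.act(obs))
        taken += 1
    return taken
```

Nothing in the loop tells the agent what the goal is; that is the property the benchmark is built around.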

The benchmark's core claim is that it measures how efficiently an agent picks up new skills over time. In the launch description, a 100% score means beating every game "as efficiently as humans," which makes action count and exploration strategy part of the evaluation rather than just final-task accuracy.
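
The launch materials describe the target rather than publish a formula, but a human-relative efficiency score of roughly this shape would match the description. A hedged sketch only; the per-game human_actions reference and the cap at 1.0 are assumptions, not ARC Prize's published metric:

```python
def game_efficiency(agent_actions: int, human_actions: int) -> float:
    """Hypothetical per-game score: 1.0 when the agent matches or
    beats a typical human action count, decaying as it takes more.
    ARC-AGI-3's actual formula is not given in these materials."""
    if agent_actions <= 0:
        return 0.0
    return min(1.0, human_actions / agent_actions)

def benchmark_score(results: list[tuple[int, int]]) -> float:
    """Mean per-game efficiency over (agent_actions, human_actions)
    pairs; 100% then means matching human play efficiency everywhere."""
    return sum(game_efficiency(a, h) for a, h in results) / len(results)
```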

Why are engineers arguing about the harness and metric already?

The first technical argument is tool access. A comment highlighted in the discussion summary says Opus solved all three preview games in 1,069 actions using only generic READ, GREP, Bash, and Python tools: it wrote its own BFS, built a grid parser, and applied Gaussian elimination to a Lights Out puzzle. Another commenter in the same thread pushed back that giving an agent a path-finding tool is a "crutch" if the benchmark is supposed to reflect raw model capability.
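
The Gaussian-elimination detail is worth unpacking: pressing a Lights Out cell toggles itself and its orthogonal neighbours, so the puzzle is a linear system over GF(2), with XOR playing the role of addition. The generic solver below illustrates that standard technique; it is not the code from the run described in the comment:

```python
import itertools

def lights_out_solution(grid: list[list[int]]) -> list[list[int]] | None:
    """Solve Lights Out by Gaussian elimination over GF(2).
    grid[r][c] == 1 means the light is on; pressing a cell toggles it
    and its in-bounds orthogonal neighbours.  Returns a 0/1 press mask
    that turns every light off, or None if the state is unsolvable."""
    n, m = len(grid), len(grid[0])
    size = n * m
    # One equation per light: the XOR of all presses affecting it must
    # equal its initial state.  Each row is a bitmask of press variables
    # with the target bit stored at position `size` (augmented column).
    rows = []
    for r, c in itertools.product(range(n), range(m)):
        mask = 0
        for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < m:
                mask |= 1 << (rr * m + cc)
        rows.append(mask | grid[r][c] << size)
    # Gauss-Jordan elimination; over GF(2) row operations are XORs.
    pivot_of, row_idx = {}, 0
    for col in range(size):
        pivot = next((i for i in range(row_idx, len(rows))
                      if rows[i] >> col & 1), None)
        if pivot is None:
            continue  # free variable, left as "don't press"
        rows[row_idx], rows[pivot] = rows[pivot], rows[row_idx]
        for i in range(len(rows)):
            if i != row_idx and rows[i] >> col & 1:
                rows[i] ^= rows[row_idx]
        pivot_of[col] = row_idx
        row_idx += 1
    # A leftover row of the form 0 = 1 means the state is unreachable.
    if any(rows[i] for i in range(row_idx, len(rows))):
        return None
    presses = [[0] * m for _ in range(n)]
    for col, i in pivot_of.items():
        presses[col // m][col % m] = rows[i] >> size & 1
    return presses
```

On the classic 5x5 board the solvable states form a proper subspace, so the None branch is genuinely reachable; that is the kind of structure a brute-force search never exposes.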

The second argument is scoring. François Chollet wrote in the HN thread that the metric is meant to "discount brute-force attempts" and reward harder levels, borrowing from SPL-style evaluation in robotics. Another commenter in the same discussion countered that if easy tasks distort the mean, the cleaner fix is changing the task composition rather than relying on a more complex formula.
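
SPL here is Success weighted by Path Length, the embodied-navigation metric from Anderson et al. (2018): a success counts for less the longer the agent's path relative to the shortest one. How ARC Prize adapts it is not spelled out in the thread, but the base metric is short enough to state directly:

```python
def spl(episodes: list[tuple[bool, float, float]]) -> float:
    """Success weighted by Path Length (Anderson et al., 2018).
    Each episode is (success, shortest_path, path_taken); failures
    score 0, and even successes are discounted by any detour."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)
```

Substituting action counts for path lengths gives the flavor of Chollet's description: brute-force play inflates the path taken and drags the score down even when the game is eventually beaten.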

There are already early performance claims around the benchmark, but they are being treated cautiously. One X post said Symbolica's Agentica SDK hit 36.08% "in a single day" and at lower cost than brute force, while also saying the result still needed to be "verified" and acknowledging the ongoing "debate about harnessing."

Further reading

Discussion across the web

Where this story is being discussed, in original context.

On X · 1 thread
Why are engineers arguing about the harness and metric already? · 1 post