ARC-AGI-3 introduced an interactive reasoning benchmark that measures world-model building and skill acquisition without natural-language instructions. Early discussion focuses on a Duke harness result achieved with generic tools, and on whether the scoring rewards generalization or benchmark-specific optimization.

Posted by lairv
ARC-AGI-3 is the first interactive reasoning benchmark designed to measure human-like intelligence in AI agents. It challenges AI to explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously without natural-language instructions. A 100% score means agents beat every game as efficiently as humans. It measures skill-acquisition efficiency over time, long-horizon planning with sparse feedback, and experience-driven adaptation. Features include replayable runs, a developer toolkit, interactive UI for testing, and comprehensive docs.
ARC-AGI-3 is a new interactive reasoning benchmark, not another static puzzle set. Its spec page says agents have to "explore novel environments," "acquire goals on the fly," and learn from interaction instead of following text instructions, with replayable runs, a developer toolkit, and a testing UI available through the project's benchmark page.
The benchmark's core claim is that it measures how efficiently an agent picks up new skills over time. In the launch description, a 100% score means beating every game "as efficiently as humans," which makes action count and exploration strategy part of the evaluation rather than just final-task accuracy.
Posted by lairv
Relevant as a benchmark/eval thread: commenters are debating whether ARC-AGI-3 measures agentic reasoning, how the harness and allowed tools affect results, and whether the scoring metric meaningfully rewards generalization versus benchmark-specific optimization.
The first technical argument is tool access. A highlighted comment reports that Opus solved all three preview games in 1,069 actions using only generic READ, GREP, Bash, and Python tools, writing its own BFS, building a grid parser, and using Gaussian elimination for a Lights Out puzzle. Another commenter in the same thread pushed back that giving an agent a path-finding tool is a "crutch" if the benchmark is supposed to reflect raw model capability.
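The Lights Out trick the comment describes is a standard one: each button press toggles a fixed set of cells, so the puzzle is a linear system over GF(2) that Gaussian elimination solves directly. A minimal sketch (a generic solver under the usual press-toggles-cell-and-orthogonal-neighbors rule, not the agent's actual code):

```python
def solve_lights_out(grid):
    """Find which cells to press so every light turns off, or None if unsolvable.

    grid: n x n list of lists of 0/1 (1 = light on). Pressing (r, c) toggles
    that cell and its orthogonal neighbors, so the puzzle is the linear
    system A x = b over GF(2), solved here by Gaussian elimination on
    bitmask-encoded augmented rows.
    """
    n = len(grid)
    m = n * n
    rows = []
    for r in range(n):
        for c in range(n):
            # Equation for cell (r, c): which presses toggle it (matrix is symmetric).
            mask = 0
            for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    mask |= 1 << (rr * n + cc)
            mask |= grid[r][c] << m  # right-hand side stored in bit m
            rows.append(mask)
    # Gaussian elimination mod 2 (XOR) into reduced row echelon form.
    pivot_row, pivots = 0, []
    for col in range(m):
        sel = next((i for i in range(pivot_row, len(rows)) if rows[i] >> col & 1), None)
        if sel is None:
            continue  # free variable; leave it set to 0
        rows[pivot_row], rows[sel] = rows[sel], rows[pivot_row]
        for i in range(len(rows)):
            if i != pivot_row and rows[i] >> col & 1:
                rows[i] ^= rows[pivot_row]
        pivots.append(col)
        pivot_row += 1
    # A leftover row with only its RHS bit set means the system is inconsistent.
    if any(rows[pivot_row:]):
        return None
    x = [0] * m
    for i, col in enumerate(pivots):
        x[col] = rows[i] >> m & 1
    return [x[i * n:(i + 1) * n] for i in range(n)]
```

Encoding each augmented row as a single Python int keeps the elimination step a one-line XOR, which is why GF(2) systems are so cheap to solve this way.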
The second argument is scoring. François Chollet wrote in the HN thread that the metric is meant to "discount brute-force attempts" and reward harder levels, borrowing from SPL-style robotics evaluation. Another commenter countered that if easy tasks distort the mean, the cleaner fix is changing task composition rather than relying on a more complex formula.
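SPL (Success weighted by Path Length), the robotics metric Chollet references, discounts successes that take more steps than an efficient baseline. A minimal sketch of that formula (the exact ARC-AGI-3 weighting is not spelled out in the thread):

```python
def spl(episodes):
    """Success weighted by Path Length, as used in embodied-navigation evals.

    episodes: list of (success, agent_steps, optimal_steps) tuples.
    A success earns optimal_steps / max(agent_steps, optimal_steps), so a
    brute-force win that burns extra actions scores well below 1.0, and a
    failure scores 0 regardless of how few steps it used.
    """
    total = 0.0
    for success, agent_steps, optimal_steps in episodes:
        if success:
            total += optimal_steps / max(agent_steps, optimal_steps)
    return total / len(episodes)
```

Under this kind of metric, "beat every game as efficiently as humans" translates to matching the reference step count on each episode, not merely finishing.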
There are already early performance claims around the benchmark, but they are still being treated cautiously. One X post said Symbolica's Agentica SDK hit 36.08% "in a single day" and at lower cost than brute force, while also saying the result needed to be "verified" and acknowledging the "debate about harnessing" around the Agentica claim.
Posted by lairv
Thread discussion highlights:

- aogaili on the Duke harness result: With just generic READ/GREP/BASH + Python tools, Opus completed all three preview games in 1,069 actions, wrote its own BFS, built a grid parser, and solved a Lights Out puzzle with Gaussian elimination.
- daveguy on tooling and harness: Pointing out what tools to use is part of the intelligence, and one of the tools is a path-finding algorithm, a crutch compared with a regular LLM that has no such capability.
- famouswaffles on the scoring critique: If easy questions distort the mean, the obvious fix is to reduce the proportion of easy questions, not invent a convoluted scoring method to compensate after the fact.
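The BFS the first highlight mentions is the textbook shortest-path search on a grid; a generic sketch (not the agent's actual code) looks like this:

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path on a 2D grid via breadth-first search.

    grid: list of strings, '#' = wall. start/goal: (row, col) tuples.
    Returns the list of cells from start to goal inclusive, or None if
    the goal is unreachable. BFS explores cells in order of distance,
    so the first time it dequeues the goal the path is optimal.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None
```

Writing this from scratch is trivial for a model with a Python tool, which is exactly why daveguy's "crutch" objection targets the harness rather than the algorithm itself.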