Fresh ARC-AGI-3 discussion centers on how its human-efficiency score mixes completion with time and tool-use efficiency. Critics say the metric can hide different failure modes, even as the benchmark surfaces exploration and planning behavior that static tests miss.

Posted by lairv
ARC-AGI-3 positions itself as the first interactive reasoning benchmark designed to measure human-like intelligence in AI agents. Per the benchmark page, tasks require systems to explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously without natural-language instructions; the benchmark measures skill-acquisition efficiency over time, long-horizon planning, and adaptation, and a 100% score means the agent matches human efficiency. The release ships with replayable runs, an interactive UI, and a developer toolkit with docs for agent integration. That makes it more like an evaluation harness for agent loops than a one-shot reasoning exam.
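To make the agent-loop framing concrete, here is a minimal sketch of what running a policy against an interactive, instruction-free environment looks like. ArcEnv, its reset/step methods, and the action list are hypothetical stand-ins, not the actual ARC-AGI-3 toolkit API.

```python
# Minimal sketch of an interactive-benchmark agent loop.
# ArcEnv, reset(), step(), and ACTIONS are hypothetical stand-ins
# for whatever the real ARC-AGI-3 toolkit exposes.
import random

class ArcEnv:
    """Toy stand-in for an interactive, instruction-free environment."""
    ACTIONS = ["up", "down", "left", "right", "interact"]

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def reset(self) -> dict:
        # The agent gets raw observations only -- no goal text.
        return {"frame": [[0] * 4 for _ in range(4)]}

    def step(self, action: str) -> tuple[dict, bool]:
        done = self.rng.random() < 0.05  # toy success condition
        frame = [[self.rng.randint(0, 3) for _ in range(4)] for _ in range(4)]
        return {"frame": frame}, done

def run_episode(env: ArcEnv, policy, max_steps: int = 200) -> dict:
    """One episode: observe, act, repeat until solved or out of budget."""
    obs = env.reset()
    for t in range(max_steps):
        obs, done = env.step(policy(obs))
        if done:
            return {"solved": True, "steps": t + 1}
    return {"solved": False, "steps": max_steps}

if __name__ == "__main__":
    env = ArcEnv(seed=42)
    result = run_episode(env, policy=lambda obs: random.choice(ArcEnv.ACTIONS))
    print(result)
```

The point is structural: scoring happens over a trajectory (steps until solved), not over a single answer, which is where the efficiency dimension of the metric comes from.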
Today’s comments mostly continued the same debate, with more emphasis on the definition problem. One commenter tried to pin down AGI as an unaided agent solving roughly "90% of the most difficult tasks," while another argued that ARC-style benchmarks still test only "a certain class of games" and don’t justify a broad AGI claim. The fresher thread was a sharper critique of the scoring methodology itself: because completion, speed, and efficiency are compressed into a single human-baselined number, one commenter argued that ARC-AGI-3 "does too much and tells too much," and that a low score can conflate many different failure modes. That said, this was more a continuation of the earlier debate than a new technical revelation.
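The thread doesn't spell out the actual formula, so the sketch below assumes a deliberately simple composite, completion rate times a capped human-relative efficiency ratio, purely to illustrate the conflation argument; both the formula and the numbers are made up, not ARC-AGI-3's real metric.

```python
# Illustrative only: the real ARC-AGI-3 scoring is not specified in the
# thread. Assume composite = completion_rate * efficiency_ratio, where
# efficiency_ratio = human_actions / agent_actions, capped at 1.0.

def composite_score(solved: int, attempted: int,
                    agent_actions: float, human_actions: float) -> float:
    completion = solved / attempted
    efficiency = min(1.0, human_actions / agent_actions)
    return completion * efficiency

# Agent A: solves everything, but takes 2x as many actions as a human.
a = composite_score(solved=10, attempted=10, agent_actions=200, human_actions=100)

# Agent B: solves only half the tasks, but at human efficiency.
b = composite_score(solved=5, attempted=10, agent_actions=100, human_actions=100)

print(a, b)  # both 0.5 -- same score, very different failure modes
```

Under this toy formula an inefficient perfect solver and an efficient partial solver collapse to the same headline number, which is exactly the kind of ambiguity the scoring critique points at.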
ARC-AGI-3 is relevant as an agent benchmark: it tests exploration, tool use, long-horizon planning, and efficiency under a harness that researchers and labs could use to compare systems. The useful engineering question is less “can it solve puzzles?” and more “what does the score actually measure, and how stable is that signal across tool access and scoring choices?”
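One operational reading of the stability question: score the same agent under several tool-access configurations and look at the spread. Everything below is hypothetical (the evaluate() stand-in, the config names, the numbers); it only shows the shape of such an ablation.

```python
from statistics import mean, pstdev

# Hypothetical harness call: a real ablation would run the agent through
# the benchmark with the given tool set and return its score.
def evaluate(tools: set[str]) -> float:
    base = 0.40                      # pretend baseline competence
    return base + 0.05 * len(tools)  # pretend richer tools inflate the score

CONFIGS = [set(), {"scratchpad"}, {"scratchpad", "code_exec"}]
scores = [evaluate(tools) for tools in CONFIGS]
print(f"scores={scores} mean={mean(scores):.2f} spread={pstdev(scores):.2f}")
# A large spread means the headline number tracks tool access as much as
# the agent's underlying competence.
```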
Thread discussion highlights:
- fc417fc802 on the AGI definition: "Approximately that an unaided agent must, with no outside assistance, be able to solve ~90% of the most difficult tasks that we throw at it with a ~90% success rate."
- famouswaffles on the scoring critique: "It should be clear if there are a class of obviously easy questions. And if that's not clear then it makes the scoring even worse... The scoring for 3 is just bad. It does too much and tells too much."
- lukev on benchmark validity: "This measures the ability of a LLM to succeed in a certain class of games... the argument for whether a LLM is 'AGI' or not should not be whether a LLM does well on any given class of games."