AI Primer

ARC-AGI-3 compares agent runs with human-efficiency scoring as HN critiques the metric

Fresh ARC-AGI-3 discussion centers on how its human-efficiency score mixes completion with time and tool-use efficiency. Critics say the metric can hide different failure modes, even when the benchmark still surfaces exploration and planning behavior that static tests miss.


TL;DR

  • ARC Prize frames ARC-AGI-3 as an interactive agent benchmark, not a static puzzle set: agents must explore novel environments, infer goals without natural-language instructions, and build an adaptable world model over time.
  • The engineering hook is its human-efficiency score, which tries to compare agents on skill acquisition, long-horizon planning, and adaptation rather than just final task completion.
  • Fresh Hacker News discussion argues that this score is hard to read because it mixes completion with efficiency against a human baseline, so one number can blur several different failure modes.
  • Even the critics largely focus on interpretation, not on whether the benchmark surfaces useful behaviors; the narrower claim is that ARC-AGI-3 may be more revealing for exploration and tool-using agents than static tests, while still being too narrow to support broad AGI claims.

What does ARC-AGI-3 actually measure?

Hacker News: ARC-AGI-3 (496 upvotes · 364 comments)

ARC-AGI-3 is built to test agent behavior under interaction. The benchmark page says tasks require systems to explore, acquire goals on the fly, and learn continuously without language instructions, with replayable runs, an interactive UI, and developer tooling for integration. That makes it more like an evaluation harness for agent loops than a one-shot reasoning exam.
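The "evaluation harness for agent loops" framing can be sketched with a toy interaction loop. Everything below — the grid environment, the random-exploration agent, the run summary — is a hypothetical stand-in for illustration, not ARC-AGI-3's actual API or task format:

```python
import random

class GridEnvironment:
    """Toy stand-in for an ARC-AGI-3-style environment: no instructions
    are given, so the agent must discover the goal (here: reach cell 9)
    purely by acting and observing outcomes."""
    def __init__(self, goal=9):
        self.state = 0
        self.goal = goal
        self.steps = 0

    def step(self, action):
        self.steps += 1
        # Move left (-1) or right (+1), clamped to cells 0..9.
        self.state = max(0, min(9, self.state + action))
        done = self.state == self.goal
        return self.state, done

def run_agent(env, max_steps=50):
    """Minimal agent loop: explore, observe, track what has been seen.
    A real ARC-AGI-3 run would build a much richer world model and
    would be scored against a human baseline."""
    seen = set()
    for _ in range(max_steps):
        action = random.choice([-1, 1])
        state, done = env.step(action)
        seen.add(state)
        if done:
            return {"solved": True, "steps": env.steps, "states_seen": len(seen)}
    return {"solved": False, "steps": env.steps, "states_seen": len(seen)}

result = run_agent(GridEnvironment())
print(result)
```

The point of the sketch is the shape of the loop: the environment returns only observations, never goal descriptions, so completion, step count, and coverage all become measurable properties of a run rather than of a single answer.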


The debate is over what its headline score means. The real engineering question, per the HN thread, is less whether a model can solve the environments and more what the metric measures under different tool-access and scoring choices. One commenter defined AGI as an unaided agent solving roughly "90% of the most difficult tasks," while another argued ARC-style benchmarks are still "a certain class of games" rather than a general intelligence test. The sharper criticism is aimed at scoring: one commenter said ARC-AGI-3 "does too much and tells too much," because completion, speed, and efficiency are compressed into one human-baselined number.
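The blurring the commenters describe can be illustrated with a made-up composite formula (the actual ARC-AGI-3 scoring rule is not reproduced here; the function and its weights are assumptions for the sake of the example):

```python
def composite_score(completed, total_tasks, agent_actions, human_actions):
    """Illustrative human-baselined score, NOT the real ARC-AGI-3 metric:
    completion rate multiplied by action-efficiency relative to a human
    baseline, capped at 1.0."""
    completion = completed / total_tasks
    efficiency = min(1.0, human_actions / agent_actions)
    return completion * efficiency

# Agent A: solves every task, but needs twice the human's actions.
a = composite_score(completed=10, total_tasks=10,
                    agent_actions=200, human_actions=100)

# Agent B: solves only half the tasks, but at human-level efficiency.
b = composite_score(completed=5, total_tasks=10,
                    agent_actions=100, human_actions=100)

print(a, b)  # both 0.5 — two very different failure modes, one number
```

Under this toy formula an inefficient-but-complete agent and an efficient-but-incomplete one land on the same score, which is exactly the interpretability complaint: the single number cannot tell a reader which failure mode they are looking at.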
