Fresh ARC-AGI-3 discussion centers on how its human-efficiency score mixes completion with time and tool-use efficiency. Critics say the metric can hide different failure modes, even as the benchmark surfaces exploration and planning behavior that static tests miss.

Posted by lairv
ARC-AGI-3 positions itself as the first interactive reasoning benchmark designed to measure human-like intelligence in AI agents. Per the benchmark page, tasks require systems to explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously without natural-language instructions; the benchmark measures skill-acquisition efficiency over time, long-horizon planning, and adaptation, and a 100% score means the agent matches human efficiency. The release ships with replayable runs, an interactive UI, and a developer toolkit with docs for agent integration. That makes it more like an evaluation harness for agent loops than a one-shot reasoning exam.
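To make the agent-loop framing concrete, here is a minimal sketch of what running a policy against an interactive, instruction-free environment looks like. ArcEnv, its reset/step methods, and the action list are hypothetical stand-ins, not the actual ARC-AGI-3 toolkit API.

```python
# Minimal sketch of an interactive-benchmark agent loop.
# ArcEnv, reset(), step(), and ACTIONS are hypothetical stand-ins
# for whatever the real ARC-AGI-3 toolkit exposes.
import random

class ArcEnv:
    """Toy stand-in for an interactive, instruction-free environment."""
    ACTIONS = ["up", "down", "left", "right", "interact"]

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def reset(self) -> dict:
        # The agent gets raw observations only -- no goal text.
        return {"frame": [[0] * 4 for _ in range(4)]}

    def step(self, action: str) -> tuple[dict, bool]:
        done = self.rng.random() < 0.05  # toy success condition
        frame = [[self.rng.randint(0, 3) for _ in range(4)] for _ in range(4)]
        return {"frame": frame}, done

def run_episode(env: ArcEnv, policy, max_steps: int = 200) -> dict:
    """One episode: observe, act, repeat until solved or out of budget."""
    obs = env.reset()
    for t in range(max_steps):
        obs, done = env.step(policy(obs))
        if done:
            return {"solved": True, "steps": t + 1}
    return {"solved": False, "steps": max_steps}

if __name__ == "__main__":
    env = ArcEnv(seed=42)
    result = run_episode(env, policy=lambda obs: random.choice(ArcEnv.ACTIONS))
    print(result)
```

The point is structural: scoring happens over a trajectory (steps until solved), not over a single answer, which is where the efficiency dimension of the metric comes from.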
Today’s comments mostly continued the same debate, with more emphasis on the definition problem. One commenter tried to pin down AGI as an unaided agent solving roughly "90% of the most difficult tasks," while another argued that ARC-style benchmarks still test only "a certain class of games" and don’t justify a broad AGI claim. The fresher thread was a sharper critique of the scoring methodology itself: because completion, speed, and efficiency are compressed into a single human-baselined number, one commenter argued that ARC-AGI-3 "does too much and tells too much," and that a low score can conflate many different failure modes. That said, this was more a continuation of the earlier debate than a new technical revelation.
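The thread doesn't spell out the actual formula, so the sketch below assumes a deliberately simple composite, completion rate times a capped human-relative efficiency ratio, purely to illustrate the conflation argument; both the formula and the numbers are made up, not ARC-AGI-3's real metric.

```python
# Illustrative only: the real ARC-AGI-3 scoring is not specified in the
# thread. Assume composite = completion_rate * efficiency_ratio, where
# efficiency_ratio = human_actions / agent_actions, capped at 1.0.

def composite_score(solved: int, attempted: int,
                    agent_actions: float, human_actions: float) -> float:
    completion = solved / attempted
    efficiency = min(1.0, human_actions / agent_actions)
    return completion * efficiency

# Agent A: solves everything, but takes 2x as many actions as a human.
a = composite_score(solved=10, attempted=10, agent_actions=200, human_actions=100)

# Agent B: solves only half the tasks, but at human efficiency.
b = composite_score(solved=5, attempted=10, agent_actions=100, human_actions=100)

print(a, b)  # both 0.5 -- same score, very different failure modes
```

Under this toy formula an inefficient perfect solver and an efficient partial solver collapse to the same headline number, which is exactly the kind of ambiguity the scoring critique points at.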
ARC-AGI-3 is relevant as an agent benchmark: it tests exploration, tool use, long-horizon planning, and efficiency under a harness that researchers and labs could use to compare systems. The useful engineering question is less “can it solve puzzles?” and more “what does the score actually measure, and how stable is that signal across tool access and scoring choices?”
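One operational reading of the stability question: score the same agent under several tool-access configurations and look at the spread. Everything below is hypothetical (the evaluate() stand-in, the config names, the numbers); it only shows the shape of such an ablation.

```python
from statistics import mean, pstdev

# Hypothetical harness call: a real ablation would run the agent through
# the benchmark with the given tool set and return its score.
def evaluate(tools: set[str]) -> float:
    base = 0.40                      # pretend baseline competence
    return base + 0.05 * len(tools)  # pretend richer tools inflate the score

CONFIGS = [set(), {"scratchpad"}, {"scratchpad", "code_exec"}]
scores = [evaluate(tools) for tools in CONFIGS]
print(f"scores={scores} mean={mean(scores):.2f} spread={pstdev(scores):.2f}")
# A large spread means the headline number tracks tool access as much as
# the agent's underlying competence.
```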
Thread discussion highlights:
- fc417fc802 on the AGI definition: "Approximately that an unaided agent must, with no outside assistance, be able to solve ~90% of the most difficult tasks that we throw at it with a ~90% success rate."
- famouswaffles on the scoring critique: "It should be clear if there are a class of obviously easy questions. And if that's not clear then it makes the scoring even worse... The scoring for 3 is just bad. It does too much and tells too much."
- lukev on benchmark validity: "This measures the ability of a LLM to succeed in a certain class of games... the argument for whether a LLM is 'AGI' or not should not be whether a LLM does well on any given class of games."