ARC-AGI-3 introduced an interactive reasoning benchmark that measures world-model building and skill acquisition without natural-language instructions. Early discussion focuses on a Duke harness result achieved with generic tools, and on whether the scoring rewards generalization or benchmark-specific optimization.

Posted by lairv
ARC-AGI-3 is the first interactive reasoning benchmark designed to measure human-like intelligence in AI agents. It challenges AI to explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously without natural-language instructions. A 100% score means agents beat every game as efficiently as humans. It measures skill-acquisition efficiency over time, long-horizon planning with sparse feedback, and experience-driven adaptation. Features include replayable runs, a developer toolkit, interactive UI for testing, and comprehensive docs.
ARC-AGI-3 is a new interactive reasoning benchmark, not another static puzzle set. Its spec page says agents have to "explore novel environments," "acquire goals on the fly," and learn from interaction instead of following text instructions, with replayable runs, a developer toolkit, and a testing UI available through the project's benchmark page.
The benchmark's core claim is that it measures how efficiently an agent picks up new skills over time. In the launch description, a 100% score means beating every game "as efficiently as humans," which makes action count and exploration strategy part of the evaluation rather than just final-task accuracy.
Posted by lairv
Relevant as a benchmark/eval thread: commenters are debating whether ARC-AGI-3 measures agentic reasoning, how the harness and allowed tools affect results, and whether the scoring metric meaningfully rewards generalization versus benchmark-specific optimization.
The first technical argument is tool access. A highlighted comment reports that Opus solved all three preview games in 1,069 actions using only generic READ, GREP, Bash, and Python tools, writing its own BFS, building a grid parser, and using Gaussian elimination for a Lights Out puzzle. Another commenter in the same thread pushed back that giving an agent a path-finding tool is a "crutch" if the benchmark is supposed to reflect raw model capability.
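The Lights Out trick the comment describes is a standard one: each button press toggles a fixed set of cells, so the puzzle is a linear system over GF(2) that Gaussian elimination solves directly. A minimal sketch (a generic solver under the usual press-toggles-cell-and-orthogonal-neighbors rule, not the agent's actual code):

```python
def solve_lights_out(grid):
    """Find which cells to press so every light turns off, or None if unsolvable.

    grid: n x n list of lists of 0/1 (1 = light on). Pressing (r, c) toggles
    that cell and its orthogonal neighbors, so the puzzle is the linear
    system A x = b over GF(2), solved here by Gaussian elimination on
    bitmask-encoded augmented rows.
    """
    n = len(grid)
    m = n * n
    rows = []
    for r in range(n):
        for c in range(n):
            # Equation for cell (r, c): which presses toggle it (matrix is symmetric).
            mask = 0
            for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    mask |= 1 << (rr * n + cc)
            mask |= grid[r][c] << m  # right-hand side stored in bit m
            rows.append(mask)
    # Gaussian elimination mod 2 (XOR) into reduced row echelon form.
    pivot_row, pivots = 0, []
    for col in range(m):
        sel = next((i for i in range(pivot_row, len(rows)) if rows[i] >> col & 1), None)
        if sel is None:
            continue  # free variable; leave it set to 0
        rows[pivot_row], rows[sel] = rows[sel], rows[pivot_row]
        for i in range(len(rows)):
            if i != pivot_row and rows[i] >> col & 1:
                rows[i] ^= rows[pivot_row]
        pivots.append(col)
        pivot_row += 1
    # A leftover row with only its RHS bit set means the system is inconsistent.
    if any(rows[pivot_row:]):
        return None
    x = [0] * m
    for i, col in enumerate(pivots):
        x[col] = rows[i] >> m & 1
    return [x[i * n:(i + 1) * n] for i in range(n)]
```

Encoding each augmented row as a single Python int keeps the elimination step a one-line XOR, which is why GF(2) systems are so cheap to solve this way.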
The second argument is scoring. François Chollet wrote in the HN thread that the metric is meant to "discount brute-force attempts" and reward harder levels, borrowing from SPL-style robotics evaluation. Another commenter countered that if easy tasks distort the mean, the cleaner fix is changing task composition rather than relying on a more complex formula.
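SPL (Success weighted by Path Length), the robotics metric Chollet references, discounts successes that take more steps than an efficient baseline. A minimal sketch of that formula (the exact ARC-AGI-3 weighting is not spelled out in the thread):

```python
def spl(episodes):
    """Success weighted by Path Length, as used in embodied-navigation evals.

    episodes: list of (success, agent_steps, optimal_steps) tuples.
    A success earns optimal_steps / max(agent_steps, optimal_steps), so a
    brute-force win that burns extra actions scores well below 1.0, and a
    failure scores 0 regardless of how few steps it used.
    """
    total = 0.0
    for success, agent_steps, optimal_steps in episodes:
        if success:
            total += optimal_steps / max(agent_steps, optimal_steps)
    return total / len(episodes)
```

Under this kind of metric, "beat every game as efficiently as humans" translates to matching the reference step count on each episode, not merely finishing.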
There are already early performance claims around the benchmark, but they are still being treated cautiously. One X post said Symbolica's Agentica SDK hit 36.08% "in a single day" and at lower cost than brute force, while also saying the result needed to be "verified" and acknowledging the "debate about harnessing" around the Agentica claim.
Posted by lairv
Thread discussion highlights:

- aogaili on the Duke harness result: With just generic READ/GREP/BASH + Python tools, Opus completed all three preview games in 1,069 actions, wrote its own BFS, built a grid parser, and solved a Lights Out puzzle with Gaussian elimination.
- daveguy on tooling and harness: Pointing out what tools to use is part of the intelligence, and one of the tools is a path-finding algorithm, a crutch compared with a regular LLM that has no such capability.
- famouswaffles on the scoring critique: If easy questions distort the mean, the obvious fix is to reduce the proportion of easy questions, not invent a convoluted scoring method to compensate after the fact.
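The BFS the first highlight mentions is the textbook shortest-path search on a grid; a generic sketch (not the agent's actual code) looks like this:

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path on a 2D grid via breadth-first search.

    grid: list of strings, '#' = wall. start/goal: (row, col) tuples.
    Returns the list of cells from start to goal inclusive, or None if
    the goal is unreachable. BFS explores cells in order of distance,
    so the first time it dequeues the goal the path is optimal.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None
```

Writing this from scratch is trivial for a model with a Python tool, which is exactly why daveguy's "crutch" objection targets the harness rather than the algorithm itself.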