The ATLAS project reports that a frozen Qwen3-14B Q4 model on one RTX 5060 Ti reached 74.6% pass@1-v(k=3) on LiveCodeBench v5 through multi-pass repair and selection. The result shifts the comparison toward harness design, though HN commenters note it is not a one-shot head-to-head with hosted frontier models.

Posted by yogthos
ATLAS is an open-source, self-hosted coding harness, not a new fine-tuned base model. The project page says it runs a frozen Qwen3-14B-Q4_K_M on a single consumer GPU (RTX 5060 Ti 16GB) and reaches 74.6% pass@1-v(k=3) on LiveCodeBench v5 by combining constraint-driven generation, “self-verified iterative refinement,” PlanSearch, and PR-CoT repair, with no fine-tuning or API calls required. The same page reports 47.0% on GPQA Diamond and 14.7% on SciCode. Stars: 1044. Created Feb 2026.
The HN summary frames the core engineering point more narrowly than the viral headline: a frozen mid-sized model can “compete with hosted frontier systems” when wrapped in a multi-pass harness that generates several candidates, tests them, and repairs failures. One HN commenter summarized it as a system that “generates multiple solutions, tests them, and repairs failures over time,” which makes it better suited to asynchronous coding workflows than to interactive streaming output.
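The generate/test/repair loop described above can be sketched as a simple budget-bounded search. This is a minimal illustration, not ATLAS's actual implementation: `generate`, `run_tests`, and `repair` are hypothetical placeholders for the model call, sandboxed test execution, and the PR-CoT repair step.

```python
# Minimal sketch of a multi-pass generate/test/repair harness.
# generate(), run_tests(), and repair() are hypothetical placeholders
# for the model call, sandboxed test execution, and the repair prompt;
# ATLAS's real internals are not reproduced here.

def solve(task, generate, run_tests, repair, n_candidates=3, max_repairs=2):
    """Return the first candidate that passes its tests, repairing failures."""
    for _ in range(n_candidates):
        code = generate(task)
        for _ in range(max_repairs + 1):
            failures = run_tests(task, code)
            if not failures:
                return code  # verified solution found
            # Feed the concrete test failures back to the model.
            code = repair(task, code, failures)
    return None  # no candidate passed within the budget
```

Because success is decided by running tests rather than by a single sampled completion, the loop trades latency for reliability, which is why commenters describe it as an asynchronous rather than interactive workflow.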
Posted by yogthos
Today’s discussion mostly sharpened skepticism about the benchmark methodology rather than adding new evidence about ATLAS itself. Several commenters argued the comparison is not apples-to-apples: ATLAS uses a long-running harness with multi-pass repair and different task handling, while frontier-model leaderboard numbers are typically presented as single-shot results. A second thread questioned whether the internals make sense at all: one commenter argued the “cost field”/difficulty-classifier approach operates on the wrong distribution, while another noted that local-model quality depends heavily on quantization and that Q4 is often the minimum practical floor for coding workloads.
Posted by yogthos
Thread discussion highlights:
- Aurornis on benchmark methodology: ATLAS uses a different LiveCodeBench methodology, with best-of-3, Lens selection, and iterative repair on a frozen 14B model, so it is not a controlled head-to-head against single-shot Claude Sonnet or GPT-style scores.
- xyzzy123 on the cost/difficulty field: The commenter says the learned “cost field” looks like a difficulty classifier over English task descriptions that then gets applied to Python code embeddings, which may not help solve harder problems or distinguish wrong-but-simple from correct-but-complex solutions.
- mongrelion on quantization and VRAM: An AMD GPU owner says Q4 seems like the lowest usable quantization for agentic coding, because going below that starts hurting results; they also point to Hugging Face’s GPU-fit indicators and Qwen3.5 as a strong local model family.
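Aurornis's point about best-of-3 with verification versus single-shot scoring can be made concrete with a toy scoring function. This is a hypothetical illustration of why the two numbers are not comparable, not the exact LiveCodeBench or ATLAS scoring code.

```python
# Toy illustration: best-of-k with verified selection vs. single-shot pass@1.
# `results` is a list of tasks, each a list of booleans (did attempt i pass?).

def pass_at_1_single_shot(results):
    """Standard single-shot pass@1: only the first attempt counts."""
    return sum(r[0] for r in results) / len(results)

def pass_at_1_verified(results, k=3):
    """Best-of-k with verification: a task counts if ANY of k attempts passed
    (the harness can test candidates and pick a passing one)."""
    return sum(any(r[:k]) for r in results) / len(results)

# Two tasks: the first fails on attempt 1 but passes on attempt 2,
# the second never passes.
results = [[False, True, False], [False, False, False]]
print(pass_at_1_single_shot(results))  # 0.0
print(pass_at_1_verified(results))     # 0.5
```

The verified metric is strictly upward-biased relative to single-shot pass@1 whenever any task passes on a retry, which is the core of the "not apples-to-apples" objection.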
That criticism gets specific in the HN discussion: Aurornis says the setup uses “best-of-3, Lens selection, and iterative repair,” so the score should not be read as a controlled head-to-head against single-shot Claude Sonnet numbers. xyzzy123 argues the learned “cost field” looks like a difficulty classifier trained on English task descriptions and then applied to Python code embeddings, which may miss the difference between wrong-but-simple and correct-but-complex solutions. mongrelion adds that for local agentic coding, Q4 is often the practical floor, because heavier quantization degrades results.
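The claim that a Q4 14B model fits a 16 GB card can be checked with back-of-envelope arithmetic. This sketch assumes roughly 4.5 effective bits per weight as an approximation for Q4_K_M (llama.cpp's actual format mixes block scales and sub-groups, so the true figure differs slightly), and it ignores KV cache and activation overhead.

```python
# Rough weight-memory estimate for a quantized model.
# Assumes a flat effective bits/weight figure; ignores KV cache,
# activations, and the exact Q4_K_M block layout.

def quantized_weight_gib(n_params_billion, bits_per_weight):
    """Approximate weight memory in GiB for a quantized model."""
    total_bits = n_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB

# A 14B model at ~4.5 effective bits/weight (roughly Q4_K_M):
print(f"{quantized_weight_gib(14, 4.5):.1f} GiB for weights alone")
```

At roughly 7–8 GiB for weights, the model leaves headroom on a 16 GB RTX 5060 Ti for KV cache and context, which is consistent with the project's single-consumer-GPU claim.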
Posted by yogthos
Useful if you care about local coding agents, benchmark design, and what it takes for a frozen mid-sized model to compete with hosted frontier systems. The thread suggests the main takeaway is not that a $500 GPU beats Claude in a fair one-shot comparison, but that a multi-pass harness can amplify a local model on coding tasks.