ATLAS benchmarks Qwen3-14B at 74.6% on LiveCodeBench using one RTX 5060 Ti
The ATLAS project reports that a frozen Qwen3-14B Q4 model on one RTX 5060 Ti reached 74.6% pass@1-v(k=3) on LiveCodeBench v5 through multi-pass repair and selection. The result shifts the comparison toward harness design, though HN commenters note it is not a one-shot head-to-head with hosted frontier models.

TL;DR
- ATLAS reports that a frozen Qwen3-14B Q4 model on one RTX 5060 Ti 16GB reached 74.6% pass@1-v(k=3) on LiveCodeBench v5, with no fine-tuning or API calls, according to the ATLAS repo and the HN summary.
- The result comes from harness design more than base-model changes: ATLAS stacks constraint-driven generation, self-verified iterative refinement, PlanSearch, and PR-CoT repair around the model, as described on the project page.
- Hacker News discussion says this is not a clean one-shot comparison to Claude Sonnet or GPT-style benchmark scores, because ATLAS uses "best-of-3, Lens selection, and iterative repair" in a longer-running loop (HN discussion).
- The practical takeaway from the thread is that a local 14B model can be amplified for coding tasks by an asynchronous test-and-repair harness, though commenters also questioned parts of the methodology and the learned "cost field" (HN summary, fresh discussion).
What actually drove the score
ATLAS (itigges22/ATLAS, short for Adaptive Test-time Learning and Autonomous Specialization) is a self-hosted coding harness, not a new fine-tuned base model. The project page says it runs Qwen3-14B-Q4_K_M on a single consumer GPU and reaches 74.6% on LiveCodeBench v5 by combining constraint-driven generation, "self-verified iterative refinement," PlanSearch, and PR-CoT repair. The same page reports 47.0% on GPQA Diamond and 14.7% on SciCode.
The HN summary frames the core engineering point more narrowly than the viral headline: a frozen mid-sized model can "compete with hosted frontier systems" when wrapped in a multi-pass harness that generates several candidates, tests them, and repairs failures. One HN commenter summarized it as a system that "generates multiple solutions, tests them, and repairs failures over time," which makes it better suited to asynchronous coding workflows than to interactive streaming output (HN summary).
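That loop is simple to sketch. The snippet below is a minimal, hypothetical illustration of the generate-test-repair pattern the thread describes, not ATLAS's actual code: `generate`, `run_tests`, and `repair` are stand-ins for the model call, a sandboxed test runner, and a PR-CoT-style repair prompt, and the k=3 and repair budgets are placeholders.

```python
# Minimal sketch of a generate-test-repair harness (not ATLAS's actual code).
# `generate`, `run_tests`, and `repair` are hypothetical stand-ins for the
# model call, a sandboxed test runner, and a repair prompt, respectively.
from dataclasses import dataclass

@dataclass
class Attempt:
    code: str
    passed: bool
    failures: list   # failing test cases, fed back into the repair prompt

def solve(task, generate, run_tests, repair, k=3, max_repairs=2):
    """Generate up to k candidates, test each, and repair failures a few times."""
    best = None
    for _ in range(k):
        code = generate(task)                      # draft a candidate solution
        for _ in range(max_repairs + 1):
            passed, failures = run_tests(task, code)
            attempt = Attempt(code, passed, failures)
            if passed:
                return attempt                     # first fully passing candidate wins
            code = repair(task, code, failures)    # ask the model to patch failing tests
        # keep the least-broken candidate as a fallback
        if best is None or len(attempt.failures) < len(best.failures):
            best = attempt
    return best
```

The detail that matters for the benchmark debate is the early return: the harness only needs one candidate that survives its own tests, which is exactly why the score is hard to compare with single-shot numbers.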
Why engineers are pushing back on the comparison
The strongest skepticism is about benchmark comparability, not about whether ATLAS is interesting. In the fresh discussion of the "$500 GPU outperforms Claude Sonnet on coding benchmarks" submission, today's comments mostly sharpened the point that this is "not apples-to-apples": ATLAS uses a long-running harness with multi-pass repair and different task handling, while frontier-model leaderboard numbers are often presented as single-shot results.
That criticism gets specific in the HN discussion: Aurornis says the setup uses "best-of-3, Lens selection, and iterative repair," so the score should not be read as a controlled head-to-head against Claude Sonnet. Another commenter argued that the learned "cost field" looks like a difficulty classifier trained on English task descriptions and then applied to Python code embeddings, which may miss the difference between wrong-simple and correct-complex solutions (HN discussion). A separate thread also notes that for local agentic coding, Q4 is often the practical floor, because heavier quantization can degrade results (HN discussion).
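To make the comparability point concrete, here is an idealized sketch (an illustration for this summary, not ATLAS's or LiveCodeBench's scoring code) that assumes every candidate passes independently with probability p and that the selector always picks a passing candidate when one exists; real selection against public tests is weaker, but the gap shows why best-of-3 with selection and one-shot pass@1 measure different things.

```python
# Idealized comparison of one-shot pass@1 vs. best-of-k with a perfect selector.
# Assumes each candidate independently passes with probability p; real selection
# against public tests is weaker than the perfect selector modeled here.
import random

def one_shot_rate(p: float, n_tasks: int = 10_000) -> float:
    """One attempt per task, no selection or repair."""
    return sum(random.random() < p for _ in range(n_tasks)) / n_tasks

def best_of_k_rate(p: float, k: int = 3, n_tasks: int = 10_000) -> float:
    """A task counts as solved if any of k independent candidates would pass,
    i.e. the selector always surfaces a passing candidate when one exists."""
    return sum(any(random.random() < p for _ in range(k)) for _ in range(n_tasks)) / n_tasks

if __name__ == "__main__":
    # With p = 0.5 per candidate, one-shot hovers near 50%, while an ideal
    # best-of-3 approaches 1 - 0.5**3 = 87.5% on the same underlying model.
    print(f"one-shot:  {one_shot_rate(0.5):.3f}")
    print(f"best-of-3: {best_of_k_rate(0.5, k=3):.3f}")
```

None of this says the 74.6% figure is wrong; it says the figure measures a harness-plus-model system rather than the model alone, which is the thread's central caveat.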