ATLAS benchmarks Qwen3-14B at 74.6% on LiveCodeBench using one RTX 5060 Ti
The ATLAS project reports that a frozen Qwen3-14B Q4 model on one RTX 5060 Ti reached 74.6% pass@1-v(k=3) on LiveCodeBench v5 through multi-pass repair and selection. The result shifts the comparison toward harness design, though HN commenters note it is not a one-shot head-to-head with hosted frontier models.

TL;DR
- ATLAS reports that a frozen Qwen3-14B Q4 model on one RTX 5060 Ti 16GB reached 74.6% pass@1-v(k=3) on LiveCodeBench v5, with no fine-tuning or API calls, according to the ATLAS repo and the HN summary.
- The result comes from harness design more than base-model changes: ATLAS stacks constraint-driven generation, self-verified iterative refinement, PlanSearch, and PR-CoT repair around the model, as described on the project page.
- Hacker News discussion says this is not a clean one-shot comparison to Claude Sonnet or GPT-style benchmark scores, because ATLAS uses "best-of-3, Lens selection, and iterative repair" in a longer-running loop (HN discussion).
- The practical takeaway from the thread is that a local 14B model can be amplified for coding tasks by an asynchronous test-and-repair harness, though commenters also questioned parts of the methodology and the learned "cost field" (HN summary, fresh discussion).
What actually drove the score
ATLAS (itigges22/ATLAS, short for Adaptive Test-time Learning and Autonomous Specialization) is a self-hosted coding harness, not a new fine-tuned base model. The project page says it runs Qwen3-14B-Q4_K_M on a single consumer GPU and reaches 74.6% on LiveCodeBench v5 by combining constraint-driven generation, "self-verified iterative refinement," PlanSearch, and PR-CoT repair. The same page reports 47.0% on GPQA Diamond and 14.7% on SciCode.
The HN summary frames the core engineering point more narrowly than the viral headline: a frozen mid-sized model can "compete with hosted frontier systems" when wrapped in a multi-pass harness that generates several candidates, tests them, and repairs failures. One HN commenter summarized it as a system that "generates multiple solutions, tests them, and repairs failures over time," which makes it better suited to asynchronous coding workflows than to interactive streaming output (HN summary).
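That loop is simple to sketch. The snippet below is a minimal, hypothetical illustration of the generate-test-repair pattern the thread describes, not ATLAS's actual code: `generate`, `run_tests`, and `repair` are stand-ins for the model call, a sandboxed test runner, and a PR-CoT-style repair prompt, and the k=3 and repair budgets are placeholders.

```python
# Minimal sketch of a generate-test-repair harness (not ATLAS's actual code).
# `generate`, `run_tests`, and `repair` are hypothetical stand-ins for the
# model call, a sandboxed test runner, and a repair prompt, respectively.
from dataclasses import dataclass

@dataclass
class Attempt:
    code: str
    passed: bool
    failures: list   # failing test cases, fed back into the repair prompt

def solve(task, generate, run_tests, repair, k=3, max_repairs=2):
    """Generate up to k candidates, test each, and repair failures a few times."""
    best = None
    for _ in range(k):
        code = generate(task)                      # draft a candidate solution
        for _ in range(max_repairs + 1):
            passed, failures = run_tests(task, code)
            attempt = Attempt(code, passed, failures)
            if passed:
                return attempt                     # first fully passing candidate wins
            code = repair(task, code, failures)    # ask the model to patch failing tests
        # keep the least-broken candidate as a fallback
        if best is None or len(attempt.failures) < len(best.failures):
            best = attempt
    return best
```

The detail that matters for the benchmark debate is the early return: the harness only needs one candidate that survives its own tests, which is exactly why the score is hard to compare with single-shot numbers.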
Why engineers are pushing back on the comparison
The strongest skepticism is about benchmark comparability, not about whether ATLAS is interesting. In the fresh discussion of the "$500 GPU outperforms Claude Sonnet on coding benchmarks" submission, today's comments mostly sharpened the point that this is "not apples-to-apples": ATLAS uses a long-running harness with multi-pass repair and different task handling, while frontier-model leaderboard numbers are often presented as single-shot results.
That criticism gets specific in the HN discussion: Aurornis says the setup uses "best-of-3, Lens selection, and iterative repair," so the score should not be read as a controlled head-to-head against Claude Sonnet. Another commenter argued that the learned "cost field" looks like a difficulty classifier trained on English task descriptions and then applied to Python code embeddings, which may miss the difference between wrong-simple and correct-complex solutions (HN discussion). A separate thread also notes that for local agentic coding, Q4 is often the practical floor, because heavier quantization can degrade results (HN discussion).
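To make the comparability point concrete, here is an idealized sketch (an illustration for this summary, not ATLAS's or LiveCodeBench's scoring code) that assumes every candidate passes independently with probability p and that the selector always picks a passing candidate when one exists; real selection against public tests is weaker, but the gap shows why best-of-3 with selection and one-shot pass@1 measure different things.

```python
# Idealized comparison of one-shot pass@1 vs. best-of-k with a perfect selector.
# Assumes each candidate independently passes with probability p; real selection
# against public tests is weaker than the perfect selector modeled here.
import random

def one_shot_rate(p: float, n_tasks: int = 10_000) -> float:
    """One attempt per task, no selection or repair."""
    return sum(random.random() < p for _ in range(n_tasks)) / n_tasks

def best_of_k_rate(p: float, k: int = 3, n_tasks: int = 10_000) -> float:
    """A task counts as solved if any of k independent candidates would pass,
    i.e. the selector always surfaces a passing candidate when one exists."""
    return sum(any(random.random() < p for _ in range(k)) for _ in range(n_tasks)) / n_tasks

if __name__ == "__main__":
    # With p = 0.5 per candidate, one-shot hovers near 50%, while an ideal
    # best-of-3 approaches 1 - 0.5**3 = 87.5% on the same underlying model.
    print(f"one-shot:  {one_shot_rate(0.5):.3f}")
    print(f"best-of-3: {best_of_k_rate(0.5, k=3):.3f}")
```

None of this says the 74.6% figure is wrong; it says the figure measures a harness-plus-model system rather than the model alone, which is the thread's central caveat.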