The ATLAS project reports that a frozen Qwen3-14B Q4 model on one RTX 5060 Ti reached 74.6% pass@1-v(k=3) on LiveCodeBench v5 through multi-pass repair and selection. The result shifts the comparison toward harness design, though HN commenters note it is not a one-shot head-to-head with hosted frontier models.

Posted by yogthos
ATLAS is an open-source, self-hosted coding harness, not a new fine-tuned base model. The project page says it runs a frozen Qwen3-14B-Q4_K_M on a single consumer GPU (RTX 5060 Ti 16GB) and reaches 74.6% pass@1-v(k=3) on LiveCodeBench v5 by combining constraint-driven generation, “self-verified iterative refinement,” PlanSearch, and PR-CoT repair, with no fine-tuning or API calls required. The same page reports 47.0% on GPQA Diamond and 14.7% on SciCode. Stars: 1044. Created Feb 2026.
The HN summary frames the core engineering point more narrowly than the viral headline: a frozen mid-sized model can “compete with hosted frontier systems” when wrapped in a multi-pass harness that generates several candidates, tests them, and repairs failures. One HN commenter summarized it as a system that “generates multiple solutions, tests them, and repairs failures over time,” which makes it better suited to asynchronous coding workflows than to interactive streaming output.
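The generate/test/repair loop described above can be sketched as a simple budget-bounded search. This is a minimal illustration, not ATLAS's actual implementation: `generate`, `run_tests`, and `repair` are hypothetical placeholders for the model call, sandboxed test execution, and the PR-CoT repair step.

```python
# Minimal sketch of a multi-pass generate/test/repair harness.
# generate(), run_tests(), and repair() are hypothetical placeholders
# for the model call, sandboxed test execution, and the repair prompt;
# ATLAS's real internals are not reproduced here.

def solve(task, generate, run_tests, repair, n_candidates=3, max_repairs=2):
    """Return the first candidate that passes its tests, repairing failures."""
    for _ in range(n_candidates):
        code = generate(task)
        for _ in range(max_repairs + 1):
            failures = run_tests(task, code)
            if not failures:
                return code  # verified solution found
            # Feed the concrete test failures back to the model.
            code = repair(task, code, failures)
    return None  # no candidate passed within the budget
```

Because success is decided by running tests rather than by a single sampled completion, the loop trades latency for reliability, which is why commenters describe it as an asynchronous rather than interactive workflow.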
Posted by yogthos
Today’s discussion mostly sharpened skepticism about the benchmark methodology rather than adding new evidence about ATLAS itself. Several commenters argued the comparison is not apples-to-apples: ATLAS uses a long-running harness with multi-pass repair and different task handling, while frontier-model leaderboard numbers are typically presented as single-shot results. A second thread questioned whether the internals make sense at all: one commenter argued the “cost field”/difficulty-classifier approach operates on the wrong distribution, while another noted that local-model quality depends heavily on quantization and that Q4 is often the minimum practical floor for coding workloads.
Posted by yogthos
Thread discussion highlights:
- Aurornis on benchmark methodology: ATLAS uses a different LiveCodeBench methodology, with best-of-3, Lens selection, and iterative repair on a frozen 14B model, so it is not a controlled head-to-head against single-shot Claude Sonnet or GPT-style scores.
- xyzzy123 on the cost/difficulty field: The commenter says the learned “cost field” looks like a difficulty classifier over English task descriptions that then gets applied to Python code embeddings, which may not help solve harder problems or distinguish wrong-but-simple from correct-but-complex solutions.
- mongrelion on quantization and VRAM: An AMD GPU owner says Q4 seems like the lowest usable quantization for agentic coding, because going below that starts hurting results; they also point to Hugging Face’s GPU-fit indicators and Qwen3.5 as a strong local model family.
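Aurornis's point about best-of-3 with verification versus single-shot scoring can be made concrete with a toy scoring function. This is a hypothetical illustration of why the two numbers are not comparable, not the exact LiveCodeBench or ATLAS scoring code.

```python
# Toy illustration: best-of-k with verified selection vs. single-shot pass@1.
# `results` is a list of tasks, each a list of booleans (did attempt i pass?).

def pass_at_1_single_shot(results):
    """Standard single-shot pass@1: only the first attempt counts."""
    return sum(r[0] for r in results) / len(results)

def pass_at_1_verified(results, k=3):
    """Best-of-k with verification: a task counts if ANY of k attempts passed
    (the harness can test candidates and pick a passing one)."""
    return sum(any(r[:k]) for r in results) / len(results)

# Two tasks: the first fails on attempt 1 but passes on attempt 2,
# the second never passes.
results = [[False, True, False], [False, False, False]]
print(pass_at_1_single_shot(results))  # 0.0
print(pass_at_1_verified(results))     # 0.5
```

The verified metric is strictly upward-biased relative to single-shot pass@1 whenever any task passes on a retry, which is the core of the "not apples-to-apples" objection.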
That criticism gets specific in the HN discussion: Aurornis says the setup uses “best-of-3, Lens selection, and iterative repair,” so the score should not be read as a controlled head-to-head against single-shot Claude Sonnet numbers. xyzzy123 argues the learned “cost field” looks like a difficulty classifier trained on English task descriptions and then applied to Python code embeddings, which may miss the difference between wrong-but-simple and correct-but-complex solutions. mongrelion adds that for local agentic coding, Q4 is often the practical floor, because heavier quantization degrades results.
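The claim that a Q4 14B model fits a 16 GB card can be checked with back-of-envelope arithmetic. This sketch assumes roughly 4.5 effective bits per weight as an approximation for Q4_K_M (llama.cpp's actual format mixes block scales and sub-groups, so the true figure differs slightly), and it ignores KV cache and activation overhead.

```python
# Rough weight-memory estimate for a quantized model.
# Assumes a flat effective bits/weight figure; ignores KV cache,
# activations, and the exact Q4_K_M block layout.

def quantized_weight_gib(n_params_billion, bits_per_weight):
    """Approximate weight memory in GiB for a quantized model."""
    total_bits = n_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB

# A 14B model at ~4.5 effective bits/weight (roughly Q4_K_M):
print(f"{quantized_weight_gib(14, 4.5):.1f} GiB for weights alone")
```

At roughly 7–8 GiB for weights, the model leaves headroom on a 16 GB RTX 5060 Ti for KV cache and context, which is consistent with the project's single-consumer-GPU claim.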
Posted by yogthos
Useful if you care about local coding agents, benchmark design, and what it takes for a frozen mid-sized model to compete with hosted frontier systems. The thread suggests the main takeaway is not that a $500 GPU beats Claude in a fair one-shot comparison, but that a multi-pass harness can amplify a local model on coding tasks.