AI Primer
release

llama.cpp adds MTP for Qwen3.6 and pushes local 27B decode to 70-160 tok/s

llama.cpp added multi-token prediction for the Qwen3.6 family, and Unsloth published MTP GGUFs claiming about 1.4-2.2x faster local generation. The update moves Qwen3.6 closer to daily-driver speeds on commodity hardware, though results still vary by ROCm build and quant.

TL;DR

The llama.cpp codebase, Unsloth's MTP GGUFs, and the Qwen3.6 MTP setup guide are all public, so the setup can be reproduced directly. The fun part is how quickly reports arrived: UnslothAI's demo clip showed nearly 100 tok/s on a 4-bit H100 run, one LocalLLaMA setup claimed full-context Q8 inference across four older RTX A4000s, and another LocalLLaMA thread turned into a very specific ROCm debugging session about why flash attention was crashing on gfx1030 and gfx1031.

MTP lands in llama.cpp

The core news is small and consequential. llama.cpp now supports multi-token prediction (MTP) for Qwen3.6, which lets the runtime predict several upcoming tokens per step instead of generating strictly one token at a time.
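To make the idea concrete, here is a minimal sketch of the general draft-and-verify pattern that speculative schemes rely on. It is not llama.cpp's implementation: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, the "models" are toy functions over integers, and the sketch assumes greedy decoding. In MTP the draft comes from prediction heads shipped with the model rather than from a separate function as here.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], draft_next: NextFn,
                     target_next: NextFn, n_draft: int) -> List[Token]:
    """Run one draft-and-verify step and return the tokens it commits."""
    # 1. Cheaply propose n_draft tokens, one after another.
    proposed: List[Token] = []
    for _ in range(n_draft):
        proposed.append(draft_next(context + proposed))

    # 2. Verify: keep proposals only while they match the target's own choice.
    #    A real runtime would score all positions in one batched pass; this
    #    loop calls the target per position only to keep the sketch readable.
    committed: List[Token] = []
    for tok in proposed:
        expected = target_next(context + committed)
        if tok != expected:
            committed.append(expected)  # first mismatch: use the target's token, stop
            return committed
        committed.append(tok)

    # 3. Every proposal matched, so one extra target token is committed too.
    committed.append(target_next(context + committed))
    return committed


def generate(context: List[Token], draft_next: NextFn, target_next: NextFn,
             n_draft: int, n_tokens: int) -> Tuple[List[Token], int]:
    """Generate n_tokens and report how many verify steps were needed."""
    out = list(context)
    steps = 0
    while len(out) - len(context) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, n_draft))
        steps += 1
    return out[: len(context) + n_tokens], steps


# Toy stand-ins (assumptions for illustration only): the target counts
# upward modulo 10, and the draft agrees except right after a 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, steps = generate([0], toy_draft, toy_target, n_draft=3, n_tokens=12)
print(tokens, steps)
```

Because every committed token is one the target would have produced anyway, the output matches plain one-token-at-a-time decoding; the gain comes only from needing fewer verify steps than tokens.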

Unsloth immediately shipped matching GGUFs and positioned the gain in plain terms: about 1.4 to 2.2x faster generation with, by Unsloth's account, no accuracy change, plus a 27B variant that can run locally in about 18 GB of RAM. For people who run inference locally this is a big deal, because it moves Qwen3.6 closer to the point where "daily driver" stops sounding aspirational and starts sounding normal.
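A rough way to reason about a multiplier like 1.4 to 2.2x is the standard idealized estimate for speculative decoding: if each drafted token is accepted independently with probability a, and n tokens are drafted per step, the expected number of tokens committed per verify pass is (1 - a^(n+1)) / (1 - a). The sketch below computes that. The function name is hypothetical, the independence assumption is a simplification, and the result is an upper bound on speedup because it ignores the cost of drafting and of the wider verify pass. It does not describe how Unsloth measured its figures.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens committed per verify pass, assuming each drafted
    token is accepted independently with probability accept_rate."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    if accept_rate == 1.0:
        # Every draft accepted, plus the one extra token from the verify pass.
        return float(n_draft + 1)
    # Closed form of the geometric series 1 + a + a^2 + ... + a^n.
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


# Illustrative acceptance rates (assumed, not measured), with 4 drafted tokens.
for rate in (0.3, 0.5, 0.8):
    print(rate, round(expected_tokens_per_step(rate, 4), 3))
```

Under this model the multiplier depends heavily on how often drafts are accepted, which is one reason the same model can post very different numbers on different prompts.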

The numbers people are posting

The early performance claims break into a few distinct buckets:

  • UnslothAI's release thread said Qwen3.6-27B MTP runs at 160 tok/s.
  • The same thread said Qwen3.6-35B-A3B reaches 240 tok/s.
  • In UnslothAI's follow-up demo, a 4-bit Qwen3.6-27B-MTP-GGUF hit 96.4 tok/s in Unsloth Studio on an H100.
  • A LocalLLaMA four-GPU report said Qwen 3.6 27B Q8 with --spec-draft-n-max 4 ran at about 45 tok/s for reasoning and in the 60s for coding on four RTX A4000 cards.
  • The same LocalLLaMA post said Qwen 3.6 35B-A3B Q8 reached about 80 to 90 tok/s when using layer split mode.
  • A localLLM benchmark post measured Qwen3.6-35B-A3B Q4 at 50.95 tok/s decode and 327.6 tok/s prefill on a 24GB Intel Battlemage B60-class card at 8K context.

These are not apples-to-apples. Quant, backend, context length, KV cache settings, and whether the model is dense or A3B all move the result. But they do point in one direction: the floor for usable local Qwen3.6 speed is noticeably higher with MTP than it was before.

The knobs that matter

r/LocalLLaMA

Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled

The interesting part of the community reports is how much of the gain comes from very specific runtime choices, not just from downloading a newer file.

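A few knobs show up repeatedly in the reports collected here: the speculative draft length (one setup used --spec-draft-n-max 4), layer split mode across multiple GPUs, flash attention, and the usual quant, context length, and KV cache choices.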

That is why the rollout looks less like one benchmark chart and more like a new bag of tricks for people already living in llama.cpp.

ROCm is still the messy part

r/LocalLLaMA

RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed

The clean speedup story breaks on AMD software stacks. One LocalLLaMA user said stock ROCm builds crashed when flash attention was enabled on RDNA2, then published a custom binary that bypassed the failing assert and reported decode moving from about 30 tok/s on Vulkan to 70 to 80 tok/s on the patched ROCm build.

The comments matter because they complicate the first claim instead of simply cheering it on. In that same thread, another user argued the workaround was patching around an old ROCm 7.1.1 bug, not a current llama.cpp limitation, and said newer ROCm versions plus newer llama.cpp builds already fixed parts of the MTP prompt-processing path.

So the practical read is narrower than the headline. MTP support shipped, but whether it feels turnkey still depends on your backend, driver age, and how lucky you get with flash attention.

Older cards suddenly look less old

The most useful detail is not the headline benchmark but the hardware recycling story around it. One LocalLLaMA user said a server with four 16GB RTX A4000 cards went from about 12 tok/s before the right MTP setup to 45 to 65 tok/s afterward, enough to make a previously regretted box feel usable again.
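The size of that jump is easier to judge as a ratio. The small sketch below does the arithmetic using only the figures from that report; `speedup_range` is a hypothetical helper name.

```python
from typing import Tuple


def speedup_range(before: float, after_low: float, after_high: float) -> Tuple[float, float]:
    """Return the (low, high) speedup implied by a before/after tok/s report."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after_low / before, after_high / before


# Figures from the four-A4000 report: about 12 tok/s before, 45 to 65 after.
low, high = speedup_range(12.0, 45.0, 65.0)
print(f"{low:.2f}x to {high:.2f}x")  # prints "3.75x to 5.42x"
```

That is well above the 1.4 to 2.2x Unsloth attributes to MTP, which suggests the "right MTP setup" may have included other configuration changes as well, though the report as summarized here does not break the gain down.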

That same tone showed up elsewhere. mervenoyann said the update was enough to revisit a Pi and Hermes setup, while a retweeted reaction from victormustar framed MTP-enabled llama.cpp as the point where local models start feeling fast enough for daily-driver use. The hype is still benchmark-shaped, but the new fact is simpler: MTP is making old local hardware worth another pass.
