releaseMay 18, 2026

llama.cpp adds MTP for Qwen3.6 and pushes local 27B decode to 70-160 tok/s

llama.cpp added multi-token prediction for the Qwen3.6 family, and Unsloth published MTP GGUFs claiming about 1.4-2.2x faster local generation. The update moves Qwen3.6 closer to daily-driver speeds on commodity hardware, though results still vary by ROCm build and quant.

5 min read

llama.cpp adds MTP for Qwen3.6 and pushes local 27B decode to 70-160 tok/s

TL;DR

ggerganov's llama.cpp update added multi-token prediction, or MTP, for the Qwen3.6 family, and UnslothAI's release thread paired it with new GGUFs that claimed roughly 1.4 to 2.2x faster generation.
According to UnslothAI's benchmarks, Qwen3.6-27B MTP hit 160 tok/s and Qwen3.6-35B-A3B hit 240 tok/s, while UnslothAI's H100 demo showed a 4-bit 27B GGUF running at 96.4 tok/s.
The official pieces are already live across the stack: the llama.cpp repository, Unsloth's Qwen3.6 MTP guide, and Unsloth's 27B MTP GGUF page.
Community reports suggest the speedups are real but hardware-dependent, because a four-A4000 setup jumped from roughly 12 tok/s to 45 to 65 tok/s with configuration changes, while a LocalLLaMA ROCm thread ran into flash-attention crashes on older RDNA2 software stacks.

You can browse the llama.cpp codebase, pull Unsloth's MTP GGUFs, and follow the exact Qwen3.6 MTP setup guide. The fun part is how quickly the reports converged: UnslothAI's demo clip showed nearly 100 tok/s on a 4-bit H100 run, one LocalLLaMA setup claimed full-context Q8 inference across four older RTX A4000s, and another LocalLLaMA thread turned into a very specific ROCm debugging session about why flash attention was crashing on gfx1030 and gfx1031.

MTP lands in llama.cpp

The core news is small and consequential. llama.cpp now supports MTP for Qwen3.6, which lets the runtime speculate multiple next tokens instead of stepping one token at a time.

Unsloth immediately shipped matching GGUFs and positioned the gain in plain terms: about 1.4 to 2.2x faster generation with no accuracy change, plus a 27B variant that can run locally in about 18GB RAM. That is Christmas-come-early-for-local-inference people because it moves Qwen3.6 closer to the point where "daily driver" stops sounding aspirational and starts sounding normal.

The numbers people are posting

The early performance claims break into a few distinct buckets:

UnslothAI's release thread said Qwen3.6-27B MTP runs at 160 tok/s.
The same thread said Qwen3.6-35B-A3B reaches 240 tok/s.
In UnslothAI's follow-up demo, a 4-bit Qwen3.6-27B-MTP-GGUF hit 96.4 tok/s in Unsloth Studio on an H100.
A LocalLLaMA four-GPU report said Qwen 3.6 27B Q8 with --spec-draft-n-max 4 ran at 45-ish tok/s for reasoning and in the 60s for coding on four RTX A4000 cards.
The same LocalLLaMA post said Qwen 3.6 35B-A3B Q8 reached about 80 to 90 tok/s when split by layer mode.
A localLLM benchmark post measured Qwen3.6-35B-A3B Q4 at 50.95 tok/s decode and 327.6 tok/s prefill on a 24GB Intel Battlemage B60-class card at 8K context.

These are not apples-to-apples. Quant, backend, context length, KV cache settings, and whether the model is dense or A3B all move the result. But they do point in one direction: the post-MTP floor for usable local Qwen3.6 is noticeably higher than the pre-MTP vibe.

The knobs that matter

r/LocalLLaMA

Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled

3 comments

The interesting part of the community reports is how much of the gain comes from very specific runtime choices, not just from downloading a newer file.

A few knobs show up repeatedly:

--spec-type mtp enables speculative decoding with the Qwen3.6 MTP variants, as the RDNA2 workaround post and the four-A4000 setup both show.
--spec-draft-n-max changes how aggressively the draft model proposes tokens. The four-A4000 setup said 4 worked best there, while the RDNA2 post used 2.
Flash attention is a major lever. The RDNA2 report tied its speedup to getting flash attention unstuck on older AMD hardware, and the Battlemage benchmark ran with flash attention on.
Quantization still decides whether a setup feels practical. The four-A4000 report used Q8 variants at full context without KV-cache quantization, while the Battlemage post used a ~21GB UD-Q4_K_M file to keep a 35B-A3B model inside 24GB VRAM.

That is why the rollout looks less like one benchmark chart and more like a new bag of tricks for people already living in llama.cpp.

ROCm is still the messy part

r/LocalLLaMA

RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed

10 comments

The clean speedup story breaks on AMD software stacks. One LocalLLaMA user said stock ROCm builds crashed when flash attention was enabled on RDNA2, then published a custom binary that bypassed the failing assert and reported decode moving from about 30 tok/s on Vulkan to 70 to 80 tok/s on the patched ROCm build.

The comments matter because they complicate the first claim instead of simply cheering it on. In that same thread, another user argued the workaround was patching around an old ROCm 7.1.1 bug, not a current llama.cpp limitation, and said newer ROCm versions plus newer llama.cpp builds already fixed parts of the MTP prompt-processing path.

So the practical read is narrower than the headline. MTP support shipped, but whether it feels turnkey still depends on your backend, driver age, and how lucky you get with flash attention.

Older cards suddenly look less old

r/LocalLLaMA

Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled

3 comments

The most useful late reveal is not the headline benchmark, it is the hardware recycling story around it. One LocalLLaMA user said a server with four 16GB RTX A4000 cards went from about 12 tok/s before the right MTP setup to 45 to 65 tok/s afterward, enough to make a previously regretted box feel usable again.

That same tone showed up elsewhere. mervenoyann said the update was enough to revisit a Pi and Hermes setup, while a retweeted reaction from victormustar framed MTP-enabled llama.cpp as the point where local models start feeling fast enough for daily-driver use. The hype is still benchmark-shaped, but the new fact is simpler: MTP is making old local hardware worth another pass.

TL;DR

MTP lands in llama.cpp

The numbers people are posting

The knobs that matter

ROCm is still the messy part

Older cards suddenly look less old

Discussion across the web