AI Primer

Unsloth updates Qwen3.5 MTP GGUFs with draft-mtp flags for 1.8x speed

Unsloth said its updated Qwen3.5 MTP GGUFs now run about 1.8x faster after llama.cpp added a --spec-draft-p-min 0.75 acceptance threshold and renamed the mode to draft-mtp. The update also raises draft-token settings and expands the small-model MTP set for local runners.


TL;DR

Jump straight to Unsloth's MTP guide, or inspect the exact llama.cpp change in PR #22673. The weirdest detail is that the rollout landed with a naming correction: the small MTP quants initially described as Qwen3.6 were later corrected to Qwen3.5 in danielhanchen's follow-up.

Speedup jump

The before-and-after here is unusually concrete. The May 13 post put average MTP speedup around 1.4x for dense models and 1.15 to 1.2x for the MoE model, with 140 tokens/s on Qwen3.6 27B MTP and 220 tokens/s on 35B-A3B MTP.

Two days later, danielhanchen's update said the same setup now reaches about 1.8x. The stated reason is a new acceptance threshold, --spec-draft-p-min 0.75, added in llama.cpp PR #22673.

The original release also came with a limit: in the May 13 benchmarks, Unsloth recommended no more than 2 draft tokens because acceptance dropped from 83% to 50% at 4 draft tokens.
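That 2-draft recommendation is consistent with the standard speculative-decoding arithmetic. As a rough sketch, treating the quoted percentages as independent per-token acceptance probabilities (an assumption; the article does not say how acceptance is measured), the expected tokens emitted per target-model pass is a geometric sum, and 2 drafts at 83% beats 4 drafts at 50%:

```python
def expected_tokens_per_pass(k: int, p: float) -> float:
    """Expected tokens emitted per target-model verification pass in
    speculative decoding: each of the k draft tokens is accepted with
    independent per-token probability p, and the target model always
    contributes one token itself. Equals (1 - p**(k + 1)) / (1 - p)."""
    return sum(p ** i for i in range(k + 1))

# 2 drafts at 83% acceptance: ~2.52 tokens per pass
print(expected_tokens_per_pass(2, 0.83))
# 4 drafts at 50% acceptance: ~1.94 tokens per pass
print(expected_tokens_per_pass(4, 0.50))
```

Under this simplified model, pushing to 4 draft tokens at the lower acceptance rate actually yields fewer tokens per verification pass, which matches Unsloth's advice.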

Flags and fallback behavior

The updated branch changed the command surface in three places:

- a new acceptance threshold, --spec-draft-p-min 0.75, from llama.cpp PR #22673
- the speculative mode renamed to draft-mtp
- raised draft-token settings

There is also a rollback knob. For users seeing regressions on the updated branch, danielhanchen's update says setting --spec-draft-p-min back to 0.0 restores the old behavior.
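Putting those knobs together, a launch command might look like the sketch below. Only --spec-draft-p-min and its 0.75 and 0.0 values come from the article; the binary name, model filename, and the draft-token flag spelling are illustrative assumptions, not confirmed syntax.

```shell
# Hypothetical llama.cpp server launch with the updated MTP settings.
# Only --spec-draft-p-min is named in the article (llama.cpp PR #22673);
# the model filename and the --draft-max spelling are assumptions.
./llama-server \
  -m qwen-mtp.gguf \
  --spec-draft-p-min 0.75 \
  --draft-max 2

# Rollback knob: restore the pre-PR behavior by zeroing the threshold.
# ./llama-server -m qwen-mtp.gguf --spec-draft-p-min 0.0 --draft-max 2
```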

Model coverage and the naming correction

The first release, in the May 13 announcement, centered on two bigger checkpoints: Qwen3.6 27B and Qwen3.6 35B-A3B. The May 15 update added small-model MTP GGUFs, but danielhanchen's correction says those are actually Qwen3.5 0.8B, 2B, 4B, and 9B.

That same correction also sketches the next batch: the follow-up says Unsloth is working on Qwen3.5-122B and Qwen3.5-397B MTP variants. For local runners, that makes this story less about one benchmark jump and more about MTP support spreading across the Qwen GGUF lineup.
