AI Primer

Gemma 4 adds MTP drafters for up to 3x faster decoding

Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped day one, so local and server deployments can try the speedup immediately.


TL;DR

  • Google shipped Multi-Token Prediction drafters for all four Gemma 4 checkpoints, and osanseviero's launch post says they deliver up to 3x faster decoding with the same output quality.
  • According to WesRoth's summary and Google's announcement, the speedup comes from speculative decoding, where a lightweight drafter proposes several tokens and the full model verifies them in parallel.
  • Runtime support landed on day one: lmsysorg's SGLang post and vLLM's launch post both shipped ready-to-run Gemma 4 MTP configs at release.
  • The under-the-hood trick is cache reuse, not a separate mini model bolted on the side. Hugging Face's v5.8.0 release says the Gemma 4 Assistant reuses the target model's KV cache and skips prefill entirely.
  • The rollout already spans local and edge surfaces: osanseviero's ecosystem list names Transformers, vLLM, MLX, SGLang, Ollama, and AI Edge Gallery, while Ollama v0.23.1 added a Mac-ready gemma4:31b-coding-mtp-bf16 build the same day.

You can read Google's post, skim the Transformers release note for the assistant-model mechanics, and grab ready-made recipes from SGLang, vLLM, and Ollama. Google also buried one useful caveat in the blog: the 26B MoE can speed up more on Apple Silicon and A100 once batch size rises above 1.

Speculative decoding

Google frames the bottleneck plainly in its announcement: normal autoregressive decoding is memory-bandwidth bound, so hardware spends most of its time moving weights just to emit one token.

MTP splits that path in two. A lighter drafter predicts several next tokens, then the full Gemma 4 model verifies that whole chunk in one pass. If the draft is accepted, the model can emit the drafted sequence plus one extra token in roughly the time a normal decode step would have produced one.
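To make the accept/reject mechanics concrete, here is a toy sketch of that loop in Python. The draft_next and target_greedy functions are placeholders rather than real models, and verification is greedy; the production path scores every drafted position in a single parallel forward pass.

    # Toy speculative decoding loop: a cheap drafter proposes k tokens, the target
    # verifies them and keeps the longest matching prefix, plus one bonus token.
    def draft_next(tokens, k):
        # Placeholder drafter: guess the next k tokens cheaply.
        return [(tokens[-1] + 1 + i) % 50 for i in range(k)]

    def target_greedy(tokens):
        # Placeholder target model: its "true" greedy next token for a prefix.
        return (tokens[-1] + 1) % 50

    def speculative_step(tokens, k=4):
        draft = draft_next(tokens, k)
        accepted = []
        for tok in draft:
            # In a real engine this check happens for all positions in one parallel pass.
            if target_greedy(tokens + accepted) == tok:
                accepted.append(tok)
            else:
                break
        # The target's own prediction at the first mismatch (or after a full accept)
        # comes essentially for free, so each step yields len(accepted) + 1 tokens.
        bonus = target_greedy(tokens + accepted)
        return tokens + accepted + [bonus]

    seq = [7]
    for _ in range(3):
        seq = speculative_step(seq)
    print(seq)  # grows by up to k + 1 tokens per step instead of 1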

Speedups by hardware

The launch chart did not reduce to a single headline number; Google broke the speedups out by model and hardware.

The same Google post adds a useful qualifier: batch size 1 is a rougher fit for the 26B MoE on Apple Silicon, but batch sizes 4 to 8 can push local speedups to about 2.2x, with similar gains on A100 once concurrency rises.

Day-one runtimes

This was a real rollout, not just model weights: osanseviero's ecosystem post put Transformers, vLLM, MLX, SGLang, Ollama, and AI Edge Gallery on the initial support list.

The SGLang cookbook exposes the serving knobs directly in its Gemma 4 docs: --speculative-algorithm NEXTN, a paired *-assistant drafter path, --speculative-num-steps 5, and --speculative-num-draft-tokens 6. For the 31B and 26B-A4B variants, lmsysorg's command screenshot shows tensor parallel settings, and the docs note that 26B-A4B requires --tp 2 when MTP is enabled.
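Assembled into a full launch line, those knobs look roughly like the sketch below. The checkpoint and drafter paths are placeholders and the drafter-path flag name is an assumption; only the speculative flags quoted above come straight from the SGLang docs.

    import subprocess

    # Hypothetical SGLang launch for the 26B-A4B variant with MTP enabled.
    # Model and drafter repo names are placeholders, not confirmed Hub paths.
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "google/gemma-4-26b-a4b-it",                              # placeholder
        "--speculative-draft-model-path", "google/gemma-4-26b-a4b-it-assistant",  # assumed flag name + placeholder
        "--speculative-algorithm", "NEXTN",
        "--speculative-num-steps", "5",
        "--speculative-num-draft-tokens", "6",
        "--tp", "2",  # the docs say 26B-A4B needs tensor parallel 2 with MTP on
    ]
    subprocess.run(cmd, check=True)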

vLLM shipped a Docker image and recipe on day zero. In the example screenshot from vLLM's launch post, the serve command wires Gemma 4 into auto tool choice, the gemma4 tool parser, and a speculative config block for draft decoding, which matters because Gemma 4's MTP path landed alongside the model family's tool-calling and multimodal surfaces.
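For a quick local check without the Docker recipe, vLLM's offline API can take an equivalent speculative setup. A minimal sketch, assuming the speculative_config engine argument and placeholder checkpoint names; the tool-calling flags from the launch post only matter for the served endpoint.

    from vllm import LLM, SamplingParams

    # Placeholder repo names; the speculative_config keys mirror vLLM's generic
    # draft-model speculative decoding and are an assumption for Gemma 4.
    llm = LLM(
        model="google/gemma-4-31b-it",                   # placeholder target
        speculative_config={
            "model": "google/gemma-4-31b-it-assistant",  # placeholder MTP drafter
            "num_speculative_tokens": 5,
        },
    )
    params = SamplingParams(temperature=0.0, max_tokens=128)
    out = llm.generate(["Summarize speculative decoding in two sentences."], params)
    print(out[0].outputs[0].text)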

Assistant checkpoints

The assistant side of this release is more specialized than the marketing suggests. The Transformers v5.8.0 note describes Gemma 4 Assistant as a small text-only model that keeps the same Gemma4TextModel backbone, reuses the target model's KV cache across the whole network, and skips prefill entirely.
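In user code this surfaces through the assistant_model argument that Transformers already uses for assisted generation. A minimal sketch, with hypothetical Hub names for the target and assistant checkpoints:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder repo names; the assistant checkpoint is the drafter described above.
    tok = AutoTokenizer.from_pretrained("google/gemma-4-31b-it")
    target = AutoModelForCausalLM.from_pretrained("google/gemma-4-31b-it", device_map="auto")
    drafter = AutoModelForCausalLM.from_pretrained("google/gemma-4-31b-it-assistant", device_map="auto")

    inputs = tok("Explain KV-cache reuse in one sentence.", return_tensors="pt").to(target.device)
    # assistant_model switches generate() into assisted (speculative) decoding.
    out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))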

The same release note says the assistant uses cross-attention to the target model's context so it can draft more tokens per round. In parallel, a vLLM pull request says E2B and E4B assistants use centroid masking to shrink lm_head work from about 262K vocabulary candidates to about 4K, which is an unusually concrete hint at where Google's edge gains are coming from.
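A rough sketch of what that lm_head restriction means in tensor terms, with illustrative shapes and a random candidate set standing in for the centroid step the PR describes:

    import torch

    vocab, hidden, k = 262_144, 64, 4_096       # illustrative shapes, not Gemma 4's real dims
    lm_head = torch.randn(vocab, hidden)        # full output projection
    h = torch.randn(hidden)                     # drafter hidden state for one position

    # Stand-in for centroid masking: score ~4K candidate ids instead of all ~262K.
    candidates = torch.randperm(vocab)[:k]
    logits = lm_head[candidates] @ h            # shape (k,) instead of (vocab,)
    next_token = candidates[logits.argmax()]
    print(int(next_token))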

Local and edge surfaces

Google's announcement says the same Apache 2.0 licensing carries over to the MTP drafters, with weights on Hugging Face and Kaggle and direct use in AI Edge Gallery on Android and iOS.

The Mac path showed up immediately too. Ollama v0.23.1 added Gemma 4 MTP support for its MLX runner, claimed more than 2x speedup for 31B on coding tasks, and published a concrete slug, gemma4:31b-coding-mtp-bf16, for anyone who wanted the shortest route from announcement to terminal.
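If Ollama is already installed, the published slug drops straight into the Python client. A minimal sketch, assuming the build has been fetched first with ollama pull gemma4:31b-coding-mtp-bf16:

    import ollama  # pip install ollama

    # Slug from the Ollama v0.23.1 release note; pull it before running this.
    resp = ollama.chat(
        model="gemma4:31b-coding-mtp-bf16",
        messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    )
    print(resp["message"]["content"])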
