AI Primer

Gemma 4 MTP benchmarks 138 tok/s in llama.cpp on M5 Max

Community ports brought Gemma 4 multi-token prediction into llama.cpp and MLX Swift, with one M5 Max report moving from 97 to 138 tok/s and another showing 30-40% faster decoding. The gains extend MTP into local runtimes used for on-device coding and long-context work.

4 min read

TL;DR

You can browse Google's Gemma 4 model page, inspect the TurboQuant llama.cpp fork linked by testingcatalog's follow-up, and peek at the draft llama.cpp PR #22673 that the Hugging Face repost screenshot captured. There is also an early MLX Swift port, which martinbowling's repost says is already showing 30 to 40 percent faster decoding on Gemma 31B.

llama.cpp

The cleanest datapoint comes from Apple Silicon. In WesRoth's clip, Gemma 4 moves from 97 tok/s to 138 tok/s on an M5 Max, and testingcatalog's post says the setup used quantized Gemma 4 assistant models in GGUF inside a patched llama.cpp flow.

The linked TurboQuant fork frames Gemma 4 MTP as a 30 to 50 percent short-prompt throughput gain, which lines up with the tweet-level reports instead of wildly exceeding them.
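A quick back-of-the-envelope check confirms the two claims agree. Using only the numbers reported above (97 and 138 tok/s from the M5 Max report, 30 to 50 percent from the fork's framing):

```python
# Reported throughput on M5 Max, per the cited clip.
baseline_tps = 97.0   # tok/s without MTP
mtp_tps = 138.0       # tok/s with the patched MTP flow

# Relative speedup, to compare against the fork's 30-50% claim.
speedup_pct = (mtp_tps / baseline_tps - 1) * 100
print(f"MTP speedup: {speedup_pct:.0f}%")
```

That works out to roughly a 42 percent gain, which sits comfortably inside the fork's 30 to 50 percent band.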

PR design

The interesting bit is how the draft support is wired. Clement Delangue's screenshot of PR #22673 shows the MTP head as a separate model loaded from the same GGUF, while the Hugging Face repost screenshot adds that the implementation keeps its own context and KV cache and introduces a hook so hidden features propagate after each microbatch.
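The PR's code isn't reproduced here, but the wiring described matches the usual speculative-decoding shape: a cheap draft head proposes a few tokens, the main model verifies them in one batched pass, and only the agreeing prefix is kept, plus one guaranteed token from the main model. A toy simulation of that loop (all model logic is stubbed; `draft_next`, `target_next`, and the token values are illustrative, not from the PR):

```python
import random

random.seed(0)

def target_next(ctx):
    # Stand-in for the main model's next-token choice at this position.
    return (sum(ctx) * 31) % 1000

def draft_next(ctx):
    # Stand-in for the MTP head: a real head runs its own small forward
    # pass with its own KV cache. Here it simply guesses right ~75% of
    # the time, mimicking the reported steady-state acceptance rate.
    truth = target_next(ctx)
    return truth if random.random() < 0.75 else truth + 1

def speculative_step(ctx, n_draft=3):
    """One decode step: draft n_draft tokens, verify, keep agreeing prefix."""
    drafts = []
    for _ in range(n_draft):
        drafts.append(draft_next(ctx + drafts))
    # Verification is conceptually one batched main-model forward over
    # the drafted positions; this toy checks them one by one instead.
    accepted = []
    for d in drafts:
        if d == target_next(ctx + accepted):
            accepted.append(d)
        else:
            break
    # The main model always emits one token of its own at the first
    # mismatch (or after a fully accepted draft), so progress is >= 1.
    accepted.append(target_next(ctx + accepted))
    return accepted

ctx = [1, 2, 3]
for _ in range(5):
    out = speculative_step(ctx)
    ctx += out
```

Each step advances between one and four tokens for the cost of one main-model pass, which is where the throughput headroom comes from.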

That same screenshot says the author saw roughly 75 percent steady-state acceptance with three draft tokens on Qwen tests, claiming more than 2x over baseline there. The Gemma numbers circulating this week are lower, which makes the 97 to 138 tok/s report look like an early real-world port result, not a cherry-picked upper bound.
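If the 75 percent figure is read as a per-token acceptance probability (the screenshot doesn't pin down exactly what it measures), the expected progress per main-model pass with three draft tokens lands close to the claimed 2x:

```python
alpha = 0.75   # assumed per-token acceptance probability
k = 3          # draft tokens per verification pass

# P(first i drafts all accepted) = alpha^i, so expected accepted
# draft tokens per pass is the sum of alpha^i for i = 1..k.
expected_accepted = sum(alpha**i for i in range(1, k + 1))

# Plus the one token the main model emits itself on every pass.
tokens_per_pass = expected_accepted + 1
print(f"{tokens_per_pass:.2f} tokens per main-model pass")
```

That gives about 2.73 tokens per pass versus 1 for plain decoding, an upper bound of roughly 2.7x before draft-head overhead, consistent with the author's "more than 2x" claim on Qwen.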

MLX Swift

The speedup is not confined to one runtime. martinbowling's repost of Adrien Grondin says an early MLX Swift port has Gemma 31B running 30 to 40 percent faster on M5 Max, which suggests the gain is tied to Gemma 4's MTP setup rather than one fork's local trick.

The ecosystem timing matters here because Gemma 4 is already being pushed into consumer-local workflows. The HN thread is full of people testing it in llama-server, Ollama, LM Studio, LiteRT-LM, and on Macs and Raspberry Pi, while the discussion digest calls out one M1 Max user getting about 40 tok/s on the 26B-A4B variant with a 37K-token initial context.

Long context

The last useful reveal is where people want this headroom to go. Clement Delangue's repost of Garry Tan highlighted Gemma 4's claimed 1M-token context on a 128 GB MacBook Pro, and rohanpaul_ai's explainer notes why llama.cpp support spreads fast: GGUF-backed features tend to get pulled into desktop apps, coding agents, and private on-device assistants.

That does not prove long-context agent workloads will scale linearly with MTP, but it does explain why this patch got immediate attention. Gemma 4 already had the local-model audience, and MTP gives that audience a cheaper way to trade assistant heads for more usable throughput.
