Update: May 17, 2026

Practitioners benchmark Qwen3.6 and Gemma 4 at 40-65 tok/s on M3 Ultra, iPhone 17 Pro, and 4x A4000

New reports show Qwen3.6 and Gemma 4 running locally across Apple and Nvidia setups, with wide variance tied to context length, runtime choice, and MTP tuning. This matters because the latest open models are reaching usable agent speeds on consumer hardware, but prefill and long-context performance still cap throughput.

TL;DR

JamieAndLion's M3 Ultra post shows why local benchmark screenshots can mislead: at 128K to 400K context, an M3 Ultra reported just 9.7 tok/s on Qwen3.6-27B but 45.7 tok/s on the 35B-A3B MoE, and commenters tied much of the gap to long-context prefill and runtime choice.
danielhanchen's MTP update and earlier benchmark thread show Qwen3.6 MTP configs changing within days, with reported speedups moving from about 1.4x to 1.8x after new llama.cpp flags and a rename from --spec-type mtp to --spec-type draft-mtp.
On older Nvidia hardware, Alternative_Ad4267's four-A4000 setup reported 45 to 65 tok/s on Qwen3.6-27B Q8 with MTP, while SilverBoko's P100 benchmarks put Qwen3.6-35B-A3B around 41 to 47 tok/s generation and dense Gemma 4 31B around 8.7 to 9.5 tok/s.
rohanpaul_ai's iPhone demo put Gemma 4 E2B at roughly 40 tok/s on an iPhone 17 Pro, while vllm_project's v0.21.0 release thread added Gemma 4 MTP support to vLLM alongside reasoning-aware speculative decoding.
Conscious-Track5313's Mac sandbox post and Strange_Test7665's parallel llama.cpp post show the next layer of local experimentation: not just chat, but persistent Linux sandboxes, shared workspaces, and multi-user serving on mixed consumer GPUs.

You can watch the iPhone 17 Pro demo of Gemma 4 coding at phone speeds, dig through Google's Gemma 4 model page, check Unsloth's Qwen 3.6 MTP guide, and read the vLLM v0.21.0 notes for the serving-side changes that now mention Gemma 4 MTP directly.

M3 Ultra and context length

r/localLLM: "M3 Ultra Mac feels rather slow" (10 comments)

The cleanest reality check in the batch came from JamieAndLion's numbers, because they included actual context sizes. Qwen3.6-27B-UD-MLX-4bit managed 159.4 tok/s prompt processing and 9.7 tok/s generation, while Qwen3.6-35B-A3B-UD-MLX-4bit hit 790.7 tok/s prompt processing and 45.7 tok/s generation.

The comments converged on two explanations. According to the same thread, 128K to 400K context is large enough to crush the headline tok/s people quote for short prompts, and one commenter said oMLX was slower than llama.cpp on the same MLX model files.

That lines up with oMLX's own compare page, which is built around context-sensitive hardware comparisons rather than single short-context brag numbers.
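A rough back-of-the-envelope sketch shows why long context dominates the experience. The prompt-processing and generation rates below are the ones reported in the thread; the 128K-token prompt and 1,000-token reply are illustrative assumptions, and the model is deliberately simplified (constant rates, no prompt caching, no slowdown as context fills).

```python
def end_to_end(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Return (total seconds, effective output tok/s) for one request.

    Simplified model: constant prefill and generation rates, no prompt cache.
    """
    prefill_s = prompt_tokens / prefill_tps
    gen_s = output_tokens / gen_tps
    total_s = prefill_s + gen_s
    return total_s, output_tokens / total_s


# Rates reported in the thread: (prompt processing tok/s, generation tok/s).
REPORTED = {
    "Qwen3.6-27B-UD-MLX-4bit": (159.4, 9.7),
    "Qwen3.6-35B-A3B-UD-MLX-4bit": (790.7, 45.7),
}

# Assumed workload, not from the thread: 128K-token prompt, 1,000-token reply.
PROMPT_TOKENS = 128_000
OUTPUT_TOKENS = 1_000

for name, (prefill_tps, gen_tps) in REPORTED.items():
    total_s, effective = end_to_end(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps, gen_tps)
    print(f"{name}: {total_s / 60:.1f} min total, {effective:.2f} effective tok/s")
```

Under these assumptions the dense model needs roughly 15 minutes per uncached request and the MoE about 3 minutes, with most of that time spent in prefill rather than generation.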

MTP is moving faster than the benchmarks

The most annoying part of comparing Qwen3.6 local runs right now is that the recommended flags changed almost immediately. In the earlier thread, danielhanchen said Qwen3.6 27B MTP ran at 140 tok/s and Qwen3.6 35B-A3B MTP reached 220 tok/s generation on a single GPU, with average gains around 1.4x for dense models and 1.15 to 1.2x for the MoE.

Two days later, the follow-up update said the reported speedup had moved from about 1.4x to about 1.8x after llama.cpp added --spec-draft-p-min 0.75. That post also changed the recommended invocation from --spec-type mtp to --spec-type draft-mtp, pushed --spec-draft-n-max upward, and noted that setting --spec-draft-p-min 0.0 restores the older behavior for people seeing regressions.

The official paper trail is public: the llama.cpp PR and Unsloth's MTP guide are now part of the benchmark itself, not background reading.
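Because the flags are moving, it can help to keep the invocation in one place. The sketch below only assembles a command line; it uses the flag names exactly as quoted in the posts and has not been checked against any particular llama.cpp build. The model path is hypothetical, and the `draft_n_max` value is a placeholder, since the posts say the setting was pushed upward without a number being quoted here.

```python
import shlex


def build_mtp_args(model_path, draft_n_max, draft_p_min=0.75, legacy_behavior=False):
    """Assemble a llama-server command line with the MTP flags quoted in the posts.

    legacy_behavior=True sets --spec-draft-p-min to 0.0, which the follow-up
    post describes as restoring the older behavior.
    """
    if legacy_behavior:
        draft_p_min = 0.0
    return [
        "llama-server",
        "-m", model_path,
        "--spec-type", "draft-mtp",  # renamed from "mtp" per the follow-up post
        "--spec-draft-n-max", str(draft_n_max),
        "--spec-draft-p-min", str(draft_p_min),
    ]


# Hypothetical model path and placeholder draft length, for illustration only.
args = build_mtp_args("models/qwen3.6-27b.gguf", draft_n_max=4)
print(shlex.join(args))
```

Recording the exact argument list next to each benchmark number makes it possible to tell later whether a result used the older or newer defaults.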

Older Nvidia rigs are still viable

r/LocalLLaMA: "Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled" (3 comments)

r/localLLM: "New LLMs on old hardware" (0 comments)

Two practitioner posts landed on the same conclusion from very different hardware generations: MoE variants are making old boxes look better than dense models.

From Alternative_Ad4267's four RTX A4000 run:

Qwen3.6-27B Q8 with MTP delivered about 45 tok/s for reasoning and 60-plus tok/s for coding.
Qwen3.6-35B-A3B Q8 MoE reached about 80 to 90 tok/s.
The same author said configuration changes moved the 27B setup from roughly 12 tok/s to 45 to 65 tok/s.

From SilverBoko's dual-P100 benchmarks on a Dell R730:

Qwen3.6-35B-A3B generated at 41.4 to 47.2 tok/s across prompt sizes from 1.5K to 17.5K.
Qwen3.6-27B dense sat near 10.4 to 11.2 tok/s generation.
Gemma 4 31B-it dense sat near 8.7 to 9.5 tok/s generation.

That gap also matches the complaint in the M3 Ultra thread, where the MoE model looked much healthier than the dense one under long context.
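The size of the MoE advantage on the dual-P100 box can be bounded from the reported ranges. This is a small sketch using only the generation figures listed above; pairing the extremes is an assumption, since the low and high ends of each range may come from different prompt sizes.

```python
# Generation tok/s ranges reported for the dual-P100 box: (low, high).
MOE = (41.4, 47.2)  # Qwen3.6-35B-A3B
DENSE = {
    "Qwen3.6-27B": (10.4, 11.2),
    "Gemma 4 31B-it": (8.7, 9.5),
}


def speedup_range(moe, dense):
    """Most conservative and most generous MoE-over-dense ratios."""
    low = moe[0] / dense[1]   # slowest MoE run vs fastest dense run
    high = moe[1] / dense[0]  # fastest MoE run vs slowest dense run
    return low, high


for name, dense in DENSE.items():
    low, high = speedup_range(MOE, dense)
    print(f"35B-A3B vs {name}: {low:.1f}x to {high:.1f}x")
```

On these numbers the MoE generates somewhere between roughly 3.7x and 5.4x faster than the dense models on the same hardware.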

Gemma 4 is already showing up on phones and runtimes

The most fun data point here is still the iPhone 17 Pro clip, which claimed Gemma 4 E2B running fully offline at about 40 tok/s with MLX, 128K context, and thinking mode.

The serving stack around Gemma 4 also moved. In vLLM's release thread, the project said v0.21.0 adds Gemma 4 MTP support, reasoning-budget-aware speculative decoding, and a long list of hardware-specific kernels across Blackwell, ROCm, Intel XPU, CPU, and IBM Power.

Community discussion around the original Gemma 4 release stayed focused on exactly these deployment details. According to the HN discussion summary, one top commenter reported the 26B-A4B variant at roughly 40 tok/s with 37K context on an M1 Max in a Claude Code style harness, while another pointed readers to LiteRT-LM and Modular MAX for faster local and edge runtimes.

Sandboxes and parallel users

r/localLLM: "Running Linux sandbox as tool for AI models on Mac - no Docker, no remote VMs, all inside single app" (0 comments)

r/LocalLLaMA: "local llama.cpp parallel users - still so fast?!" (0 comments)

The most interesting posts were not pure benchmarks. Conscious-Track5313's write-up described a local agent app on macOS 26 that uses Apple's open source Containerization framework to boot an Alpine VM in about six seconds, mount the project at /workspace, persist package installs across sessions, and expose both a run_command tool and a live terminal inside the same app.

Strange_Test7665's llama.cpp post hit a different practical question: a mixed 5090 plus 5060 rig serving Qwen3.6-27B Q8 at about 30 tok/s for one user still managed about 24 tok/s across three concurrent users. Local model work is drifting away from single-chat vanity numbers and toward agent harnesses, shared sandboxes, and small multi-user services.
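The multi-user figure is easier to read as aggregate throughput. The sketch below assumes the roughly 24 tok/s figure is per user, which the summary above does not state explicitly; if it were instead the combined rate across all three users, the conclusion would be very different.

```python
def concurrency_summary(single_tps, per_user_tps, users):
    """Summarize batching efficiency, assuming per_user_tps is a per-user rate."""
    aggregate = per_user_tps * users
    return {
        "aggregate_tps": aggregate,
        "per_user_retained": per_user_tps / single_tps,  # share of solo speed kept
        "aggregate_scaling": aggregate / single_tps,     # total vs one-user throughput
    }


# Figures from the post; the per-user reading is an assumption.
summary = concurrency_summary(single_tps=30.0, per_user_tps=24.0, users=3)
print(summary)
```

Under that reading, each user keeps about 80% of the single-user speed while the rig as a whole delivers about 2.4x the single-user throughput.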
