Local users report DeepSeek V4 Flash, Qwen 3.6, and Gemma 4 at 40-200 tok/s on Macs and 3090s
Developers posted new local-model measurements for DeepSeek V4 Flash, Qwen 3.6, and Gemma 4: about 40 tok/s on an M3 Ultra, 70+ tok/s on MacBooks with MPS, and 120-200 tok/s for Qwen3.6-27B on a single RTX 3090. The numbers suggest coding-capable local runs are moving from demos toward regular use.

TL;DR
- teortaxesTex's repost of cheenanet put Qwen3.6-27B at 120 to 200 tok/s on a single RTX 3090, while nummanali claimed 70+ tok/s from Qwen 3.6 and Gemma 4 on 32 GB+ MacBook Pros using MPS.
- LLMpsycho's ds4 post and the ds4 repository frame ds4 as a model-specific Metal runtime that serves DeepSeek V4 Flash on Apple Silicon, with a local CLI, an OpenAI-compatible server, and a disk-backed KV cache for long context on 128 GB+ Macs.
- niallohiggins called DeepSeek V4 Flash under DS4 on a 128 GB MacBook Pro Max the first local LLM setup that felt worth running day to day, and a LocalLLaMA user upgrading from V3 to V4 said long context and multi-file refactors improved enough to shift real codebase work onto V4 Flash.
- testingcatalog and itsPaulAi highlighted Multi-Token Prediction patches around Gemma 4, with reported speedups of roughly 40 to 50 percent for local runs on Apple Silicon.
- ClementDelangue said Hugging Face is now hosting 176,000 public GGUF models, with monthly creation jumping from about 5.1K per month across October through February to roughly 9.2K per month in March and April, which helps explain why local inference suddenly feels less niche.
You can browse ds4, skim the Lucebox Hub repo behind the 3090 Qwen post, and check the turboquant fork behind the Gemma 4 MTP demos. The interesting bit is not one benchmark chart. It is that the reports now span three different lanes at once: MacBooks with MPS, 128 GB Macs running a 284B-class model through a custom engine, and single 3090 setups that cross 100 tok/s on smaller coding-capable models.
Throughput bands
The evidence cluster has three practical speed bands, and each one maps to a different local setup.
- Single 3090: teortaxesTex's repost of cheenanet claimed Qwen3.6-27B at 120 to 200 tok/s, linking to Lucebox Hub.
- MacBook Pro with MPS: nummanali said Qwen 3.6 and Gemma 4 can hit 70+ tok/s on M3, M4, and M5 MacBook Pros with 32 GB RAM or more.
- 128 GB Apple Silicon for bigger models: niallohiggins said DS4 made DeepSeek V4 Flash usable enough for day-to-day local work on a 128 GB MacBook Pro Max.
The numbers are not apples-to-apples. One post is about a 27B Qwen run on Nvidia, one is about MacBook-class dense models, and one is about fitting DeepSeek V4 Flash into a custom Apple Silicon path. But the common story is simple: local throughput is now landing in the range where interactive coding and chat no longer feel like patience tests.
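If you want to sanity-check numbers like these on your own hardware, the lowest common denominator is wall-clock tokens per second against whatever server you run locally. A minimal sketch in Python, assuming an OpenAI-compatible endpoint on localhost (llama.cpp's llama-server exposes one; the port and model name here are placeholders, not values from any of the cited posts):

```python
import time
import requests

# Placeholder endpoint: any OpenAI-compatible local server
# (llama-server, DS4's server mode, etc.) should respond here.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "local",  # many local servers accept any string here
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
    "max_tokens": 256,
    "stream": False,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300).json()
elapsed = time.perf_counter() - start

done = resp["usage"]["completion_tokens"]  # servers following the spec report usage
print(f"{done} tokens in {elapsed:.1f}s = {done / elapsed:.1f} tok/s")
```

One caveat: this folds prompt processing into the denominator, while the posted figures almost certainly mean decode throughput. Streaming and subtracting time-to-first-token gets you closer to an apples-to-apples number.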
DS4
ds4 is the most concrete part of this story because it is an actual runtime, not just a benchmark screenshot. The repo describes a specialized inference engine for DeepSeek V4 Flash on Apple Silicon, built around Metal, with its own loader, prompt rendering, KV state management, and HTTP API glue.
Between LLMpsycho's post and the repository README, the notable features are:
- Local CLI for running V4 Flash directly
- OpenAI-compatible server API for agent tooling
- Disk-backed KV cache for long context
- A hardware target that starts around 128 GB RAM on Macs
- A model-specific design, not a generic runtime wrapper
That last point is the fun one. cedric_chee's linked repo summary describes DS4 as intentionally narrow, a one-model engine rather than a universal backend, which is a very different bet from llama.cpp-style generality.
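The OpenAI-compatible surface is what makes the narrow bet practical, because agent tooling does not care that the backend serves exactly one model. A minimal sketch of pointing a standard client at it, where the port, path, and model identifier are all assumptions rather than documented ds4 values:

```python
from openai import OpenAI

# Assumed values: check the ds4 README for the real server port and
# the model identifier it registers.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Refactor this loop to be iterative."}],
)
print(resp.choices[0].message.content)
```

Any harness that already speaks the OpenAI chat API should work unmodified, which is presumably why the server ships alongside the CLI.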
Gemma 4 and MTP
Gemma 4's local story is less about raw model quality than about squeezing more speed out of the same laptop. testingcatalog pointed to a patched llama.cpp path where Multi-Token Prediction lets a smaller assistant model draft ahead, while the main model verifies tokens. itsPaulAi summarized one run as 97 tok/s without MTP and 138 tok/s with MTP.
- testingcatalog described Gemma 4 26B with MTP draft tokens as about 40 percent faster on an M5 Max.
- itsPaulAi posted a 1.5x speedup claim on the same laptop class, from 97 to 138 tok/s, which works out closer to 1.4x.
- martinbowling's repost of adrgrondin separately cited a 30 to 40 percent gain for Gemma 31B on MLX Swift.
This is why the MacBook posts matter. The model did not change, the machine did not change, and the speed still moved a lot. Local AI progress is increasingly coming from runtime tricks, quantization work, and draft-token schemes, not just from new checkpoints.
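For intuition, here is a minimal sketch of the draft-and-verify loop this class of speedup builds on, with stand-in callables for the two models and greedy decoding assumed so verification is an exact token match:

```python
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],  # cheap model: guesses next token
    main_next: Callable[[List[int]], int],   # big model: authoritative next token
    k: int = 4,
    max_new: int = 32,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft k tokens cheaply with the small model.
        ctx, draft = list(tokens), []
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify against the main model, keeping the longest agreeing prefix.
        accepted = 0
        for t in draft:
            if main_next(tokens) == t:
                tokens.append(t)
                accepted += 1
            else:
                break
        # 3. On a mismatch, take one main-model token so the loop always advances.
        if accepted < len(draft):
            tokens.append(main_next(tokens))
    return tokens

# Toy demo: both "models" emit a counter, so every draft is accepted.
if __name__ == "__main__":
    count_up = lambda ctx: (ctx[-1] + 1) % 100
    print(speculative_decode([1, 2, 3], count_up, count_up, k=4, max_new=8))
```

The real speedup comes from step 2 checking all k draft tokens in a single batched forward pass of the main model, not one call per token as written here. Because rejected drafts are discarded, the output matches running the main model alone, which is why it shows up as a pure speed gain.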
Beyond coding evals
The most useful thread in the evidence pool is niallohiggins' Irish idiom test, because it checks something most local-model brag posts ignore: whether faster local models hold up on weird, low-resource knowledge rather than just codegen.
According to niallohiggins' criteria post, a good answer had to get six things right:
- Identify the phrase as Irish
- Know that gliomach means lobster
- Recognize the phrase as an Irish idiom
- Avoid treating it as the same idiom in Scottish Gaelic
- Adapt the meaning rather than port the proverb literally
- Handle dialect-sensitive grammar
The model ranking from niallohiggins' summary was blunt:
- GPT-5.5, best overall
- DeepSeek V4 Flash 2-bit via DS4, surprisingly close to GPT-5.5
- Qwen 3.6 27B, fluent but anchored on the wrong lexical meaning
- Gemma 4 26B-A4B, fluent but semantically drifted
That makes the current local-model moment look a little stranger than the usual tok/s race. Speed gains are real, but the Qwen and Gemma failure cases show how easy it still is for a model to sound polished while missing the actual cultural or linguistic anchor.
Real codebase use
The LocalLLaMA discussion is where the story stops sounding like hobbyist benchmarking. The DeepSeek V3 to V4 upgrade post ("Upgraded DeepSeek V3 to V4 across two codebases. Two of my agents broke.") described three improvements and two regressions after moving real work across two production codebases.
The reported improvements were:
- Better long-context retention past roughly 50K tokens
- A usable split where Flash handled about 80 percent of refactors and stack-trace work, with Pro reserved for harder planning
- More coherent multi-file refactors
The reported regressions were:
- Higher sensitivity to vague prompts
- Stricter tool-call schemas that broke existing agent setups
That lines up with niallohiggins' day-to-day verdict that DS4 was the first local setup worth using regularly. The community read is no longer just "can I make this run?" It is "which parts of my existing workflow survive the switch to local, and which harnesses need cleanup?"
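The tool-call regression is easy to picture. A hedged sketch of the failure class using OpenAI-style function definitions, with field names that are illustrative rather than DeepSeek's actual schema:

```python
# An agent's tool definition. Under a lax validator, a model emitting
# {"pathname": "src/"} or extra keys still got routed to the tool;
# a strict validator rejects the call outright.
tool = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite under a given path.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "verbose": {"type": "boolean"},
            },
            "required": ["path"],
            # Strict mode: unknown keys are rejected instead of ignored.
            "additionalProperties": False,
        },
    },
}
```

Agents that got by on forgiving parsing are exactly the setups the upgrade post describes breaking.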
GGUF supply
The supply side has changed just as fast as the runtimes. ClementDelangue said Hugging Face now hosts 176,000 public GGUF models, and his chart showed a clear step-up in monthly creation.
The monthly numbers from ClementDelangue's post break into two regimes:
- October 2025 to February 2026: about 5.1K new GGUF models per month on average
- March 2026: 8,749, up 55 percent month over month
- April 2026: 9,729, up another 11 percent
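The supply claim is straightforward to poke at yourself. A sketch using huggingface_hub, assuming GGUF repos carry the gguf tag (the Hub's models?library=gguf filter suggests they do) and that your library version accepts these sort parameters:

```python
from huggingface_hub import HfApi

api = HfApi()

# Most recently created public GGUF repos. "createdAt" mirrors the Hub
# HTTP API's sort key; treat it as an assumption if your version differs.
for model in api.list_models(filter="gguf", sort="createdAt",
                             direction=-1, limit=10):
    print(model.created_at, model.id)
```

Reproducing the monthly chart is then just a group-by on created_at over a longer listing.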
That context matters because it explains why these local posts are suddenly everywhere. Faster runtimes help, but so does a world where quantized variants appear quickly, Apple Silicon paths get first-class attention, and the packaging layer around open weights keeps compounding.