Community testers report Gemma 4 26B-A4B hitting about 40 tok/s on an M1 Max, while a separate RTX 4090 comparison scored it 13-5 over Qwen 3.5 27B across 18 tasks. The results add local throughput and task-quality data that the launch materials did not include.

DeepMind's Gemma 4 launch post and model page are the official story, but the more useful bits are in the follow-on material. Hugging Face's deployment writeup says the family already landed across transformers, llama.cpp, MLX, WebGPU, and Mistral.rs. The main HN thread added early Mac throughput numbers and quantization tips, and the RTX 4090 Reddit comparison, though rougher, adds task-by-task local quality data that the launch materials did not.
Gemma 4 — Google DeepMind
1.8k upvotes · 473 comments
The official package is unusually local-first for a flagship open release. According to DeepMind's model page, the line splits cleanly by deployment tier: E2B and E4B for edge and mobile, 26B-A4B as the MoE middleweight, and 31B as the dense quality-first checkpoint.
Hugging Face's release post fills in the practical details. It lists 128K context for E2B and E4B, 256K for 26B-A4B and 31B, and says the small models take audio input while the whole family supports image and text input with text output.
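Those lineup figures are easy to mis-transcribe, so a small lookup table helps. A minimal sketch: the tier names are shorthand rather than official checkpoint IDs, and the context windows assume the usual reading of "128K" and "256K" as 1,024-token units.

```python
# Gemma 4 lineup as summarized in Hugging Face's release post.
# Tier names are shorthand, not official checkpoint identifiers;
# context figures assume "128K" / "256K" mean 1,024-token units.
GEMMA4_LINEUP = {
    "E2B":     {"moe": False, "context": 128 * 1024, "audio_in": True},
    "E4B":     {"moe": False, "context": 128 * 1024, "audio_in": True},
    "26B-A4B": {"moe": True,  "context": 256 * 1024, "audio_in": False},
    "31B":     {"moe": False, "context": 256 * 1024, "audio_in": False},
}

def max_context(tier: str) -> int:
    """Return the reported context window, in tokens, for a tier."""
    return GEMMA4_LINEUP[tier]["context"]
```

The `audio_in` flags follow the release post's phrasing that the small models take audio input, while the whole family handles image and text input with text output.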
Discussion around Google releases Gemma 4 open models
1.8k upvotes · 473 comments
The most concrete post-launch datapoint in the evidence pool is the M1 Max claim. In the HN discussion digest, a commenter said the 26B-A4B model was "head and shoulders above" recent open-weight models on an M1 Max, running at roughly 40 tok/s in a Claude Code-style harness on large-context prompts.
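Tok/s claims like this are cheap to check locally. A minimal timing harness, with `generate` standing in for whatever runtime call you actually use (it is a placeholder, not a real API):

```python
import time

def decode_tok_per_s(generate, prompt: str, max_tokens: int = 256) -> float:
    """Time one generation call and return overall tokens per second.

    `generate` is any callable that runs the model and returns the number
    of tokens it produced. Prefill and decode are lumped together here, so
    very long prompts (like the large-context runs in the thread) will pull
    the figure below the pure decode rate.
    """
    start = time.perf_counter()
    n_tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

For an apples-to-apples comparison with the thread's numbers, keep the prompt length and sampling settings fixed across runs.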
That lines up with the architectural pitch. Hugging Face's deployment note describes 26B-A4B as a mixture-of-experts model, and the thread repeatedly treats it as the local sweet spot because only a small active slice of its experts runs for each token.
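Assuming the A4B suffix carries its usual meaning of roughly 4B active parameters out of 26B total (a reading of the naming convention, not a figure from the launch materials), the per-token saving is simple arithmetic:

```python
TOTAL_PARAMS = 26e9   # full expert pool; all of it must still fit in memory
ACTIVE_PARAMS = 4e9   # parameters routed per token, assuming "A4B" = 4B active

# Per-token decode compute scales with the active slice, so the MoE runs
# a fraction of the FLOPs of a dense 26B model while keeping the full
# parameter pool's capacity -- the core of the local sweet-spot argument.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active per token: {active_fraction:.0%}")
```

Under that assumption, each token touches about 15% of the weights, though the full 26B still has to be resident in memory.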
The Reddit test is not a standardized benchmark, but it is more specific than a vibes post. The author says they ran 18 repeated business and operator tasks on a local workstation with an RTX 4090, i9-14900KF, 64 GB RAM, Ubuntu 25.10, and Ollama, then scored Gemma 4 26B over Qwen 3.5 27B by 13 to 5.
The claimed win pattern is useful because it separates speed from output behavior. In that writeup, Gemma won summaries, positioning, objections, hooks, campaign rounds, and a technical blueprint test, while Qwen's wins clustered around broader synthesis, emotional framing, and JSON compilation.
The HN core summary captures where the conversation landed fast: throughput, context handling, and which stack to run. It specifically names llama-server, LiteRT-LM, Ollama, and vendor runtimes as the comparison layer engineers cared about.
One HN commenter also pointed people to community quants and sampling settings on Hugging Face and Unsloth, while the latest thread delta says newer comments were mostly about acceptable latency in conversational use and day-to-day deployment UX. That is a pretty good sign Gemma 4 already moved from launch artifact to tuning target.
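Much of that quant tuning comes down to simple weight-memory arithmetic. A back-of-envelope sketch, ignoring quant block overhead, activations, and KV cache:

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weights-only memory in GiB at a given quantization width."""
    return n_params * bits_per_param / 8 / 2**30

# Weights-only footprint of a 26B-parameter model at common quant levels:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gib(26e9, bits):5.1f} GiB")
```

At 4 bits the 26B weights land around 12 GiB, consistent with the single-RTX-4090 runs in the Reddit test once context and runtime overhead are added on top.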