Community testers report Gemma 4 26B-A4B hitting about 40 tok/s on an M1 Max, while a separate RTX 4090 comparison scored it 13-5 over Qwen 3.5 27B across 18 tasks. The results add local throughput and task-quality data that the launch materials did not include.

DeepMind's Gemma 4 launch post and model page are the official story, but the more useful bits are in the follow-on material. Hugging Face's deployment writeup says the family already landed across transformers, llama.cpp, MLX, WebGPU, and Mistral.rs. The main HN thread added early Mac throughput numbers and quantization tips, and the RTX 4090 Reddit comparison, though rougher, adds task-by-task local quality data that the launch materials did not.
Gemma 4 — Google DeepMind
1.8k upvotes · 473 comments
The official package is unusually local-first for a flagship open release. According to DeepMind's model page, the line splits cleanly by deployment tier: E2B and E4B for edge and mobile, 26B-A4B as the MoE middleweight, and 31B as the dense quality-first checkpoint.
Hugging Face's release post fills in the practical details. It lists 128K context for E2B and E4B, 256K for 26B-A4B and 31B, and says the small models take audio input while the whole family supports image and text input with text output.
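Those lineup figures are easy to mis-transcribe, so a small lookup table helps. A minimal sketch: the tier names are shorthand rather than official checkpoint IDs, and the context windows assume the usual reading of "128K" and "256K" as 1,024-token units.

```python
# Gemma 4 lineup as summarized in Hugging Face's release post.
# Tier names are shorthand, not official checkpoint identifiers;
# context figures assume "128K" / "256K" mean 1,024-token units.
GEMMA4_LINEUP = {
    "E2B":     {"moe": False, "context": 128 * 1024, "audio_in": True},
    "E4B":     {"moe": False, "context": 128 * 1024, "audio_in": True},
    "26B-A4B": {"moe": True,  "context": 256 * 1024, "audio_in": False},
    "31B":     {"moe": False, "context": 256 * 1024, "audio_in": False},
}

def max_context(tier: str) -> int:
    """Return the reported context window, in tokens, for a tier."""
    return GEMMA4_LINEUP[tier]["context"]
```

The `audio_in` flags follow the release post's phrasing that the small models take audio input, while the whole family handles image and text input with text output.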
Discussion around Google releases Gemma 4 open models
1.8k upvotes · 473 comments
The most concrete post-launch datapoint in the evidence pool is the M1 Max claim. In the HN discussion digest, a commenter said the 26B-A4B model was "head and shoulders above" recent open-weight models on an M1 Max, running at roughly 40 tok/s in a Claude Code-style harness on large-context prompts.
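Tok/s claims like this are cheap to check locally. A minimal timing harness, with `generate` standing in for whatever runtime call you actually use (it is a placeholder, not a real API):

```python
import time

def decode_tok_per_s(generate, prompt: str, max_tokens: int = 256) -> float:
    """Time one generation call and return overall tokens per second.

    `generate` is any callable that runs the model and returns the number
    of tokens it produced. Prefill and decode are lumped together here, so
    very long prompts (like the large-context runs in the thread) will pull
    the figure below the pure decode rate.
    """
    start = time.perf_counter()
    n_tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

For an apples-to-apples comparison with the thread's numbers, keep the prompt length and sampling settings fixed across runs.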
That lines up with the architectural pitch. Hugging Face's deployment note describes 26B-A4B as a mixture-of-experts model, and the thread repeatedly treats it as the local sweet spot because only a small active slice of its experts runs for each token.
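Assuming the A4B suffix carries its usual meaning of roughly 4B active parameters out of 26B total (a reading of the naming convention, not a figure from the launch materials), the per-token saving is simple arithmetic:

```python
TOTAL_PARAMS = 26e9   # full expert pool; all of it must still fit in memory
ACTIVE_PARAMS = 4e9   # parameters routed per token, assuming "A4B" = 4B active

# Per-token decode compute scales with the active slice, so the MoE runs
# a fraction of the FLOPs of a dense 26B model while keeping the full
# parameter pool's capacity -- the core of the local sweet-spot argument.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active per token: {active_fraction:.0%}")
```

Under that assumption, each token touches about 15% of the weights, though the full 26B still has to be resident in memory.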
The Reddit test is not a standardized benchmark, but it is more specific than a vibes post. The author says they ran 18 repeated business and operator tasks on a local workstation with an RTX 4090, i9-14900KF, 64 GB RAM, Ubuntu 25.10, and Ollama, then scored Gemma 4 26B over Qwen 3.5 27B by 13 to 5.
The claimed win pattern is useful because it separates speed from output behavior. In that writeup, Gemma won summaries, positioning, objections, hooks, campaign rounds, and a technical blueprint test, while Qwen's wins clustered around broader synthesis, emotional framing, and JSON compilation.
The HN core summary captures where the conversation landed fast: throughput, context handling, and which stack to run. It specifically names llama-server, LiteRT-LM, Ollama, and vendor runtimes as the comparison layer engineers cared about.
One HN commenter also pointed people to community quants and sampling settings on Hugging Face and Unsloth, while the latest thread delta says newer comments were mostly about acceptable latency in conversational use and day-to-day deployment UX. That is a pretty good sign Gemma 4 already moved from launch artifact to tuning target.
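Much of that quant tuning comes down to simple weight-memory arithmetic. A back-of-envelope sketch, ignoring quant block overhead, activations, and KV cache:

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weights-only memory in GiB at a given quantization width."""
    return n_params * bits_per_param / 8 / 2**30

# Weights-only footprint of a 26B-parameter model at common quant levels:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gib(26e9, bits):5.1f} GiB")
```

At 4 bits the 26B weights land around 12 GiB, consistent with the single-RTX-4090 runs in the Reddit test once context and runtime overhead are added on top.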