Update, April 6, 2026

Gemma 4 26B-A4B benchmarks at ~40 tokens/s in Mac code-agent tests

A Hacker News practitioner reports Gemma 4 26B-A4B running at close to 40 tokens per second in a code-agent harness on Mac hardware, and Unsloth published a free Colab fine-tuning flow. Use the local benchmark as a practical reference point, and the Colab path if you want task-specific tuning without added cost.



TL;DR

Google DeepMind's launch page positions Gemma 4 as a four-model open family (E2B, E4B, 26B-A4B, and 31B) with multimodal support, tool use, and 140-plus languages under Apache 2.0. Google's official announcement adds a 256K context window and the main benchmark claims.
In the strongest practitioner datapoint so far, the HN roundup cites a Mac code-agent test where the 26B-A4B variant ran at roughly 40 tokens per second, with the commenter calling it well ahead of recent open-weight rivals in that harness.
Small-model results also look unusually usable: according to the HN discussion digest, Gemma-4-E4B-it scored 15 out of 25 on one SQL benchmark, while the 4-bit E2B variant still reached 12 out of 25.
Setup details are already converging in the wild. The same HN roundup points to community quants plus inference guidance, including temperature 1.0, top_p 0.95, top_k 64, and an EOS token of <turn|>.
itsPaulAi's Unsloth post shows a free Google Colab flow for fine-tuning Gemma 4 through Unsloth Studio, which lines up with Unsloth's Gemma 4 docs and the project's Gemma 4 release notes.

You can jump from Google's main launch post to the more deployment-focused AI Edge writeup, skim the HN thread for real hardware notes, and then open Unsloth's Gemma 4 page if the interesting bit for you is not inference but getting a tuned variant running quickly.

Gemma 4's shape

Gemma 4 — Google DeepMind

Gemma 4 introduces Google DeepMind's most intelligent open models, built from Gemini 3 research for maximum intelligence-per-parameter. Available in E2B, E4B for mobile/IoT efficiency with audio/vision support; 26B and 31B for advanced reasoning on personal computers. Features agentic workflows, multimodal reasoning, 140+ languages, fine-tuning. Leads benchmarks like Arena AI (31B: 1452), MMMLU, MMMU Pro. Emphasizes safety with rigorous protocols. Try in Google AI Studio/Edge; download weights.

The release is unusually broad for an open family. The DeepMind page and Google's announcement split the lineup into two small effective-parameter models (E2B and E4B) for phones and edge devices, one 26B Mixture-of-Experts model, and one 31B dense model.

Google is also pushing Gemma 4 as more than a chat model. The official materials emphasize multimodal input, agentic workflows, and local deployment paths, while the companion Google Developers post highlights AICore, Google AI Edge, and LiteRT-LM as the intended on-device stack.

The Mac harness number engineers actually cared about

Discussion around Google releases Gemma 4 open models

Thread discussion highlights:
- d4rkp4ttern on code-agent speed on Mac: For token-generation speed, a challenging test is to see how it performs in a code-agent harness like Claude Code... the 26B-A4B variant is head and shoulders above recent open-weight models... ~40 tokens/sec... far better than Qwen3.5-35B-A3B.
- nl on small-model benchmark results: Gemma-4-E4B-it scored 15/25 on my sql-benchmark... the naming is a bit odd... E4B is "4.5B effective, 8B with embeddings"... Gemma-4-E2B (4bit quant) scored 12/25... That's a great score for a small model.
- danielhanchen on quants and setup guidance: Thinking / reasoning + multimodal + tool calling. We made some quants at Hugging Face for folks to run them... Also note to use temperature = 1.0, top_p = 0.95, top_k = 64 and the EOS is "<turn|>".

The headline datapoint came from a practitioner test, not from a benchmark chart. In the HN summary, commenter d4rkp4ttern said the 26B-A4B model delivered about 40 tokens per second in a Claude Code-style harness on Mac hardware, and called it far better than Qwen3.5-35B-A3B in that setup.

That matters mostly because the workload is closer to how engineers abuse local models in practice: long iterative generations, tool calls, and constant prompt churn. The same discussion digest also notes that people were immediately comparing runtimes, including Ollama, llama-server, LiteRT-LM, and Modular MAX, instead of only repeating the launch benchmarks.
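For readers who want to reproduce a tokens-per-second figure on their own machine, the arithmetic is just generated tokens divided by wall-clock time. The sketch below is a minimal illustration, not the commenter's harness: `generate` is a hypothetical callable standing in for whatever local runtime you use, and `fake_generate` exists only so the example runs on its own.

```python
import time
from typing import Callable, List


def throughput(num_tokens: int, seconds: float) -> float:
    """Return generated tokens per second of wall-clock time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


def measure(generate: Callable[[str], List[str]], prompt: str) -> float:
    """Time a single generation call and report its throughput.

    `generate` is an assumed interface: it takes a prompt and returns
    the list of generated tokens. Adapt it to your runtime's client.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return throughput(len(tokens), elapsed)


def fake_generate(prompt: str) -> List[str]:
    # Hypothetical stand-in for a model call: sleeps briefly, returns 20 tokens.
    time.sleep(0.05)
    return ["tok"] * 20


if __name__ == "__main__":
    print(f"{measure(fake_generate, 'hello'):.1f} tokens/s")
```

In an agent harness, a single call is a weak sample. Averaging over many turns, and separating prompt processing from generation time if your runtime reports both, gives a number closer to what the thread is describing.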

Small variants and quantized setups look stronger than expected

Google releases Gemma 4 open models

Gemma 4 is relevant as a new open model family for deployment and experimentation: commenters are testing its speed in agent harnesses, comparing it to Qwen-class models, sharing quantized builds, and discussing runtimes like Ollama, LiteRT-LM, llama-server, and Modular MAX. The useful signal is about how these models perform in real local/edge workflows rather than just benchmark headlines.

A few practical details surfaced fast in the thread:

The HN discussion cites an independent SQL-style test where E4B-it scored 15 out of 25.
The same source says the 4-bit E2B variant still managed 12 out of 25.
The HN roundup links community quants from Unsloth and records recommended sampling settings: temperature 1.0, top_p 0.95, top_k 64.
Unsloth's docs estimate 5GB RAM for E2B and E4B at 4-bit, about 18GB for 26B-A4B at 4-bit, and about 20GB for 31B at 4-bit.
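The 4-bit RAM estimates in the last item can be turned into a quick fit check. This is a rough sketch that takes the figures exactly as quoted above. The `headroom_gb` parameter is an assumption added for illustration, to reserve memory for the OS, context, and other apps; it is not part of Unsloth's guidance.

```python
from typing import Dict, List

# Approximate 4-bit RAM estimates in GB, as quoted in this article
# from Unsloth's docs. Treat them as ballpark figures.
RAM_4BIT_GB: Dict[str, float] = {
    "E2B": 5,
    "E4B": 5,
    "26B-A4B": 18,
    "31B": 20,
}


def variants_that_fit(available_gb: float, headroom_gb: float = 0.0) -> List[str]:
    """List the variants whose quoted 4-bit estimate fits the memory budget.

    headroom_gb is a hypothetical safety margin subtracted from the
    available memory before comparing.
    """
    budget = available_gb - headroom_gb
    return [name for name, need in RAM_4BIT_GB.items() if need <= budget]


if __name__ == "__main__":
    for ram in (8, 16, 24, 32):
        print(ram, "GB:", variants_that_fit(ram, headroom_gb=4))
```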

Those memory numbers help explain why the conversation got so practical so quickly. The family spans phone-class experiments, Apple Silicon laptops, and larger local boxes without changing model family.
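The sampling settings recorded above (temperature 1.0, top_p 0.95, top_k 64) are easier to reason about with a toy implementation. The sketch below shows one common way to combine the three filters over a small dictionary of logits. It is an illustration only: real runtimes may apply the steps in a different order or renormalize between them, and the function name and token dictionary are hypothetical.

```python
import math
import random
from typing import Dict, Optional


def sample_token(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_k: int = 64,
    top_p: float = 0.95,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one token using temperature, top-k, then top-p filtering."""
    rng = rng or random.Random()

    # Temperature-scaled softmax; subtract the max for numerical stability.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability
    # reaches top_p (computed on the pre-filter probabilities).
    kept = []
    cumulative = 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    weights = [prob for _, prob in kept]
    # random.choices normalizes the weights, so no explicit renormalization.
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = {"SELECT": 3.0, "WITH": 2.0, "INSERT": 0.5, "DROP": -2.0}
    print(sample_token(toy_logits, rng=random.Random(0)))
```

With temperature at 1.0 the logits pass through unscaled, so the practical effect of these defaults is to trim only the long tail of unlikely tokens before sampling.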

Unsloth's free Colab path

itsPaulAi's demo is lightweight but useful: open the notebook, launch Unsloth Studio, pick a model and dataset, then start training. That is a much shorter path from launch-day curiosity to a task-specific Gemma build than most open-model releases get in week one.

The supporting product docs are already in place. Unsloth's Gemma 4 page says Studio can run GGUFs and fine-tune Gemma 4, while the project's v0.1.35-beta notes add same-week support for all four sizes. Separately, one early user said Gemma 4 had already replaced cloud-hosted models for private daily chats, with agentic coding as the remaining exception.