Skip to content
AI Primer
update

Gemma 4 26B-A4B benchmarks at ~40 tokens/s in Mac code-agent tests

HN practitioners report Gemma 4 26B-A4B near 40 tokens per second in code-agent harnesses on Mac-class hardware, and Unsloth published a free Colab fine-tuning flow. Use the local benchmark as a practical reference and the Colab path if you want task-specific tuning without added cost.

4 min read
Gemma 4 26B-A4B benchmarks at ~40 tokens/s in Mac code-agent tests
Gemma 4 26B-A4B benchmarks at ~40 tokens/s in Mac code-agent tests

TL;DR

  • Google DeepMind's launch page positions Gemma 4 as a four-model open family, E2B, E4B, 26B-A4B, and 31B, with multimodal support, tool use, and 140-plus languages under Apache 2.0; Official announcement adds a 256K context window and the main benchmark claims.
  • In the strongest practitioner datapoint so far, the HN core roundup cites a Mac code-agent test where the 26B-A4B variant ran at roughly 40 tokens per second, with the commenter calling it well ahead of recent open-weight rivals in that harness.
  • Small-model results also look unusually usable: according to the HN discussion digest, Gemma-4-E4B-it scored 15 out of 25 on one SQL benchmark, while the 4-bit E2B variant still reached 12 out of 25.
  • Setup details are already converging in the wild. The same HN roundup points to community quants plus inference guidance, including temperature 1.0, top_p 0.95, top_k 64, and an EOS token of <turn|>.
  • itsPaulAi's Unsloth post shows a free Google Colab flow for fine-tuning Gemma 4 through Unsloth Studio, which lines up with Unsloth's Gemma 4 docs and the project's Gemma 4 release notes.

You can jump from Google's main launch post to the more deployment-focused AI Edge writeup, skim the HN thread for real hardware notes, and then open Unsloth's Gemma 4 page if the interesting bit for you is not inference but getting a tuned variant running quickly.

Gemma 4's shape

Y
Hacker News

Gemma 4 — Google DeepMind

1.8k upvotes · 472 comments

The release is unusually broad for an open family. The DeepMind page and Google's announcement split the lineup into two tiny effective models for phones and edge devices, one 26B Mixture-of-Experts model, and one 31B dense model.

Google is also pushing Gemma 4 as more than a chat model. The official materials emphasize multimodal input, agentic workflows, and local deployment paths, while the companion Google Developers post highlights AICore, Google AI Edge, and LiteRT-LM as the intended on-device stack.

The Mac harness number engineers actually cared about

Y
Hacker News

Discussion around Google releases Gemma 4 open models

1.8k upvotes · 472 comments

The headline datapoint came from a practitioner test, not from a benchmark chart. In the HN summary, commenter d4rkp4ttern said the 26B-A4B model delivered about 40 tokens per second in a Claude Code style harness on Mac hardware, and called it clearly faster than Qwen3.5-35B-A3B in that setup.

That matters mostly because the workload is closer to how engineers abuse local models in practice: long iterative generations, tool calls, and constant prompt churn. The same discussion digest also notes that people were immediately comparing runtimes, including Ollama, llama-server, LiteRT-LM, and Modular MAX, instead of only repeating the launch benchmarks.

Small variants and quantized setups look stronger than expected

Y
Hacker News

Google releases Gemma 4 open models

1.8k upvotes · 472 comments

Two practical details surfaced fast in the thread:

  • HN discussion cites an independent SQL-style test where E4B-it scored 15 out of 25.
  • The same source says the 4-bit E2B variant still managed 12 out of 25.
  • HN core links community quants from Unsloth and records recommended sampling settings: temperature 1.0, top_p 0.95, top_k 64.
  • Unsloth's docs estimate 5GB RAM for E2B and E4B at 4-bit, about 18GB for 26B-A4B at 4-bit, and about 20GB for 31B at 4-bit.

Those memory numbers help explain why the conversation got so practical so quickly. The family spans phone-class experiments, Apple Silicon laptops, and larger local boxes without changing model family.

Unsloth's free Colab path

itsPaulAi's demo is lightweight but useful: open the notebook, launch Unsloth Studio, pick a model and dataset, then start training. That is a much shorter path from launch-day curiosity to a task-specific Gemma build than most open-model releases get in week one.

The supporting product docs are already in place. Unsloth's Gemma 4 page says Studio can run GGUFs and fine-tune Gemma 4, while the project's v0.1.35-beta notes add same-week support for all four sizes. Separately, one early user said Gemma 4 had already replaced cloud-hosted models for private daily chats, with agentic coding as the remaining exception.

Share on X