HN practitioners report Gemma 4 26B-A4B at around 40 tokens per second in code-agent harnesses on Mac-class hardware, and Unsloth has published a free Colab fine-tuning flow. Use the local benchmark as a practical reference point, and the Colab path if you want task-specific tuning at no added cost.

You can jump from Google's main launch post to the more deployment-focused AI Edge writeup, skim the HN thread for real hardware notes, and then open Unsloth's Gemma 4 page if your interest is not inference but getting a tuned variant running quickly.
Gemma 4 — Google DeepMind
1.8k upvotes · 472 comments
The release is unusually broad for an open family. The DeepMind page and Google's announcement split the lineup into two small, efficient models for phones and edge devices, one 26B Mixture-of-Experts model, and one 31B dense model.
Google is also pushing Gemma 4 as more than a chat model. The official materials emphasize multimodal input, agentic workflows, and local deployment paths, while the companion Google Developers post highlights AICore, Google AI Edge, and LiteRT-LM as the intended on-device stack.
Discussion around Google releases Gemma 4 open models
The headline datapoint came from a practitioner test, not a benchmark chart. In the HN summary, commenter d4rkp4ttern reported that the 26B-A4B model delivered about 40 tokens per second in a Claude Code-style harness on Mac hardware, calling it clearly faster than Qwen3.5-35B-A3B in the same setup.
That matters mostly because the workload is closer to how engineers abuse local models in practice: long iterative generations, tool calls, and constant prompt churn. The same discussion digest also notes that people were immediately comparing runtimes, including Ollama, llama-server, LiteRT-LM, and Modular MAX, instead of only repeating the launch benchmarks.
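Throughput numbers like d4rkp4ttern's are easy to reproduce with any streaming runtime. A minimal sketch of the measurement itself, assuming you can iterate over generated tokens (the `fake_stream` generator here is a hypothetical stand-in for whatever token iterator your runtime exposes, whether Ollama, llama-server, or LiteRT-LM bindings):

```python
import time
from typing import Iterable, Tuple


def measure_throughput(tokens: Iterable[str]) -> Tuple[int, float]:
    """Consume a token stream and return (token_count, tokens_per_second)."""
    start = time.perf_counter()
    count = 0
    for _ in tokens:
        count += 1
    elapsed = time.perf_counter() - start
    return count, count / elapsed if elapsed > 0 else 0.0


def fake_stream(n: int, delay: float = 0.001):
    """Stand-in for a real runtime's token iterator (illustrative only)."""
    for i in range(n):
        time.sleep(delay)  # simulate per-token decode latency
        yield f"tok{i}"


count, tps = measure_throughput(fake_stream(200))
print(f"{count} tokens at {tps:.1f} tok/s")
```

In a real harness you would pass the streaming response from your runtime instead of `fake_stream`; the measurement logic is the same either way.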
Google releases Gemma 4 open models
Memory-footprint details surfaced fast in the thread, and those numbers help explain why the conversation turned hands-on so quickly: the family spans phone-class experiments, Apple Silicon laptops, and larger local boxes without leaving the model family.
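The memory discussion follows from simple arithmetic: a model's weight footprint is roughly parameter count times bits per weight, divided by eight. A back-of-the-envelope sketch (my own estimates, not figures from the thread; real footprints add KV cache and runtime overhead on top):

```python
def weight_footprint_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: params * bits / 8, ignoring overhead."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Illustrative estimates for the two larger Gemma 4 sizes at 4-bit
# quantization. For the MoE model, the full parameter count must fit
# in memory even though only ~4B are active per token.
for name, params in [("26B MoE", 26), ("31B dense", 31)]:
    print(f"{name}: ~{weight_footprint_gib(params, 4):.1f} GiB at Q4")
```

By this rough measure both larger models land in the 12-15 GiB range at 4-bit, which is why Apple Silicon laptops with 24-32 GB of unified memory kept coming up in the thread.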
itsPaulAi's demo is lightweight but useful: open the notebook, launch Unsloth Studio, pick a model and dataset, then start training. That is a much shorter path from launch-day curiosity to a task-specific Gemma build than most open-model releases get in week one.
The supporting product docs are already in place. Unsloth's Gemma 4 page says Studio can run GGUFs and fine-tune Gemma 4, while the project's v0.1.35-beta notes add same-week support for all four sizes. Separately, one early user said Gemma 4 had already replaced cloud-hosted models for private daily chats, with agentic coding as the remaining exception.