Google DeepMind released Gemma 4 as four Apache 2.0 multimodal models spanning edge to workstation sizes, with up to 256K context and native tool use. The models are aimed at local and self-hosted deployment, and day-0 support in AI Studio, Ollama, vLLM, SGLang, Hugging Face, and Baseten makes rollout practical.

You can read the official launch post, skim Hugging Face's unusually detailed architecture and deployment writeup, and check the vLLM launch post for the hardware matrix. The LiteRT overview already lists Gemma-4-E2B as a featured on-device model, and the Hacker News thread immediately turned into a local deployment discussion.
The family splits cleanly into two edge models and two larger reasoning models. Google positions E2B and E4B for mobile and real-time multimodal use, while 26B A4B and 31B target heavier local reasoning and coding workloads (Google's size chart).
Hugging Face's release post fills in the numbers Google only sketches in the thread: E2B and E4B run at 2B and 4B effective parameters, the 26B A4B MoE activates 4B parameters per token, and the 31B model is dense.
That effective-parameter framing is the trick behind the small models. The Hugging Face post says Gemma 4 keeps the per-layer embeddings design from Gemma-3n, so the edge models carry larger total parameter counts than their active runtime footprint suggests.
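To make that framing concrete, per-token decode cost tracks active parameters, not totals. The sketch below uses the rough rule of thumb of about 2 FLOPs per active parameter per decoded token; the rule and the comparison are illustrative, not figures from the release posts:

```python
def flops_per_token(active_params_billion):
    """Rough decode-cost proxy: ~2 FLOPs per active parameter per token."""
    return 2 * active_params_billion * 1e9

# Active/effective parameter counts quoted at launch:
active = {"E2B": 2, "E4B": 4, "26B A4B": 4, "31B": 31}
cost = {name: flops_per_token(b) for name, b in active.items()}
# By this proxy, the 26B-total MoE decodes at the same per-token cost as the
# 4B edge model, while the dense 31B model costs 7.75x as much per token.
```

This is why "26B A4B" reads like a 4B model at inference time even though all 26B weights still have to live in memory.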
Google's pitch is unusually explicit here: Gemma 4 is meant for agents that plan, navigate apps, search databases, and trigger APIs (tool use and context). LMSYS's day-0 SGLang note adds the serving details engineers actually care about, including dedicated reasoning-parser and tool-call-parser support for all four checkpoints (SGLang summary).
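None of the posts include a wire format, but the agent loop they describe reduces to: the model emits a structured tool call, the runtime executes it, and the result goes back into context. A minimal dispatch sketch, with a canned JSON reply standing in for real model output (the tool name, registry, and data here are all hypothetical):

```python
import json

MODELS = ["gemma4:e2b", "gemma4:e4b", "gemma4:26b", "gemma4:31b"]

# Hypothetical tool registry; a real agent would also expose schemas to the model.
TOOLS = {
    "search_db": lambda query: [row for row in MODELS if query in row],
}

def dispatch(model_reply: str):
    """Parse one JSON tool call emitted by the model and execute it."""
    call = json.loads(model_reply)
    return TOOLS[call["name"]](**call["arguments"])

# Canned reply standing in for real model output:
result = dispatch('{"name": "search_db", "arguments": {"query": "e4b"}}')
```

The parsers SGLang ships exist precisely so serving code does not have to hand-roll the `json.loads` step against raw model text.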
The capability bundle spans multimodal input, up to 256K context, and native tool use.
The official launch post and Hugging Face breakdown also describe a hybrid attention stack: alternating sliding-window and full-context layers, proportional RoPE on the global layers for long context, and shared KV cache in later layers to cut long-context inference cost.
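A toy sketch of what that interleave means at the mask level; the 1-in-4 global ratio and the window size below are placeholders, since neither post states the actual numbers:

```python
def attn_mask(seq_len, window=None):
    """Causal attention mask: entry [i][j] is True when query i may attend key j.
    window=None gives a full-context layer; an integer gives a sliding window."""
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Placeholder interleave: every fourth layer sees full context, the rest a window.
layers = [attn_mask(8) if k % 4 == 3 else attn_mask(8, window=4) for k in range(8)]
```

The payoff of the pattern is in the KV cache: sliding-window layers only need to retain `window` keys per head, so only the sparse global layers pay full long-context memory cost.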
This was a broad release, fast. Google put Gemma 4 into AI Studio and linked weight downloads from Hugging Face, Kaggle, and Ollama on announcement day (official availability). Ollama required version 0.20+ and exposed all four variants with simple ollama run tags on day 0 (Ollama day-0 support).
The rest of the serving surface landed quickly too: vLLM shipped day-0 support with a published hardware matrix, SGLang added its reasoning and tool-call parsers for all four checkpoints, and Baseten was listed among the day-0 providers alongside Hugging Face.
That breadth matters more than another leaderboard screenshot. Open models usually arrive as weights first and production ergonomics later. Gemma 4 showed up with the inference ecosystem already waiting for it.
The early benchmark story is mostly about parameter efficiency. Arena posted Gemma 4 31B at #3 among open models and the 26B A4B variant at #6 (Arena ranking). A follow-up Arena post framed the 31B model as 24 times smaller than GLM-5 and 34 times smaller than Kimi-K2.5 by total parameter count while landing in the same neighborhood (parameter comparison).
Artificial Analysis put more numbers on that picture. It reported Gemma 4 31B Reasoning at 85.7% on GPQA Diamond, just behind Qwen3.5 27B Reasoning at 85.8%, and noted that Gemma used about 1.2M output tokens versus roughly 1.5M and 1.6M for two nearby Qwen reasoning models (Artificial Analysis GPQA token usage).
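Those token counts reduce to a simple efficiency ratio. The figures below are the rounded values from that post, with the two Qwen models left unnamed as in the original:

```python
# Output tokens on GPQA Diamond, as reported (rounded):
tokens = {"gemma-4-31b-reasoning": 1.2e6, "qwen-model-1": 1.5e6, "qwen-model-2": 1.6e6}

gemma = tokens["gemma-4-31b-reasoning"]
fewer = {k: 1 - gemma / v for k, v in tokens.items() if v != gemma}
# Gemma emits roughly 20% and 25% fewer output tokens than the two Qwen baselines,
# which compounds directly into lower reasoning-mode latency and cost.
```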
The 26B A4B result is less dramatic, 79.2% on GPQA Diamond in that same post, but it is the more revealing model architecturally. According to the Hugging Face analysis, it activates only 4B parameters inside a 26B MoE, which is exactly the kind of trade that makes local reasoning models more deployable than their headline size suggests.
Gemma 4 did not stop at server inference. Hugging Face's CEO pointed to a WebGPU demo running fully in-browser through transformers.js (browser demo), and Google's LiteRT overview lists Gemma-4-E2B as a featured cross-platform edge model for Android, iOS, web, desktop, and even Raspberry Pi.
Phil Schmid's roundup adds the last piece: Gemma 4 support in Android Studio for local agentic coding, plus hooks into Vertex Model Garden, Vertex fine-tuning, ADK, Cloud Run, GKE, and TPU-backed vLLM paths (ecosystem extras). That is a bigger release footprint than the benchmark charts imply.
If there is a memorable detail from launch day, it is that Gemma 4 appeared simultaneously as a model family, a serving target, and an on-device runtime story.
Available in four sizes:
🔵 31B Dense & 26B MoE: state-of-the-art performance for advanced local reasoning tasks – like custom coding assistants or analyzing scientific datasets.
🔵 E4B & E2B (Edge): built for mobile with real-time text, vision, and audio processing.
Build autonomous agents that plan, navigate apps, and execute multi-step tasks – like searching databases or triggering APIs – with native tool use. With up to 256K context, it can analyze full codebases and retain complex action histories without losing focus.
Start building with Gemma 4 now in @GoogleAIStudio. You can also download the model weights from @HuggingFace, @Kaggle, or @Ollama. Find out more → goo.gle/41IC3lY
.@GoogleDeepMind Gemma 4 is here with state-of-the-art models targeting edge and workstations. Requires Ollama 0.20+ that is rolling out. 4 models:
4B Effective (E4B): ollama run gemma4:e4b
2B Effective (E2B): ollama run gemma4:e2b
26B (4B active MoE): ollama run gemma4:26b
Start experimenting with Gemma 4 now in @GoogleAIStudio or download the model weights from @HuggingFace, @Kaggle and @Ollama. Learn more → goo.gle/48ef4TB
Gemma-4-31B is now live in Text Arena - ranking #3 among open models (#27 overall), matching much larger models at 10× smaller scale! A significant jump from Gemma-3-27B (+87 pts). Highlights:
- #3 open (#27 overall), on par with the best open models Kimi-K2.5, Qwen-3.5-397b
Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we’re releasing them under an Apache 2.0 license. Here’s what’s new 🧵
Google has released Gemma 4, a new family of multimodal open-weight models including Gemma 4 E2B, Gemma 4 E4B, Gemma 4 31B and Gemma 4 26B A4B. @GoogleDeepMind's new Gemma 4 family introduces four multimodal models supporting text, image, and video inputs. We evaluated Gemma 4 …
You can run Gemma 4 100% locally in your browser thanks to HF transformers.js. That means 100% private and 100% free! @xenovacom created a demo for it here: huggingface.co/spaces/webml-c…
You can run Google's new Gemma 4 on mobile easily. I am using Gemma 4 version E2B on my Pixel 10 Pro. Here is all you need to do:
- Go to the App Store and install Google AI Edge Gallery. If you already have it, just update it.
- From there, you can install the model directly and …