Google DeepMind released Gemma 4 as a family of four open models with up to 256K context, multimodal inputs, and native tool-calling for local agent workflows. Day-0 support across serving stacks and benchmark wins make it ready for phones, laptops, and server GPUs.

You can read Google's announcement, skim the official model overview, and browse Hugging Face's ecosystem post. There is also an Android Studio post on offline agentic coding, plus a WebGPU demo showing Gemma 4 running fully in the browser.
The release breaks into two hardware tiers. According to Google DeepMind, E2B and E4B target mobile and edge workloads, while 26B A4B and 31B aim at local reasoning on PCs and workstations.
The naming also signals three different architectures. The official docs describe E2B and E4B as effective-parameter models for ultra-mobile, edge, and browser deployment, 31B as a dense model, and 26B A4B as a Mixture-of-Experts system tuned for high-throughput reasoning.
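The naming scheme can be decoded numerically. A minimal sketch, assuming the conventional reading where "E" marks effective parameters and "A4B" means roughly 4B parameters active per token out of 26B total; the exact counts below are illustrative assumptions, not official spec-sheet figures:

```python
# Hypothetical decoding of the Gemma 4 naming scheme.
# Parameter counts are illustrative assumptions, not official figures.
variants = {
    "E2B":     {"total_b": 2.0,  "active_b": 2.0},   # effective-parameter edge model
    "E4B":     {"total_b": 4.0,  "active_b": 4.0},   # effective-parameter edge model
    "26B A4B": {"total_b": 26.0, "active_b": 4.0},   # MoE: ~4B active per token
    "31B":     {"total_b": 31.0, "active_b": 31.0},  # dense: all weights active
}

for name, v in variants.items():
    ratio = v["active_b"] / v["total_b"]
    print(f"{name:>8}: {v['active_b']:.0f}B active / {v['total_b']:.0f}B total "
          f"({ratio:.0%} of weights per token)")
```

Under that reading, the MoE model touches only about 15% of its weights per token, which is why it can chase 31B-class quality at a fraction of the per-token compute.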
A few concrete numbers matter more than the marketing:
Google's best headline is parameter efficiency. Arena ranked Gemma 4 31B third among open models with a score of 1452, and Arena's follow-up called it the top-ranked US open model while noting it is far smaller than GLM-5 and Kimi-K2.5.
The 26B A4B model is the more interesting entry for engineers who care about throughput. Arena placed it sixth among open models at 1441, close enough to the 31B result that the MoE design looks like more than a packaging trick.
Artificial Analysis added a second useful readout: Gemma 4 31B scored 85.7% on GPQA Diamond, just behind Qwen3.5 27B in its sub-40B class, and did it with about 1.2 million output tokens. That token-efficiency chart is one of the few pieces of outside evidence that says something new beyond Arena Elo.
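The token-efficiency framing can be made concrete with back-of-envelope arithmetic on the two reported figures (85.7% GPQA Diamond, roughly 1.2 million output tokens). The "accuracy points per million output tokens" metric below is a hypothetical readout for illustration, not Artificial Analysis's own formula:

```python
# Back-of-envelope token efficiency from the reported figures.
gpqa_score = 0.857      # reported GPQA Diamond accuracy for Gemma 4 31B
output_tokens = 1.2e6   # approximate output tokens used across the eval

# Hypothetical comparison metric: accuracy points per million output tokens.
efficiency = (gpqa_score * 100) / (output_tokens / 1e6)
print(f"{efficiency:.1f} accuracy points per million output tokens")
```

The point of a metric like this is that two models with the same Elo or GPQA score can differ sharply in how many tokens they burn to get there, which dominates serving cost.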
The fastest sign of real demand was how much inference plumbing appeared on day one. Ollama published run commands for all four variants, SGLang posted server examples with Gemma-specific reasoning and tool-call parsers, and vLLM shipped a containerized quick start.
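Since SGLang and vLLM both expose OpenAI-compatible endpoints, a local request has the same shape regardless of backend. A hedged sketch that only builds the payload; the model tag "gemma-4-26b-a4b" is an assumed placeholder, and the actual day-0 commands live in each project's docs:

```python
import json

# Hypothetical request body for an OpenAI-compatible local server
# (both vLLM and SGLang serve /v1/chat/completions).
# The model tag is an assumed placeholder, not an official identifier.
payload = {
    "model": "gemma-4-26b-a4b",
    "messages": [
        {"role": "user", "content": "Summarize this log file in three bullets."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

body = json.dumps(payload)
print(body)
```

Because the wire format is shared, swapping Ollama for vLLM or SGLang is mostly a matter of changing the base URL and model tag, which is a big part of why all three could ship recipes on day one.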
The Hugging Face ecosystem post makes the same point from another angle. It positioned Gemma 4 as already wired into agent frameworks, inference engines, and fine-tuning libraries, which is why this launch produced deployment recipes immediately instead of waiting on community ports.
That stack coverage spans very different targets:
The edge story is broader than just "small enough to fit." Phil Schmid pointed to Android Studio support for Gemma 4 as a local agent for app development, and the Android Developers post frames that as offline agentic coding inside the IDE.
Google's LiteRT-LM overview fills in the rest: Gemma 4 can run across Android, iOS, web, desktop, and IoT targets, with GPU and NPU acceleration, multimodal inputs, and constrained function calling for on-device agent workflows. That is a much more ambitious deployment surface than the usual "open weights on a workstation" launch.
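The "constrained function calling" piece is easiest to picture as a schema the runtime enforces on decoded output. A sketch using the common JSON-schema convention for tool declarations; the tool name and shapes here are illustrative, and LiteRT-LM's actual on-device API may differ:

```python
import json

# A tool declaration in the common JSON-schema convention for function calling.
# Names are illustrative; LiteRT-LM's actual API may use a different shape.
set_alarm_tool = {
    "name": "set_alarm",
    "description": "Set a device alarm at a given local time.",
    "parameters": {
        "type": "object",
        "properties": {
            "time": {"type": "string", "description": "24h time, e.g. 07:30"},
            "label": {"type": "string"},
        },
        "required": ["time"],
    },
}

# Constrained decoding means the model's tool call must parse against the
# schema. A well-formed call looks like:
call = {"name": "set_alarm", "arguments": {"time": "07:30", "label": "standup"}}

# Minimal required-field check, standing in for the runtime's enforcement.
required = set_alarm_tool["parameters"]["required"]
assert all(k in call["arguments"] for k in required)
print(json.dumps(call))
```

On-device, that constraint matters more than in the cloud: a small model that can only emit schema-valid calls is far safer to wire directly into phone APIs like alarms or contacts.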
The weirdly compelling extra is the browser path. Delangue highlighted a Hugging Face Space running Gemma 4 fully locally with transformers.js and WebGPU, and Unsloth's launch guide claimed the small variants can run in around 6 GB of RAM, including on phones. Between LiteRT, browser inference, and Android Studio, Gemma 4 shipped with a clearer on-device story than most open model launches manage in week one.