Google DeepMind released Gemma 4 as four Apache 2.0 multimodal models spanning edge to workstation sizes, with up to 256K context and native tool use. The models are aimed at local and self-hosted deployment, and day-0 support in AI Studio, Ollama, vLLM, SGLang, Hugging Face, and Baseten makes rollout practical.

You can read the official launch post, skim Hugging Face's unusually detailed architecture and deployment writeup, and check the vLLM launch post for the hardware matrix. The LiteRT overview already lists Gemma-4-E2B as a featured on-device model, and the Hacker News thread immediately turned into a local deployment discussion.
The family splits cleanly into two edge models and two larger reasoning models. Google positions E2B and E4B for mobile and real-time multimodal use, while 26B A4B and 31B target heavier local reasoning and coding workloads (Google's size chart).
Hugging Face's release post fills in the numbers Google only sketches in the thread: E2B and E4B run at 2B and 4B effective parameters, the 26B A4B MoE activates 4B parameters per token, and the 31B model is dense.
That effective-parameter framing is the trick behind the small models. The Hugging Face post says Gemma 4 keeps the per-layer embeddings design from Gemma-3n, so the edge models carry larger total parameter counts than their active runtime footprint suggests.
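To make that framing concrete, per-token decode cost tracks active parameters, not totals. The sketch below uses the rough rule of thumb of about 2 FLOPs per active parameter per decoded token; the rule and the comparison are illustrative, not figures from the release posts:

```python
def flops_per_token(active_params_billion):
    """Rough decode-cost proxy: ~2 FLOPs per active parameter per token."""
    return 2 * active_params_billion * 1e9

# Active/effective parameter counts quoted at launch:
active = {"E2B": 2, "E4B": 4, "26B A4B": 4, "31B": 31}
cost = {name: flops_per_token(b) for name, b in active.items()}
# By this proxy, the 26B-total MoE decodes at the same per-token cost as the
# 4B edge model, while the dense 31B model costs 7.75x as much per token.
```

This is why "26B A4B" reads like a 4B model at inference time even though all 26B weights still have to live in memory.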
Google's pitch is unusually explicit here: Gemma 4 is meant for agents that plan, navigate apps, search databases, and trigger APIs (tool use and context). LMSYS's day-0 SGLang note adds the serving details engineers actually care about, including dedicated reasoning-parser and tool-call-parser support for all four checkpoints (SGLang summary).
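None of the posts include a wire format, but the agent loop they describe reduces to: the model emits a structured tool call, the runtime executes it, and the result goes back into context. A minimal dispatch sketch, with a canned JSON reply standing in for real model output (the tool name, registry, and data here are all hypothetical):

```python
import json

MODELS = ["gemma4:e2b", "gemma4:e4b", "gemma4:26b", "gemma4:31b"]

# Hypothetical tool registry; a real agent would also expose schemas to the model.
TOOLS = {
    "search_db": lambda query: [row for row in MODELS if query in row],
}

def dispatch(model_reply: str):
    """Parse one JSON tool call emitted by the model and execute it."""
    call = json.loads(model_reply)
    return TOOLS[call["name"]](**call["arguments"])

# Canned reply standing in for real model output:
result = dispatch('{"name": "search_db", "arguments": {"query": "e4b"}}')
```

The parsers SGLang ships exist precisely so serving code does not have to hand-roll the `json.loads` step against raw model text.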
The capability bundle spans multimodal input, up to 256K context, and native tool use.
The official launch post and Hugging Face breakdown also describe a hybrid attention stack: alternating sliding-window and full-context layers, proportional RoPE on the global layers for long context, and shared KV cache in later layers to cut long-context inference cost.
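A toy sketch of what that interleave means at the mask level; the 1-in-4 global ratio and the window size below are placeholders, since neither post states the actual numbers:

```python
def attn_mask(seq_len, window=None):
    """Causal attention mask: entry [i][j] is True when query i may attend key j.
    window=None gives a full-context layer; an integer gives a sliding window."""
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Placeholder interleave: every fourth layer sees full context, the rest a window.
layers = [attn_mask(8) if k % 4 == 3 else attn_mask(8, window=4) for k in range(8)]
```

The payoff of the pattern is in the KV cache: sliding-window layers only need to retain `window` keys per head, so only the sparse global layers pay full long-context memory cost.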
This was a broad release, fast. Google put Gemma 4 into AI Studio and linked weight downloads from Hugging Face, Kaggle, and Ollama on announcement day (official availability). Ollama required version 0.20+ and exposed all four variants with simple ollama run tags on day 0 (Ollama day-0 support).
The rest of the serving surface landed quickly too: vLLM shipped day-0 support with a published hardware matrix, SGLang added its reasoning and tool-call parsers for all four checkpoints, and Baseten was listed among the day-0 providers alongside Hugging Face.
That breadth matters more than another leaderboard screenshot. Open models usually arrive as weights first and production ergonomics later. Gemma 4 showed up with the inference ecosystem already waiting for it.
The early benchmark story is mostly about parameter efficiency. Arena posted Gemma 4 31B at #3 among open models and the 26B A4B variant at #6 (Arena ranking). A follow-up Arena post framed the 31B model as 24 times smaller than GLM-5 and 34 times smaller than Kimi-K2.5 by total parameter count while landing in the same neighborhood (parameter comparison).
Artificial Analysis put more numbers on that picture. It reported Gemma 4 31B Reasoning at 85.7% on GPQA Diamond, just behind Qwen3.5 27B Reasoning at 85.8%, and noted that Gemma used about 1.2M output tokens versus roughly 1.5M and 1.6M for two nearby Qwen reasoning models (Artificial Analysis GPQA token usage).
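Those token counts reduce to a simple efficiency ratio. The figures below are the rounded values from that post, with the two Qwen models left unnamed as in the original:

```python
# Output tokens on GPQA Diamond, as reported (rounded):
tokens = {"gemma-4-31b-reasoning": 1.2e6, "qwen-model-1": 1.5e6, "qwen-model-2": 1.6e6}

gemma = tokens["gemma-4-31b-reasoning"]
fewer = {k: 1 - gemma / v for k, v in tokens.items() if v != gemma}
# Gemma emits roughly 20% and 25% fewer output tokens than the two Qwen baselines,
# which compounds directly into lower reasoning-mode latency and cost.
```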
The 26B A4B result is less dramatic, 79.2% on GPQA Diamond in that same post, but it is the more revealing model architecturally. According to the Hugging Face analysis, it activates only 4B parameters inside a 26B MoE, which is exactly the kind of trade that makes local reasoning models more deployable than their headline size suggests.
Gemma 4 did not stop at server inference. Hugging Face's CEO pointed to a WebGPU demo running fully in-browser through transformers.js (browser demo), and Google's LiteRT overview lists Gemma-4-E2B as a featured cross-platform edge model for Android, iOS, web, desktop, and even Raspberry Pi.
Phil Schmid's roundup adds the last piece: Gemma 4 support in Android Studio for local agentic coding, plus hooks into Vertex Model Garden, Vertex fine-tuning, ADK, Cloud Run, GKE, and TPU-backed vLLM paths (ecosystem extras). That is a bigger release footprint than the benchmark charts imply.
If there is a memorable detail from launch day, it is that Gemma 4 appeared simultaneously as a model family, a serving target, and an on-device runtime story.
Available in four sizes:
🔵 31B Dense & 26B MoE: state-of-the-art performance for advanced local reasoning tasks – like custom coding assistants or analyzing scientific datasets.
🔵 E4B & E2B (Edge): built for mobile with real-time text, vision, and audio processing.
Build autonomous agents that plan, navigate apps, and execute multi-step tasks – like searching databases or triggering APIs – with native tool use. With up to 256K context, it can analyze full codebases and retain complex action histories without losing focus.
Start building with Gemma 4 now in @GoogleAIStudio. You can also download the model weights from @HuggingFace, @Kaggle, or @Ollama. Find out more → goo.gle/41IC3lY
.@GoogleDeepMind Gemma 4 is here with state-of-the-art models targeting edge and workstations. Requires Ollama 0.20+ that is rolling out. 4 models:
4B Effective (E4B): ollama run gemma4:e4b
2B Effective (E2B): ollama run gemma4:e2b
26B (4B active MoE): ollama run gemma4:26b
Start experimenting with Gemma 4 now in @GoogleAIStudio or download the model weights from @HuggingFace, @Kaggle and @Ollama. Learn more → goo.gle/48ef4TB
Gemma-4-31B is now live in Text Arena - ranking #3 among open models (#27 overall), matching much larger models at 10× smaller scale! A significant jump from Gemma-3-27B (+87 pts). Highlights:
- #3 open (#27 overall), on par with the best open models Kimi-K2.5, Qwen-3.5-397b
Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we’re releasing them under an Apache 2.0 license. Here’s what’s new 🧵
Google has released Gemma 4, a new family of multimodal open-weight models including Gemma 4 E2B, Gemma 4 E4B, Gemma 4 31B and Gemma 4 26B A4B. @GoogleDeepMind's new Gemma 4 family introduces four multimodal models supporting text, image, and video inputs. We evaluated Gemma 4 …
You can run Gemma 4 100% locally in your browser thanks to HF transformers.js. That means 100% private and 100% free! @xenovacom created a demo for it here: huggingface.co/spaces/webml-c…
You can run Google's new Gemma 4 on mobile easily. I am using Gemma 4 version E2B on my Pixel 10 Pro. Here is all you need to do:
- Go to the App Store and install Google AI Edge Gallery. If you already have it, just update it.
- From there, you can install the model directly and …