Release: April 2, 2026

Google DeepMind releases Gemma 4 under Apache 2.0 with 31B Dense, 26B MoE, and 256K context

Google DeepMind shipped four Gemma 4 models with multimodal input: 31B Dense, 26B MoE, and two edge variants, all available through AI Studio, Hugging Face, Kaggle, and Ollama. Early community tests say local performance and usable context windows still vary by runtime, quantization, and GPU memory.

TL;DR

Google DeepMind's launch thread introduced Gemma 4 as an Apache 2.0 family of open models, with four releases split between workstation-class reasoning models and edge variants for phones and small devices.
Google DeepMind's agent demo said the larger models support native tool use and up to 256K context, while the official blog post adds structured JSON output, function calling, and system instructions.
Logan Kilpatrick's launch post positioned the 31B Dense and 26B A4B MoE models as frontier local options, and the Hugging Face release note pegs the MoE at 26B total parameters with 4B active at inference.
Google DeepMind's availability post said Gemma 4 landed day one in Google AI Studio, Hugging Face, Kaggle, and Ollama, while the vLLM launch post added same-day support across TPUs, AMD GPUs, and Intel XPUs.
Early community feedback was less tidy: one Reddit thread from Openclaw users liked the small-model routing story, but the same thread also logged context-window and quantization trouble on a 16GB card.

You can read Google's launch post, skim the Hugging Face integration guide, and check the vLLM day-zero support note. The weirdly useful bit is how broad the rollout is: AI Studio for instant prompting, local runtimes from Ollama to MLX, and edge models that Google says are already aimed at Android devices and Raspberry Pi-class hardware.

Four model sizes, two very different jobs

Gemma 4 ships as four models, but they break into two clear tiers.

31B Dense: the quality-first model for local reasoning and fine-tuning, according to Google's blog post
26B A4B MoE: 26B total parameters, about 4B active per token, according to the Hugging Face writeup
E4B: a 4.5B effective edge model with 128K context, according to Hugging Face
E2B: a 2.3B effective edge model for the smallest hardware targets, also with 128K context, according to Hugging Face
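
The gap between total and active parameters is easier to see with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it uses the parameter counts quoted above, treats "about 4B active" as exactly 4B, and assumes a few common bit-widths that are not tied to any specific published quant. It also ignores runtime overhead and the KV cache. The general point is that a MoE model still has to hold all of its weights in memory, even though only a fraction of them do work on each token.

```python
def weight_footprint_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring runtime overhead."""
    return total_params_billion * bits_per_weight / 8


def active_fraction(active_params_billion: float, total_params_billion: float) -> float:
    """Share of parameters that participate in each token's forward pass."""
    return active_params_billion / total_params_billion


# (total, active) parameter counts in billions, as quoted in the article.
# The bit-widths below are illustrative assumptions, not official quant formats.
MODELS = {"31B Dense": (31.0, 31.0), "26B A4B MoE": (26.0, 4.0)}

for name, (total, active) in MODELS.items():
    for bits in (16, 8, 4):
        print(f"{name}: ~{weight_footprint_gb(total, bits):.1f} GB of weights at {bits}-bit")
    print(f"{name}: {active_fraction(active, total):.0%} of weights active per token")
```

Under these assumptions the MoE model touches roughly 15% of its weights per token, which helps speed but does not shrink the memory needed to load it.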

Google's pitch is blunt: the 31B and 26B models are supposed to punch above their size, while the edge pair is tuned for battery, RAM, and latency instead of leaderboard bragging rights.

Agent workflows are the main product story

The release is framed less as a chatbot refresh and more as a local agent substrate. Google's own checklist in the official announcement is unusually specific:

native function calling
structured JSON output
native system instructions
multimodal input across text, images, and video
audio input on E2B and E4B
256K context on 26B and 31B
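
To make the function-calling and structured-output items concrete, here is a minimal sketch of what the receiving side of a tool call might look like in a local agent. Everything in it is an assumption for illustration: the `{"name": ..., "arguments": ...}` shape is a common convention rather than Gemma 4's documented output format, and `get_weather` is a hypothetical stub. It uses only the Python standard library and does not call any model.

```python
import json


# Hypothetical local tool; the name and signature are illustrative only.
def get_weather(city: str) -> str:
    return f"(stub) weather for {city}"


# Registry of tools the agent is willing to run.
TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(raw_output: str) -> str:
    """Parse a JSON tool call of the assumed shape and run the matching tool."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc

    if not isinstance(call, dict):
        raise ValueError("expected a JSON object")

    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("'arguments' must be a JSON object")

    # Only registered tools are ever executed, with keyword arguments from the call.
    return TOOLS[name](**args)


# Stand-in for text a model might emit; not real Gemma 4 output.
example_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
print(dispatch_tool_call(example_output))
```

The allow-list in `TOOLS` is the design choice worth keeping in any real version: structured output makes parsing reliable, but the agent should still decide which functions a model is permitted to trigger.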

The Hugging Face post adds a few architecture details that matter for creators building tools around it: variable-aspect-ratio vision input, configurable image token budgets, Per-Layer Embeddings, and a shared KV cache meant to cut long-context memory cost.

Day-one rollout is unusually broad

Gemma 4 showed up in more places than most open-model launches.

Prompt it immediately: Google AI Studio
Download weights: Hugging Face, Kaggle, Ollama
Run in common stacks: transformers, llama.cpp, MLX, WebGPU, Mistral.rs, according to Hugging Face
Serve it at launch: vLLM, with day-zero support across Google TPUs, AMD GPUs, and Intel XPUs

That matters because the model is clearly being pitched as something you can move between a local agent, a workstation coding setup, and a mobile or edge deployment without waiting for the ecosystem to catch up.

Local reality still depends on runtime and memory

r/openclaw thread: "Gemma 4 is now available: An accessible and capable local model to self-host for Openclaw" (11 comments)

The first community thread reads like the usual first 24 hours of local AI: excitement, then hardware math. One poster framed E2B as a 4 to 8GB RAM option for CPU or VPS use, and another suggested the small models were attractive for routing and RAG-style copilots.

The most concrete counterpoint came from a user trying E4B and 26B MoE quants on a 16GB RTX 5060 Ti. In that same Openclaw thread, they said some setups only ran after cutting the context window to 64K or 32K, then rolled back to Qwen while runtimes caught up.
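
The reason cutting the context window helps is that the attention cache grows with every token kept in context. The sketch below uses the standard estimate for a plain full-attention transformer; the layer count, KV head count, and head dimension are placeholders chosen for round numbers and are not Gemma 4's real configuration. Architectures with a shared KV cache, which the Hugging Face post says Gemma 4 uses, are designed to come in below this kind of estimate.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV cache size in GiB for one sequence under plain full attention."""
    # Factor of 2 covers keys and values; bytes_per_elem=2 assumes 16-bit cache entries.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 2**30


# Placeholder architecture for illustration, NOT Gemma 4's actual dimensions.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (128 * 1024, 64 * 1024, 32 * 1024):
    size = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"context {ctx // 1024}K: ~{size:.1f} GiB of KV cache")
```

In this simplified model the cache scales linearly with context length, so each halving of the window halves the cache. On a 16GB card that is already mostly filled by quantized weights, that difference can decide whether a model loads at all.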
