Skip to content
AI Primer
release

Google DeepMind releases Gemma 4 under Apache 2.0 with 31B Dense, 26B MoE, and 256K context

Google DeepMind shipped four Gemma 4 models with multimodal input, including 31B Dense, 26B MoE, and two edge variants available through AI Studio, Hugging Face, Kaggle, and Ollama. Early community tests say local performance and usable context windows still vary by runtime, quantization, and GPU memory.

4 min read
Google DeepMind releases Gemma 4 under Apache 2.0 with 31B Dense, 26B MoE, and 256K context
Google DeepMind releases Gemma 4 under Apache 2.0 with 31B Dense, 26B MoE, and 256K context

TL;DR

You can read Google's launch post, skim the Hugging Face integration guide, and check the vLLM day-zero support note. The weirdly useful bit is how broad the rollout is: AI Studio for instant prompting, local runtimes from Ollama to MLX, and edge models that Google says are already aimed at Android devices and Raspberry Pi class hardware.

Four model sizes, two very different jobs

Gemma 4 ships as four models, but they break into two clear tiers.

  • 31B Dense: the quality-first model for local reasoning and fine-tuning, according to Google's blog post
  • 26B A4B MoE: 26B total parameters, about 4B active per token, according to the Hugging Face writeup
  • E4B: a 4.5B effective edge model with 128K context, according to Hugging Face
  • E2B: a 2.3B effective edge model for the smallest hardware targets, also with 128K context, according to Hugging Face

Google's pitch is blunt: the 31B and 26B models are supposed to punch above their size, while the edge pair is tuned for battery, RAM, and latency instead of leaderboard bragging rights.

Agent workflows are the main product story

The release is framed less as chatbot refresh, more as local agent substrate. Google's own checklist in the official announcement is unusually specific:

  • native function calling
  • structured JSON output
  • native system instructions
  • multimodal input across text, images, and video
  • audio input on E2B and E4B
  • 256K context on 26B and 31B

The Hugging Face post adds a few architecture details that matter for creators building tools around it: variable-aspect-ratio vision input, configurable image token budgets, Per-Layer Embeddings, and a shared KV cache meant to cut long-context memory cost.

Day-one rollout is unusually broad

Gemma 4 showed up in more places than most open-model launches.

  • Prompt it immediately: Google AI Studio
  • Download weights: Hugging Face, Kaggle, Ollama
  • Run in common stacks: transformers, llama.cpp, MLX, WebGPU, Mistral.rs, according to Hugging Face
  • Serve it at launch: vLLM, with day-zero support across Google TPUs, AMD GPUs, and Intel XPUs

That matters because the model is clearly being pitched as something you can move between a local agent, a workstation coding setup, and a mobile or edge deployment without waiting for the ecosystem to catch up.

Local reality still depends on runtime and memory

r/openclaw

Gemma 4 is now available: An accessible and capable local model to self-host for Openclaw

11 comments

The first community thread in the evidence pool reads like the usual first 24 hours of local AI: excitement, then hardware math. One poster framed E2B as a 4 to 8GB RAM option for CPU or VPS use, and another suggested the small models were attractive for routing and RAG-style copilots.

The most concrete counterpoint came from a user trying E4B and 26B MoE quants on a 16GB RTX 5060 Ti. In that same Openclaw thread, they said some setups only ran after cutting the context window to 64K or 32K, then rolled back to Qwen while runtimes caught up.

Further reading

Discussion across the web

Where this story is being discussed, in original context.

On X· 2 threads
TL;DR1 post
Apache 2.01 post
Share on X