Google DeepMind released Gemma 4 as a family of four open models with up to 256K context, multimodal inputs, and native tool-calling for local agent workflows. Day-0 support across serving stacks and benchmark wins make it ready for phones, laptops, and server GPUs.

You can read Google's announcement, skim the official model overview, and browse Hugging Face's ecosystem post. There is also an Android Studio post on offline agentic coding, plus a WebGPU demo showing Gemma 4 running fully in the browser.
The release breaks into two hardware tiers. According to Google DeepMind, E2B and E4B target mobile and edge workloads, while 26B A4B and 31B aim at local reasoning on PCs and workstations.
The naming also signals three different architectures. The official docs describe E2B and E4B as effective-parameter models for ultra-mobile, edge, and browser deployment, 31B as a dense model, and 26B A4B as a Mixture-of-Experts system tuned for high-throughput reasoning.
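The naming scheme can be decoded numerically. A minimal sketch, assuming the conventional reading where "E" marks effective parameters and "A4B" means roughly 4B parameters active per token out of 26B total; the exact counts below are illustrative assumptions, not official spec-sheet figures:

```python
# Hypothetical decoding of the Gemma 4 naming scheme.
# Parameter counts are illustrative assumptions, not official figures.
variants = {
    "E2B":     {"total_b": 2.0,  "active_b": 2.0},   # effective-parameter edge model
    "E4B":     {"total_b": 4.0,  "active_b": 4.0},   # effective-parameter edge model
    "26B A4B": {"total_b": 26.0, "active_b": 4.0},   # MoE: ~4B active per token
    "31B":     {"total_b": 31.0, "active_b": 31.0},  # dense: all weights active
}

for name, v in variants.items():
    ratio = v["active_b"] / v["total_b"]
    print(f"{name:>8}: {v['active_b']:.0f}B active / {v['total_b']:.0f}B total "
          f"({ratio:.0%} of weights per token)")
```

Under that reading, the MoE model touches only about 15% of its weights per token, which is why it can chase 31B-class quality at a fraction of the per-token compute.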
A few concrete numbers matter more than the marketing:
Google's best headline is parameter efficiency. Arena ranked Gemma 4 31B third among open models with a score of 1452, and Arena's follow-up called it the top-ranked US open model while noting it is far smaller than GLM-5 and Kimi-K2.5.
The 26B A4B model is the more interesting entry for engineers who care about throughput. Arena placed it sixth among open models at 1441, close enough to the 31B result that the MoE design looks like more than a packaging trick.
Artificial Analysis added a second useful readout: Gemma 4 31B scored 85.7% on GPQA Diamond, just behind Qwen3.5 27B in its sub-40B class, and did it with about 1.2 million output tokens. That token-efficiency chart is one of the few pieces of outside evidence that says something new beyond Arena Elo.
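The token-efficiency framing can be made concrete with back-of-envelope arithmetic on the two reported figures (85.7% GPQA Diamond, roughly 1.2 million output tokens). The "accuracy points per million output tokens" metric below is a hypothetical readout for illustration, not Artificial Analysis's own formula:

```python
# Back-of-envelope token efficiency from the reported figures.
gpqa_score = 0.857      # reported GPQA Diamond accuracy for Gemma 4 31B
output_tokens = 1.2e6   # approximate output tokens used across the eval

# Hypothetical comparison metric: accuracy points per million output tokens.
efficiency = (gpqa_score * 100) / (output_tokens / 1e6)
print(f"{efficiency:.1f} accuracy points per million output tokens")
```

The point of a metric like this is that two models with the same Elo or GPQA score can differ sharply in how many tokens they burn to get there, which dominates serving cost.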
The fastest sign of real demand was how much inference plumbing appeared on day one. Ollama published run commands for all four variants, SGLang posted server examples with Gemma-specific reasoning and tool-call parsers, and vLLM shipped a containerized quick start.
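Since SGLang and vLLM both expose OpenAI-compatible endpoints, a local request has the same shape regardless of backend. A hedged sketch that only builds the payload; the model tag "gemma-4-26b-a4b" is an assumed placeholder, and the actual day-0 commands live in each project's docs:

```python
import json

# Hypothetical request body for an OpenAI-compatible local server
# (both vLLM and SGLang serve /v1/chat/completions).
# The model tag is an assumed placeholder, not an official identifier.
payload = {
    "model": "gemma-4-26b-a4b",
    "messages": [
        {"role": "user", "content": "Summarize this log file in three bullets."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

body = json.dumps(payload)
print(body)
```

Because the wire format is shared, swapping Ollama for vLLM or SGLang is mostly a matter of changing the base URL and model tag, which is a big part of why all three could ship recipes on day one.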
The Hugging Face ecosystem post makes the same point from another angle. It positioned Gemma 4 as already wired into agent frameworks, inference engines, and fine-tuning libraries, which is why this launch produced deployment recipes immediately instead of waiting on community ports.
That stack coverage spans very different targets:
The edge story is broader than just "small enough to fit." Phil Schmid pointed to Android Studio support for Gemma 4 as a local agent for app development, and the Android Developers post frames that as offline agentic coding inside the IDE.
Google's LiteRT-LM overview fills in the rest: Gemma 4 can run across Android, iOS, web, desktop, and IoT targets, with GPU and NPU acceleration, multimodal inputs, and constrained function calling for on-device agent workflows. That is a much more ambitious deployment surface than the usual "open weights on a workstation" launch.
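The "constrained function calling" piece is easiest to picture as a schema the runtime enforces on decoded output. A sketch using the common JSON-schema convention for tool declarations; the tool name and shapes here are illustrative, and LiteRT-LM's actual on-device API may differ:

```python
import json

# A tool declaration in the common JSON-schema convention for function calling.
# Names are illustrative; LiteRT-LM's actual API may use a different shape.
set_alarm_tool = {
    "name": "set_alarm",
    "description": "Set a device alarm at a given local time.",
    "parameters": {
        "type": "object",
        "properties": {
            "time": {"type": "string", "description": "24h time, e.g. 07:30"},
            "label": {"type": "string"},
        },
        "required": ["time"],
    },
}

# Constrained decoding means the model's tool call must parse against the
# schema. A well-formed call looks like:
call = {"name": "set_alarm", "arguments": {"time": "07:30", "label": "standup"}}

# Minimal required-field check, standing in for the runtime's enforcement.
required = set_alarm_tool["parameters"]["required"]
assert all(k in call["arguments"] for k in required)
print(json.dumps(call))
```

On-device, that constraint matters more than in the cloud: a small model that can only emit schema-valid calls is far safer to wire directly into phone APIs like alarms or contacts.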
The weirdly compelling extra is the browser path. Delangue highlighted a Hugging Face Space running Gemma 4 fully locally with transformers.js and WebGPU, and Unsloth's launch guide claimed the small variants can run in around 6 GB of RAM, including on phones. Between LiteRT, browser inference, and Android Studio, Gemma 4 shipped with a clearer on-device story than most open model launches manage in week one.