AI Primer

Google releases Gemma 4 QAT: E2B drops to ~1GB and Ollama, SGLang, vLLM add support

Google published Gemma 4 QAT checkpoints and mobile-focused quant formats, cutting Gemma 4 E2B to roughly 1GB of memory. Ollama, SGLang, and vLLM added day-one support, making local deployment more practical on phones, laptops, and low-VRAM GPUs.


TL;DR

  • Google shipped Gemma 4 QAT checkpoints across the full family. UnslothAI's launch post puts the headline result at roughly 3x lower memory with near-original performance, and Google's official documentation says the release covers Q4_0 checkpoints plus a separate mobile format.
  • The smallest edge model is the big unlock: osanseviero highlighted a new mobile quantization format that brings E2B to about 1 GB, and the Google blog post adds that a text-only E2B build can come in under 1 GB.
  • Serving support landed immediately across local stacks, with lmsysorg announcing SGLang support, vllm_project saying vLLM is Google's recommended serving path, and ollama publishing ready-to-run model tags.
  • The QAT drop also came with a quantization gotcha: danielhanchen says naive Q4_0 conversion can trash accuracy, while Unsloth's QAT docs show its repack recovering much of that gap on 26B-A4B and 31B.
  • Early community testing is already more specific than the launch copy, because the LocalLLaMA benchmark thread found the official Google q4_0 build more reliable than an Unsloth repack on fabrication traps, even as both still failed a false-premise prompt about Napoleon.

You can read the official announcement, browse Google's Hugging Face QAT collection, and check the Unsloth QAT guide for the size table Google's post omits. The checkpoint drop hit Hugging Face immediately, which huggingface's repost and mervenoyann's repost both amplified, and the rollout rode on the same ecosystem push Google used for Gemma 4 12B two days earlier.

Checkpoints

The release covers Gemma 4 E2B, E4B, 12B, 26B-A4B, and 31B in QAT form, according to Google's official documentation. Google's Hugging Face model cards say the drop includes four delivery paths: unquantized QAT checkpoints for downstream conversion, GGUF Q4_0 files, mobile-optimized checkpoints, and MTP-aware variants for the models with drafters.

The practical memory story is wider than the 1 GB headline. Unsloth's size table lists QAT GGUF footprints of 2.62 GB for E2B, 4.22 GB for E4B, 6.72 GB for 12B, 14.2 GB for 26B-A4B, and 17.3 GB for 31B, each about 72% smaller than the corresponding BF16 weights.
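The "about 72% smaller" figure is roughly what the Q4_0 layout predicts on its own. A back-of-envelope sketch, assuming every weight is stored in llama.cpp's Q4_0 block layout (32 four-bit weights sharing one 16-bit scale); real GGUF files keep some tensors at higher precision, so treat this as an approximation and not as Unsloth's or Google's accounting:

```python
# Back-of-envelope check of the "about 72% smaller" figure.
# Assumption: all weights use llama.cpp's Q4_0 block layout, where a block of
# 32 weights shares one 16-bit scale and each weight takes 4 bits.
# Real GGUF files mix tensor types, so this is only approximate.

BLOCK_SIZE = 32    # weights per Q4_0 block
SCALE_BITS = 16    # one fp16 scale per block
WEIGHT_BITS = 4    # bits per quantized weight
BF16_BITS = 16     # bits per weight in the BF16 baseline


def q4_0_bits_per_weight() -> float:
    """Average storage cost per weight, including the shared block scale."""
    return (BLOCK_SIZE * WEIGHT_BITS + SCALE_BITS) / BLOCK_SIZE


def reduction_vs_bf16(bits_per_weight: float) -> float:
    """Fraction of the BF16 footprint that is saved."""
    return 1.0 - bits_per_weight / BF16_BITS


bpw = q4_0_bits_per_weight()
print(f"Q4_0 bits per weight: {bpw}")
print(f"Reduction vs BF16: {reduction_vs_bf16(bpw):.1%}")
```

At 4.5 bits per weight against 16, the saving works out to just under 72%, which lines up with the reduction Unsloth's table reports.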

Mobile format

Google's mobile-specific schema is the real engineering detail in this release, not just another batch of 4-bit files. The official post breaks it into four pieces:

  • Static activations, which precompute scaling during training instead of on-device.
  • Channel-wise quantization, which maps better to mobile accelerators.
  • Targeted 2-bit quantization for token-generation components, while reasoning layers stay at higher precision.
  • Embedding and KV cache compression, which cuts active memory during long chats.
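
To see why channel-wise scaling can matter, here is a generic sketch of symmetric 4-bit quantization that compares one scale for the whole tensor against one scale per output channel. It is not Google's mobile format; the weights are made up and the helper names are hypothetical.

```python
from typing import List


def symmetric_scale(values: List[float], bits: int = 4) -> float:
    """Scale that maps the largest magnitude onto the top of the signed range."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def fake_quantize(values: List[float], scale: float, bits: int = 4) -> List[float]:
    """Round to the integer grid, clamp, and map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original: List[float], restored: List[float]) -> float:
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Illustrative weights: each row is one output channel.
weights = [
    [0.02, -0.01, 0.03, -0.02],  # small-magnitude channel
    [1.50, -2.00, 0.75, -1.25],  # large-magnitude channel
]

# Per-tensor: a single scale, dominated by the largest weight anywhere.
tensor_scale = symmetric_scale([v for row in weights for v in row])
per_tensor = [fake_quantize(row, tensor_scale) for row in weights]

# Per-channel: each row gets a scale fitted to its own range.
per_channel = [fake_quantize(row, symmetric_scale(row)) for row in weights]

for i, row in enumerate(weights):
    print(
        f"channel {i}: per-tensor error {mean_abs_error(row, per_tensor[i]):.5f}, "
        f"per-channel error {mean_abs_error(row, per_channel[i]):.5f}"
    )
```

In this toy case the shared scale rounds the small-magnitude channel to all zeros, while the per-channel scale keeps its values distinguishable.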

That format is only for the edge models. Google's official documentation says E2B and E4B get the mobile-specialized treatment, while the whole family gets QAT-tuned Q4_0 checkpoints. Google also says you can drop unused modalities, which is how the text-only E2B variant falls below 1 GB.

Serving support

Google used the launch to spell out a recommended runtime for each deployment target. The official post points to GGUF for llama.cpp, compressed tensors for vLLM, LiteRT-LM for edge deployment, Transformers.js for the browser, SGLang and vLLM for larger serving, MLX for Apple Silicon, and Hugging Face Transformers plus Unsloth for fine-tuning.

Ollama turned that into copy-paste tags on day one, with ollama publishing:

  • ollama run gemma4:e2b-it-qat
  • ollama run gemma4:e4b-it-qat
  • ollama run gemma4:12b-it-qat
  • ollama run gemma4:26b-a4b-it-qat
  • ollama run gemma4:31b-it-qat
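
Once a tag is pulled, the same model can be called programmatically through Ollama's local REST API. A minimal sketch, assuming an Ollama server is running on its default port (11434) and that the tags above are available; the helper names `build_request` and `generate` are hypothetical:

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("gemma4:e2b-it-qat", "Summarize QAT in one sentence.")` should return the model's reply as a string. Setting `stream` to `False` asks for a single JSON object in place of a stream of chunks.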

The speed claim also survived quantization. lmsysorg says Q4_0 plus MTP checkpoints preserve the MTP speedup, and Google's post says those MTP QAT checkpoints are part of the release.

12B setup

The QAT rollout landed just two days after Gemma 4 12B, so most of the ecosystem plumbing was already warm. lmsysorg had already wired up the encoder-free 12B model in SGLang, vllm_project had already exposed its reasoning, vision, and audio paths through an OpenAI-compatible API, and demishassabis framed that 12B release as part of a Gemma line that had already passed 150 million downloads.

That earlier 12B launch also explains why Google keeps talking about laptops and edge hardware. The Gemma 4 12B developer guide positioned 12B as a local multimodal model for 16 GB machines; QAT extends the same local-first pitch downward, especially for E2B and E4B, and outward to more runtimes.

Q4_0 conversion gotcha

The cleanest non-marketing detail in the rollout is that QAT does not automatically survive every downstream conversion. danielhanchen says a naive jump from QAT BF16 to llama.cpp's Q4_0 lattice dropped 26B-A4B top-1 accuracy from 85.6% to 70.2%, and dropped 31B from 96.7% to 87.9%.
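
Put as arithmetic, using only the accuracy figures quoted above, the drops look like this (a small sketch; the helper names are hypothetical):

```python
# Top-1 accuracy (percent) before and after a naive Q4_0 conversion,
# taken from the figures quoted in the text.
reported = {
    "26B-A4B": (85.6, 70.2),
    "31B": (96.7, 87.9),
}


def drop_points(before: float, after: float) -> float:
    """Absolute loss in percentage points."""
    return round(before - after, 1)


def relative_drop(before: float, after: float) -> float:
    """Loss as a fraction of the original accuracy."""
    return (before - after) / before


for name, (before, after) in reported.items():
    print(
        f"{name}: {drop_points(before, after)} points lost, "
        f"{relative_drop(before, after):.1%} relative"
    )
```

That is a loss of 15.4 points for 26B-A4B and 8.8 points for 31B, so the smaller-active-parameter model takes the larger hit in both absolute and relative terms.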

Unsloth's QAT guide says its dynamic repack recovers most of that loss while shaving about 200 MB off the 26B file. Google's own post quietly points at the same issue from the other side, because it ships unquantized QAT checkpoints for people who want to do their own downstream compilation instead of treating every Q4_0 path as equivalent.

Early hands-on

Thread on r/localLLM: "Gemma4 E2B QAT: I ran fabrication traps and sycophancy tests. Very interesting reasoning traces."

The first usage reports already split into two buckets: throughput demos and behavior tests. ai_for_success showed gemma-4-12b-qat running in LM Studio at about 53 tokens per second on an M5 Max with 128 GB memory, while the LocalLLaMA benchmark thread ran the smaller E2B QAT builds on a 16 GB Ryzen laptop and focused on hallucination-style failure modes instead of vibes.

That Reddit test found the official Google q4_0 build and an Unsloth repack behaved similarly on several tasks, but not identically. According to the LocalLLaMA benchmark thread, both builds failed a false-premise Napoleon prompt, both showed consensus-skew on the Younger Dryas Impact Hypothesis, and the official Google build had the better baseline refusal rate on fabrication traps.

The weirdest finding is about reasoning traces, not memory. The LocalLLaMA benchmark thread reports that one failing run recognized a fake theory inside its chain of thought, told itself not to fabricate, then emitted a detailed invented explanation anyway. That is a much more useful first caveat than any launch benchmark chart.
