releaseApril 3, 2026

Gemma 4 adds NVFP4 and vLLM support after a +30 Arena benchmark gain

Gemma 4 picked up benchmark, quantization, and runtime support across NVIDIA, Apple MLX, vLLM, and local toolchains a day after launch. The 31B and MoE variants can now be tested on laptops, RTX rigs, and serving stacks without custom integrations.

6 min read

Gemma 4 adds NVFP4 and vLLM support after a +30 Arena benchmark gain

TL;DR

Gemma 4 shipped as a four-model family under Apache 2.0, spanning E2B, E4B, a 26B A4B MoE, and a 31B dense model, with multimodal input across the lineup and bigger context windows than Gemma 3 HN launch summary Artificial Analysis breakdown.
Early benchmark chatter centered on the 31B model: Arena said it sits about 30 points above similarly priced peers on its Pareto chart Arena Pareto chart, while Artificial Analysis scored it three points behind Qwen3.5 27B on overall intelligence but far more token efficient Artificial Analysis thread Token efficiency follow-up.
Serving support arrived fast. vLLM v0.19.0 added full Gemma 4 support, including MoE, multimodal, reasoning, and tool use, alongside a stack of engine upgrades like zero-bubble async scheduling, CPU KV offload, and fuller CUDA graph capture vLLM release thread vLLM engine details.
NVIDIA used the launch to push a Blackwell-oriented Gemma 4 path too, including an NVFP4 checkpoint for Gemma-4-31B with vLLM, while RTX and llama.cpp benchmarks started circulating within hours RTX speed claim vLLM hardware support.
The first day looked less like a clean model drop and more like a runtime readiness test: HN commenters were already swapping Ollama, llama.cpp, LiteRT-LM, and OpenCode results, and Unsloth showed E4B doing a full repo audit with local Bash tool calls on 6GB RAM HN deployment delta Unsloth local tool-use demo.

You can read the DeepMind launch post, the vLLM Gemma 4 announcement, and NVIDIA's two deployment writeups, one on Blackwell and NVFP4 and one on RTX and llama.cpp. The vLLM v0.19.0 release notes landed the next day, while the main HN thread immediately turned into a compatibility and local-serving logbook.

Gemma 4 model map

DeepMind's launch post frames Gemma 4 as the Gemini 3-derived open family, not a single flagship checkpoint. The practical split is straightforward: E2B and E4B for tighter memory budgets, 26B A4B as the MoE middle tier, and 31B dense for the best reasoning numbers.

Hacker News

Gemma 4 — Google DeepMind

1.7k upvotes · 455 comments

Artificial Analysis pulled the family into one clean inventory Artificial Analysis breakdown:

31B dense: 256K context, text-image-video input.
26B A4B MoE: 27B total parameters, about 4B active, 256K context.
E4B: 8B total parameters, 128K context, adds native audio input.
E2B: 5.1B total, 2.3B active, 128K context, also supports audio input.

The release also changed the licensing story. Gemma 4 moved to Apache 2.0, where Gemma 3 had shipped under Gemma Terms of Use Artificial Analysis breakdown.

Arena scores and token budget

The loudest chart on launch day was the Arena Pareto plot. Arena placed Gemma 4 31B roughly 30 points above similarly priced models like DeepSeek 3.2, with the usual caveat that the pricing inputs were still early third-party estimates.

Artificial Analysis gave the more useful engineer readout: Gemma 4 31B landed at 39 on its Intelligence Index, behind Qwen3.5 27B Reasoning at 42, but it used 39 million output tokens to run the index versus Qwen's 98 million and GLM-4.7's 167 million Artificial Analysis breakdown Token efficiency follow-up.

That made two different launch stories run in parallel:

Raw rank: strong, but not category-leading.
Generational jump: much larger than Gemma 3, with the 31B model up 29 points by Artificial Analysis's scoring Gemma 3 to 4 jump.
Token budget: unusually lean for a reasoning model in this tier Token efficiency follow-up.

Artificial Analysis also noted where the gap remained. Gemma 4 31B stayed competitive on several non-agentic evals, but trailed Qwen3.5 mainly on agentic performance Artificial Analysis breakdown.

vLLM lands day-zero support

vLLM treated Gemma 4 like a first-class launch, not a later compatibility patch. Its launch post said support was immediate across multiple hardware backends, including day-zero support on Google TPUs, AMD GPUs, and Intel XPUs, and the 0.19.0 release notes spelled out full architecture support for MoE, multimodal, reasoning, and tool use.

The release notes tied Gemma 4 support to a broader inference-stack upgrade v0.19.0:

zero-bubble async scheduling now works with speculative decoding
Model Runner V2 adds piecewise CUDA graphs for pipeline parallelism
ViT full CUDA graph capture lands for vision encoders
CPU KV cache offloading becomes general, with pluggable policy and block-level preemption
DBO expands beyond a few architectures
online MXFP8 quantization covers MoE and dense models

The other hardware-side addition came from NVIDIA. Its technical post said an NVFP4 quantized Gemma-4-31B checkpoint was already available through NVIDIA Model Optimizer for Blackwell developers using vLLM, a very specific sign that Google and inference vendors had pre-coordinated the rollout NVIDIA technical post.

Local runtimes still needed fixes

The community evidence looked good, but messy in the familiar open-model-launch way. The HN thread quickly filled with reports from Ollama, llama.cpp, LiteRT-LM, and OpenCode-style harnesses, plus notes on memory, backend flags, and parser bugs HN deployment delta.

A few details stood out:

Unsloth showed Gemma 4 E4B in 4-bit mode completing a repo audit with local Bash execution and tool calls on 6GB RAM Unsloth local tool-use demo.
One HN commenter said Gemma 4 hit Ollama bugs around reasoning and streaming in an agentic coding setup, then worked better through llama.cpp for multi-turn tool calls HN comment on tool-calling bugs.
Another said litert-lm --backend gpu on Apple Silicon gave a large prefill and decode speedup HN comment on LiteRT-LM GPU backend.
llama.cpp itself merged a same-day Gemma 4 tokenizer fix, which tells you how fresh the support still was.

That combination, solid demos plus same-day parser and tokenizer fixes, is basically Christmas come early for coding agent nerds.

Phones and edge devices

Google also used Gemma 4 to push a more aggressive on-device story than the usual laptop demo loop. The Google Developers post pitched Gemma 4 for multi-step planning, autonomous action, offline code generation, and function-calling-style agent flows on-device, with access through Google AI Edge and Android's new AICore Developer Preview Google Developers post.

NVIDIA's launch material made the same point from the hardware side, positioning the family across Blackwell servers, RTX workstations, DGX Spark, and Jetson edge systems RTX deployment post NVIDIA technical post.

The odd little launch artifact was that the consumer-facing iPhone path showed up almost immediately too. A tweet linking Google's AI Edge Gallery app claimed Gemma 4 models were already runnable offline on iPhone, with "Thinking Mode" and "Agent Skills" called out in the update card AI Edge Gallery iPhone app.

🧾 More sources

TL;DR3 tweets

Top-line launch facts, benchmark framing, serving support, and first-day deployment signal.

Hacker News

Fresh discussion on Google releases Gemma 4 open models

1.7k upvotes · 455 comments

Gemma 4 model map1 tweets

Model sizes, modalities, context windows, and licensing changes across the family.

Arena scores and token budget3 tweets

Benchmark positioning, token-efficiency data, and the size of the generational jump.

vLLM lands day-zero support1 tweets

Serving-stack readiness, hardware backends, and release-note level inference changes around Gemma 4.