Google releases DiffusionGemma 26B-A4B with 4x faster block-based text decoding
Google released Apache 2.0 DiffusionGemma, a 26B-A4B diffusion text model that claims up to 4x faster output by generating text in blocks instead of one token at a time. The release matters for local and hosted stacks that want to test a new decoding path.

TL;DR
- Google shipped DiffusionGemma as an Apache 2.0 open model, a 26B-A4B MoE that generates text in blocks instead of one token at a time, according to Google's launch thread and the official launch post.
- The core pitch is speed: Google on parallel generation says the model can exceed 1,000 tokens per second and deliver up to 4x faster output, while the developer guide pins that to 700+ tok/s on RTX 5090 and 1,000+ tok/s on H100.
- Google is also positioning it as a local model, not a straight Gemma 4 replacement: the official launch post says standard autoregressive Gemma 4 still wins on quality, and Google on parallel generation says quantized deployment fits within 18 GB VRAM.
- The interesting engineering detail is that DiffusionGemma keeps generation left to right across blocks, but denoises 256 tokens in parallel inside each block, as the developer guide and Google's diffusion explainer both describe.
- Day one support landed fast: vLLM's launch post calls it the first diffusion LLM natively supported in vLLM, while UnslothAI's rollout post pushed GGUF weights and local run instructions within hours.
You can read Google's developer guide, inspect the vLLM integration writeup, and even see a llama.cpp PR racing to optimize diffusion throughput on day one. The weirdly useful detail is that Google's own launch post spends as much time on where DiffusionGemma does not win, high-QPS cloud serving and maximum-quality outputs, as on the headline speed claim.
256-token canvas
DiffusionGemma's main change is not a new tokenizer trick or a scheduler tweak. It replaces token-by-token decoding with iterative denoising over a fixed 256-token canvas.
According to the developer guide, each block works in two phases:
- Prefill and commit: causal attention writes prompt or finalized block state into the KV cache.
- Denoising: bidirectional attention lets every position on the 256-token canvas attend to every other position.
- Block handoff: once a block converges, it is committed left to right and the next 256-token canvas starts.
That block-autoregressive setup matters because it preserves long-form left-to-right generation while giving the model a chance to self-correct inside each block. Google's developer guide calls out code infilling, inline editing, math, and other non-linear structures as the places where that bidirectional pass is useful.
Speed and tradeoffs
Google is selling DiffusionGemma on hardware utilization. The launch materials repeatedly frame ordinary local decoding as memory-bandwidth-bound, and diffusion as a way to turn the same accelerator into a more compute-heavy workload.
The concrete claims from Google break down like this:
- Model size: 26B total parameters, 3.8B active at inference, per the developer guide.
- Local footprint: quantized runs fit in 18 GB VRAM, according to Google on parallel generation.
- Peak throughput: 700+ tok/s on RTX 5090 and 1,000+ tok/s on H100, per the developer guide.
- Headline speedup: up to 4x faster than other Gemma 4 models, according to Google's launch thread and demishassabis's reaction post.
Google also buried the caveats in unusually plain language. The official launch post says output quality is lower than standard Gemma 4, says the speedup is aimed at local and low-concurrency inference, and says high-QPS cloud serving can erase the advantage or even raise serving costs. The same post adds another platform caveat: Apple Silicon's unified-memory design may not see the same acceleration.
Serving path
The most interesting infrastructure reveal came from vLLM. Instead of building a separate serving stack, the team reused the speculative decoding path already inside vLLM.
According to the vLLM integration post, the implementation hinges on a few diffusion-specific pieces:
- ModelState hooks manage per-request canvas state, self-conditioning, and attention mode switches.
- A custom sampler runs the denoise loop, accepts confident positions under an entropy budget, and re-noises the rest.
- Dynamic per-sequence causal attention lets prefill, denoise, and commit requests coexist in one batch.
- Zero-token denoise steps reuse speculative-decoding accounting without moving the KV cache until a 256-token block commits.
That let vLLM keep the scheduler, model runner, and much of the Gemma 4 backbone path intact. The post also says the result matches Hugging Face reference accuracy while enabling efficient batched serving, and vLLM's launch post cites 1,200+ output tok/s at batch size 1 on a single H200 using FP8.
Local stack rollout
The rollout matters almost as much as the model. Google launched DiffusionGemma with links to Hugging Face weights, a developer guide, and official mentions of vLLM, Hugging Face Transformers, SGLang, MLX, NVIDIA NIM, Model Garden, and future llama.cpp support in the launch post.
The ecosystem filled in the rest quickly:
- vLLM's launch post says DiffusionGemma is the first diffusion LLM natively supported in vLLM.
- UnslothAI's rollout post and UnslothAI on local runs pushed GGUF builds and local instructions immediately.
- A llama.cpp PR was already posting throughput improvements on June 10, before official llama.cpp support landed.
- NVIDIA's deployment post positioned the model across H100, DGX Spark, DGX Station, and RTX systems, with BF16 and NVFP4 deployment paths.
That makes DiffusionGemma feel less like a pure research drop and more like Google trying to seed a new inference pattern into the open model toolchain, starting with local-first stacks where block diffusion has the clearest shot at beating ordinary decoding.