AI Primer

vLLM, Unsloth, and llama.cpp add DiffusionGemma support after launch

Google's new diffusion text model picked up same-day runtime support: vLLM added native diffusion-LM serving, Unsloth shipped GGUFs, and llama.cpp got local setup guidance. That shortens the path from release to local and hosted evaluation.

TL;DR

  • Google shipped DiffusionGemma as an Apache 2.0-licensed, 26B MoE diffusion text model that generates whole blocks of text rather than one token at a time. Google claims up to 4x faster generation in its launch thread and in its "Introducing DiffusionGemma" post.
  • The runtime story moved almost immediately: vllm_project's support post called it the first diffusion LLM natively served by vLLM, while UnslothAI's launch post shipped GGUFs and local run guidance the same day.
  • Local experimentation got two paths on day one: UnslothAI's launch post said the model could run locally on 18GB of RAM, and danielhanchen's llama.cpp post showed a dedicated llama-diffusion-cli path for realtime visualization and chat.
  • The interesting technical wrinkle is that vLLM did more than add a model slug: its DiffusionGemma support post says diffusion serving needed bidirectional attention, iterative refinement, block generation, and custom denoising-time sampling.

You can jump straight from Google's developer guide to vLLM's implementation notes, then over to Unsloth's setup doc. danielhanchen's llama.cpp post even exposed a separate diffusion CLI with a live visualization mode, which is more revealing than another benchmark chart.

Runtime support landed the same day

DiffusionGemma's short path from launch to usable tooling is the whole story here. Google announced the model at roughly the same time that vllm_project's support post and UnslothAI's launch post started pointing developers at serving and local execution paths.

That matters because diffusion text models usually arrive with a tooling gap. Here, the gap was narrower: hosted serving had an official vLLM writeup, and local users had quantized weights plus setup docs through Unsloth's guide on day one.

vLLM's serving path

According to vllm_project's support post, DiffusionGemma denoises 256-token blocks in parallel instead of emitting one token at a time, and it reached more than 1,200 output tokens per second at batch size 1 on a single H200 in FP8.
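To put those two numbers side by side, here is a rough back-of-the-envelope calculation. It assumes the reported throughput is sustained and that blocks are produced one after another; neither assumption comes from the post, and the number of denoising steps per block is not stated, so per-step latency cannot be derived from these figures.

```python
# Figures quoted from vLLM's support post; everything derived is an estimate.
BLOCK_TOKENS = 256        # tokens denoised in parallel per block
TOKENS_PER_SECOND = 1200  # reported as "1200-plus", so treat as a floor

# Assumption: steady throughput, blocks generated back to back.
seconds_per_block = BLOCK_TOKENS / TOKENS_PER_SECOND
blocks_per_second = TOKENS_PER_SECOND / BLOCK_TOKENS

print(f"at most ~{seconds_per_block * 1000:.0f} ms per block")
print(f"at least ~{blocks_per_second:.2f} blocks per second")
```

Because the throughput figure is a lower bound, the per-block time is an upper bound under these assumptions.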

The more useful detail lives in the vLLM blog post, which says native support leaned on model runner v2's ModelState abstraction plus the existing speculative decoding path. vLLM says that let it keep scheduler and runner changes minimal while still handling four behaviors autoregressive stacks do not need to juggle:

  • bidirectional attention
  • iterative refinement
  • block-based generation
  • custom sampling at each denoising step

That is a real infrastructure clue: DiffusionGemma support was not framed as a separate sidecar runtime but was folded into the main serving stack.
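The four behaviors in vLLM's list are easier to picture with a toy decoder. The sketch below is not vLLM's implementation and not DiffusionGemma's actual algorithm. It uses confidence-based unmasking, one simple scheme for block-wise diffusion-style decoding, and every name in it (`MASK`, `make_toy_denoiser`, `decode_block`) is hypothetical. The "model" is a stand-in that already knows the answer, so only the control flow is meaningful. Unlike the refinement the post describes, this toy version never revises a token once it is committed.

```python
MASK = None  # hypothetical marker for a position that is not yet committed


def make_toy_denoiser(target):
    """Stand-in for a model. Returns a function that proposes a
    (token, confidence) pair for every position in the block."""
    def denoise(block):
        proposals = []
        for i in range(len(block)):
            # Bidirectional context: look at both the left and right neighbor.
            neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(block)]
            known = sum(block[j] is not MASK for j in neighbors)
            # Confidence rises as more neighbors are already committed.
            proposals.append((target[i], 0.5 + 0.25 * known))
        return proposals
    return denoise


def decode_block(denoise, block_len, steps):
    """Fill one block over several denoising steps instead of left to right."""
    block = [MASK] * block_len           # block-based generation
    for step in range(steps):            # iterative refinement loop
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        proposals = denoise(block)       # all positions scored in one pass
        # Custom per-step sampling: commit the most confident share of the
        # remaining positions so the block is complete by the final step.
        k = -(-len(masked) // (steps - step))  # ceiling division
        chosen = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in chosen:
            block[i] = proposals[i][0]
    return block


target = list("block of tokens!")
result = decode_block(make_toy_denoiser(target), block_len=len(target), steps=4)
print("".join(result))
```

The point of the sketch is the shape of the loop: each step scores the whole block at once and commits tokens in confidence order, not position order, which is the kind of control flow an autoregressive scheduler does not normally have to accommodate.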

Local paths: Unsloth and llama.cpp

Unsloth's angle was speed to local use. UnslothAI's launch post said the 26B-A4B model could run locally on 18GB RAM, linked GGUF weights, and pointed people at a setup guide. That guide adds a few concrete claims: 256K context, support for text, image, and video inputs, and a documented path through either Unsloth Studio or llama.cpp.
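The 18GB figure implies aggressive quantization, which a quick estimate makes concrete. The calculation below assumes 26B total parameters (from the model name), treats 1GB as 10^9 bytes, and pretends the entire RAM budget goes to weights. In practice the KV cache, activations, and runtime overhead also need memory, so the real average bits per weight would be lower than this ceiling. None of these derived numbers come from Unsloth's post.

```python
TOTAL_PARAMS = 26e9   # "26B" from the model name; assumed to be total parameters
RAM_BYTES = 18e9      # 18GB claim, assuming 1GB = 10**9 bytes

# Ceiling on average bits per weight if all RAM held nothing but weights.
bits_per_param_ceiling = RAM_BYTES * 8 / TOTAL_PARAMS

# For comparison: size of the same weights stored unquantized at 16 bits each.
fp16_gb = TOTAL_PARAMS * 2 / 1e9

print(f"ceiling: ~{bits_per_param_ceiling:.1f} bits per weight")
print(f"16-bit weights alone: ~{fp16_gb:.0f} GB")
```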

The llama.cpp route was more experimental. danielhanchen's post showed two modes exposed through llama-diffusion-cli:

  • normal chat CLI mode
  • realtime diffusion visualization mode

A llama.cpp pull request tied to that post describes the support as preliminary and builds a dedicated diffusion binary rather than dropping straight into the standard llama-cli or server path. The same post also linked a Sudoku fine-tuning notebook and argued diffusion generation is stronger on refinement-style tasks because it can revise earlier tokens during generation.

Official toolchain

Google's own docs rounded out the launch with a broader tool map than the tweets alone suggested. The launch post says developers can serve DiffusionGemma through MLX, vLLM, and Hugging Face Transformers, while the developer guide adds benchmark points of 700-plus tokens per second on an RTX 5090 and 1000-plus on a single H100.

That leaves a clean split across the day-one ecosystem:

  • Google's docs covered the official model, guide, and reference toolchains.
  • vLLM explained how diffusion serving fit into a production runtime.
  • Unsloth packaged local GGUFs and fine-tuning help.
  • llama.cpp exposed a proof-of-concept local runner with its own diffusion-specific CLI.

For a low-key model launch, that is Christmas-come-early infrastructure coverage.
