AI Primer

vLLM releases v0.22.0 with 28.9% FP8 latency cuts and KV offloading

vLLM 0.22.0 shipped DeepSeek V4 hardening, a Rust frontend, batch-invariant Cutlass FP8 paths, and multi-tier KV cache offloading. The release also removes deprecated APIs, so some serving stacks will need upgrade work.


TL;DR

  • vllm_project's release post says vLLM 0.22.0 landed with 459 commits from 230 contributors, including 63 first-time contributors.
  • The headline perf change is a batch-invariant Cutlass FP8 path that, according to vllm_project's hardware thread, cuts end-to-end latency by 28.9%.
  • vllm_project's engine-core breakdown also hardens DeepSeek V4 serving with NVFP4 fused MoE, fuller CUDA graph support, and ROCm work, and adds multi-tier KV cache offloading.
  • The upgrade is not frictionless: per vllm_project's upgrade notes, the release removes deprecated tokenizer and chat-template locations, drops old MLA prefill arguments, and clears out dead CUDA code.

The release thread is unusually dense for a point update. You get engine changes like an in-tree Rust frontend and Model Runner V2 work, hardware-specific kernels for Blackwell, ROCm, CPU, and RISC-V, plus new model architecture support for MiniCPM-V 4.6, InternS2 Preview, and OpenVLA.

DeepSeek V4 and Model Runner V2

Most of the release's weight sits in the serving stack rather than in new model support. vllm_project's engine-core breakdown says DeepSeek V4 now lives in a dedicated package with NVFP4 fused MoE, full and piecewise CUDA graphs, MTP speculative decoding on ROCm, and more fused kernels.

The same post adds four Model Runner V2 changes:

  • oracle selection for Qwen3 dense by default
  • sleep-mode weight reload
  • update_config
  • shared KV-cache layers

There is also an escape hatch. When KV connectors are present, the engine-core thread says the runtime falls back to Model Runner V1.

FP8, Blackwell, and multi-tier KV offloading

The cleanest number in the release is the FP8 latency claim. vllm_project's hardware thread attributes a 28.9% end-to-end improvement to the batch-invariant Cutlass FP8 path, and adds compile-mode support on SM80 plus an NVFP4 Cutlass linear path.
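To put the 28.9% figure in concrete terms, here is a small arithmetic sketch. It assumes the number is a fractional reduction in end-to-end latency, as the thread describes it; the 100 ms baseline is a made-up illustration, not a measurement from the release.

```python
def latency_after_cut(baseline_ms: float, reduction: float = 0.289) -> float:
    """Latency that remains after a fractional reduction."""
    return baseline_ms * (1.0 - reduction)


def implied_speedup(reduction: float = 0.289) -> float:
    """Speedup factor implied by a fractional latency reduction."""
    return 1.0 / (1.0 - reduction)


baseline_ms = 100.0  # hypothetical baseline, not from the release
print(f"{latency_after_cut(baseline_ms):.1f} ms")  # 71.1 ms
print(f"{implied_speedup():.2f}x")                 # 1.41x
```

In other words, a 28.9% latency cut corresponds to roughly a 1.41x speedup on the same workload, which is a smaller factor than the percentage might suggest at first glance.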

The hardware matrix also widened:

  • Blackwell gets FlashInfer MoE and FP4 GEMM on SM120 and SM121, plus per-tensor FP8 CUTLASS on SM121, per the hardware thread.
  • ROCm picks up more DeepSeek V4 functionality, flash sparse MLA Triton kernels, and gluon paged MQA logits, again in the same post.
  • CPU and RISC-V get RVV-optimized attention kernels, fused GDN for AMX, and experimental Triton with MRv2, according to vllm_project.

Cache management also moved. The engine-core thread says KV cache offloading now supports multiple tiers, including a Python filesystem secondary tier, DeepSeek V4 support, and Mooncake disk offloading.
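The general idea behind a filesystem secondary tier can be sketched in a few lines. This is a conceptual toy, not vLLM's implementation or API: the `TieredCache` class, its method names, and the `.blk` file suffix are all hypothetical, and it assumes keys are safe to use as file names.

```python
import tempfile
from collections import OrderedDict
from pathlib import Path
from typing import Optional


class TieredCache:
    """Toy two-tier cache: a bounded in-memory LRU tier backed by files on disk."""

    def __init__(self, capacity: int, spill_dir: Path) -> None:
        self.capacity = capacity
        self.spill_dir = spill_dir
        self.primary = OrderedDict()  # key -> bytes, oldest entry first

    def _path(self, key: str) -> Path:
        return self.spill_dir / f"{key}.blk"

    def put(self, key: str, block: bytes) -> None:
        self.primary[key] = block
        self.primary.move_to_end(key)
        # Spill least-recently-used blocks to the secondary tier.
        while len(self.primary) > self.capacity:
            old_key, old_block = self.primary.popitem(last=False)
            self._path(old_key).write_bytes(old_block)

    def get(self, key: str) -> Optional[bytes]:
        if key in self.primary:
            self.primary.move_to_end(key)
            return self.primary[key]
        path = self._path(key)
        if path.exists():
            block = path.read_bytes()
            path.unlink()
            self.put(key, block)  # promote back into the primary tier
            return block
        return None


with tempfile.TemporaryDirectory() as tmp:
    cache = TieredCache(capacity=2, spill_dir=Path(tmp))
    for i in range(3):
        cache.put(f"block{i}", bytes([i]) * 4)
    print("block0" in cache.primary)  # False: it was spilled to disk
    print(cache.get("block0"))        # read back from the file tier
```

A real offloading stack has to deal with concerns this toy ignores, such as asynchronous transfers, block hashing, and concurrent access.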

Upgrade breaks and new model coverage

The release adds support for MiniCPM-V 4.6, InternS2 Preview, and OpenVLA, according to vllm_project's model-support post. The same thread also calls out speculative decoding updates, including custom callable proposers and Gemma3 and Gemma4 multi-GPU fixes with batched vision encoders.

The migration gotchas are short but real. vllm_project's upgrade notes say vLLM removed deprecated get_tokenizer and resolve_hf_chat_template locations, eliminated deprecated MLA prefill arguments, and deleted dead CUDA kernels and code.
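One low-risk way to prepare is to find every place a codebase touches those two helpers before upgrading. The sketch below is a plain text scan. The upgrade notes as summarized here do not name the old or new module paths, so it matches the bare identifiers only and leaves the judgment to a human reviewer. The `find_references` helper is hypothetical.

```python
import re
from pathlib import Path

# Identifiers named in the upgrade notes; any hit is a line to review by hand.
NAMES_TO_REVIEW = ("get_tokenizer", "resolve_hf_chat_template")
PATTERN = re.compile(r"\b(" + "|".join(NAMES_TO_REVIEW) + r")\b")


def find_references(root: Path) -> list:
    """Return (path, line number, stripped line) for each matching line under root."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            continue  # skip unreadable files rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((path, lineno, line.strip()))
    return hits
```

Calling `find_references(Path("."))` from a project root lists candidate lines. Each import among them should then be checked against the official release notes for its supported location.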

One day later, vllm_project's Laguna XS.2 post showed the team already using the release as a serving showcase: a DFlash speculator drafting eight tokens per forward pass for Laguna XS.2, with a claimed 2x to 3x decoding speedup, alongside quantized FP8, NVFP4, and INT4 checkpoints.
