vLLM releases v0.22.0 with 28.9% FP8 latency cuts and KV offloading
vLLM 0.22.0 shipped DeepSeek V4 hardening, a Rust frontend, batch-invariant Cutlass FP8 paths, and multi-tier KV cache offloading. The release also removes deprecated APIs, so some serving stacks will need upgrade work.

TL;DR
- vllm_project's release post says vLLM 0.22.0 landed with 459 commits from 230 contributors, including 63 first-time contributors.
- The headline perf change is a batch-invariant Cutlass FP8 path that, according to vllm_project's hardware thread, cuts end-to-end latency by 28.9%.
- vllm_project's engine-core breakdown also turns DeepSeek V4 support into a more serious serving path, with NVFP4 fused MoE, fuller CUDA graph support, ROCm work, and multi-tier KV cache offloading.
- The upgrade is not frictionless: according to vllm_project's upgrade notes, the release removes deprecated tokenizer and chat-template locations, drops old MLA prefill arguments, and clears out dead CUDA code.
The release thread is unusually dense for a point update. You get engine changes like an in-tree Rust frontend and Model Runner V2 work, hardware-specific kernels for Blackwell, ROCm, CPU, and RISC-V, plus new model architecture support for MiniCPM-V 4.6, InternS2 Preview, and OpenVLA.
DeepSeek V4 and Model Runner V2
Most of the release weight sits in the serving stack, not model count. vllm_project's engine-core breakdown says DeepSeek V4 now lives in a dedicated package with NVFP4 fused MoE, full and piecewise CUDA graphs, MTP speculative decoding on ROCm, and more fused kernels.
The same post adds four Model Runner V2 changes:
- oracle selection for Qwen3 dense by default
- sleep-mode weight reload
- update_config
- shared KV-cache layers
There is also an escape hatch. According to the engine-core thread, the runtime falls back to Model Runner V1 when KV connectors are present.
FP8, Blackwell, and multi-tier KV offloading
The cleanest number in the release is the FP8 latency claim. vllm_project's hardware thread attributes a 28.9% end-to-end improvement to the batch-invariant Cutlass FP8 path, and adds compile-mode support on SM80 plus an NVFP4 Cutlass linear path.
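To put the 28.9% figure in more familiar terms, a latency reduction can be converted into a speedup factor. The sketch below is plain arithmetic, not a benchmark: the function names and the 100 ms baseline are hypothetical, and it assumes the claim means end-to-end latency dropped to 71.1% of its previous value.

```python
def speedup_from_latency_cut(cut_fraction: float) -> float:
    """Speedup factor implied by cutting latency by `cut_fraction` (0 <= cut < 1)."""
    if not 0.0 <= cut_fraction < 1.0:
        raise ValueError("cut_fraction must be in [0, 1)")
    # New latency is (1 - cut) times the old one, so speedup is its reciprocal.
    return 1.0 / (1.0 - cut_fraction)


def latency_after_cut(baseline_ms: float, cut_fraction: float) -> float:
    """Latency remaining after applying the cut to a baseline measurement."""
    return baseline_ms * (1.0 - cut_fraction)


CLAIMED_CUT = 0.289          # figure quoted from the hardware thread
BASELINE_MS = 100.0          # hypothetical baseline, for illustration only

print(f"speedup: {speedup_from_latency_cut(CLAIMED_CUT):.2f}x")
print(f"latency: {latency_after_cut(BASELINE_MS, CLAIMED_CUT):.1f} ms")
```

Read this way, a 28.9% latency cut corresponds to roughly a 1.4x speedup, which is a useful sanity check when comparing it against claims stated as multipliers.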
The hardware matrix also widened:
- Blackwell gets FlashInfer MoE and FP4 GEMM on SM120 and SM121, plus per-tensor FP8 Cutlass on SM121, per the hardware thread.
- ROCm picks up more DeepSeek V4 functionality, flash sparse MLA Triton kernels, and gluon paged MQA logits, again in the same post.
- CPU and RISC-V get RVV-optimized attention kernels, fused GDN for AMX, and experimental Triton with MRv2, according to vllm_project.
Cache management also moved. The engine-core thread says KV cache offloading now supports multiple tiers, including a Python filesystem secondary tier, DeepSeek V4 support, and Mooncake disk offloading.
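The general idea of multi-tier offloading can be illustrated with a toy model. This is a minimal sketch, not vLLM's implementation: the class name, the tier labels ("gpu", "cpu", "filesystem"), the capacities, and the LRU demotion policy are all assumptions chosen for illustration. Blocks enter the fastest tier, the least recently used block is demoted one tier down when a tier overflows, and a hit in a slower tier promotes the block back to the top.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy multi-tier cache: LRU demotion downward, promotion on access."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # evicted from the slowest tier: the block is dropped
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used in this tier
        while len(store) > cap:
            old_key, old_value = store.popitem(last=False)  # LRU entry
            self._insert(level + 1, old_key, old_value)     # demote it

    def put(self, key, value):
        for _, _, store in self.tiers:
            store.pop(key, None)  # drop any stale copy first
        self._insert(0, key, value)

    def get(self, key):
        for _, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)  # promote to the fastest tier
                return value
        return None

    def location(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None


cache = TieredKVCache([("gpu", 2), ("cpu", 2), ("filesystem", 2)])
for i in range(5):
    cache.put(f"block{i}", f"kv-data-{i}")
print(cache.location("block0"), cache.location("block2"), cache.location("block4"))
```

In this toy run the oldest block ends up in the slowest tier and the newest stays in the fastest one. A real system would also weigh transfer cost and block size, which this sketch ignores.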
Upgrade breaks and new model coverage
The release adds support for MiniCPM-V 4.6, InternS2 Preview, and OpenVLA, according to vllm_project's model-support post. The same thread also calls out speculative decoding updates, including custom callable proposers and Gemma3 and Gemma4 multi-GPU fixes with batched vision encoders.
The migration gotchas are short but real. vllm_project's upgrade notes say vLLM removed deprecated get_tokenizer and resolve_hf_chat_template locations, eliminated deprecated MLA prefill arguments, and deleted dead CUDA kernels and code.
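One way to find code that may be affected is to scan for imports of the two names mentioned above. The sketch below uses only Python's standard `ast` module. It does not know which module paths were removed, so it simply reports every `from ... import` of `get_tokenizer` or `resolve_hf_chat_template` for manual comparison against the upgrade notes. The helper name and the `some_pkg` paths in the sample are hypothetical placeholders, not real vLLM import locations.

```python
import ast

# Names called out in the upgrade notes; the old module paths are not assumed here.
WATCHED_NAMES = {"get_tokenizer", "resolve_hf_chat_template"}


def find_watched_imports(source: str, filename: str = "<string>"):
    """Return sorted (line, module, name) tuples for `from X import name` hits."""
    hits = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name in WATCHED_NAMES:
                    # node.module is None for bare relative imports ("from . import x")
                    hits.append((node.lineno, node.module or "", alias.name))
    return sorted(hits)


# Hypothetical sample: "some_pkg" stands in for whatever your code imports from.
SAMPLE = (
    "import os\n"
    "from some_pkg.tokenizers import get_tokenizer\n"
    "from some_pkg.chat import resolve_hf_chat_template as tmpl\n"
    "from some_pkg.other import unrelated\n"
)

for line, module, name in find_watched_imports(SAMPLE):
    print(f"line {line}: from {module} import {name}")
```

To check a real codebase, read each `.py` file and pass its text to `find_watched_imports`, then compare the reported module paths with the locations listed in the upgrade notes.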
One day later, vllm_project's Laguna XS.2 post showed the team already using the release as a serving showcase: a DFlash speculator drafting eight tokens per forward pass for Laguna XS.2, with claimed 2x to 3x faster decoding and quantized FP8, NVFP4, and INT4 checkpoints.
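For a rough sense of how draft length relates to decoding gains, the textbook speculative-decoding estimate can be worked through. This is a generic model, not a description of DFlash: it assumes each drafted token is accepted independently with probability `alpha`, ignores the cost of running the drafter, and the acceptance rates below are hypothetical. Under those assumptions, the expected number of tokens produced per target-model pass with `k` drafted tokens is (1 - alpha^(k+1)) / (1 - alpha).

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes i.i.d. per-token acceptance probability `alpha` and `k` drafted
    tokens; the target model always contributes one token of its own.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


DRAFT_TOKENS = 8  # draft length quoted in the post
for alpha in (0.5, 0.6, 0.7):  # hypothetical acceptance rates
    tokens = expected_tokens_per_pass(alpha, DRAFT_TOKENS)
    print(f"alpha={alpha}: {tokens:.2f} tokens per target pass")
```

Because drafter overhead is left out, these values are upper bounds on the speedup under the stated assumptions rather than predictions for any specific model.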