vLLM 0.20.0 releases TurboQuant 2-bit KV cache, CUDA 13 baseline, and DeepSeek V4 upgrades
vLLM 0.20.0 shipped a new CUDA 13 / PyTorch 2.11 / Transformers v5 baseline, TurboQuant 2-bit KV cache, FA4 MLA defaults, and deeper DeepSeek V4 support. The release changes serving baselines across NVIDIA, AMD, Intel, and ARM-CUDA setups, including 4x KV capacity and a clearer upgrade path for teams already running V4.

TL;DR
- vllm_project's release thread makes vLLM 0.20.0 a baseline-reset release, with CUDA 13.0.2, PyTorch 2.11, Python 3.14, and Transformers v5 all moving together as breaking defaults.
- The biggest serving-side change in the release thread is TurboQuant 2-bit KV cache, which vLLM says delivers 4x KV capacity, alongside FA4 returning as the default MLA prefill backend on SM90+.
- vllm_project's DeepSeek V4 update turns this into more than a compatibility bump, adding Blackwell-only MegaMoE plus fixes for Hopper crashes, long-context hangs, reasoning accuracy regression, and broken tool-argument rendering.
- One practical infrastructure detail from vllm_project's PyTorch note is that PyTorch 2.11 is the first release with a default aarch64 CUDA wheel on PyPI, which removes the old extra-index setup for GB200 and Grace-Blackwell boxes.
- The release also widens model and hardware coverage: vllm_project's MiMo-V2.5 post added day-0 MiMo-V2.5 support, while vllm_project's FP8 KV-cache post points to new FA3 accumulation work that lifted a 128k needle-in-a-haystack test from 13% to 89%.
You can jump straight to the official release notes, compare the updated DeepSeek V4 recipe, and read the linked AWS × Red Hat AI FP8 KV-cache writeup. Meanwhile, the HN discussion summary around DeepSeek V4 kept circling back to something operational, not theoretical: Flash looked like the practical serving target, while Pro still looked constrained in the wild.
Baseline shift
The release notes and thread both treat the new software stack as a breaking baseline, not a routine dependency refresh: CUDA 13.0.2, PyTorch 2.11, Python 3.14, and Transformers v5 now move together in vLLM 0.20.0 (official release notes).
According to vllm_project's PyTorch note, the PyTorch piece matters on its own because 2.11 is the first release with a default aarch64 CUDA wheel on PyPI. That means GB200 and Grace-Blackwell setups can get working CUDA from plain pip install torch, without the extra package index that ARM-CUDA installs used to require.
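In practice that shrinks the ARM-CUDA install to one command. The fragment below is an illustrative sketch, not from the release notes; the old extra-index URL varies by CUDA build, so it is left as a placeholder:

```
# Before PyTorch 2.11: aarch64 CUDA installs needed PyTorch's own wheel index
# (exact index URL depends on the CUDA build).
pip install torch --extra-index-url <pytorch-cuda-wheel-index>

# With PyTorch 2.11 on GB200 / Grace-Blackwell: the default PyPI wheel carries CUDA.
pip install torch
```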
The hardware matrix widened at the same time:
- NVIDIA: Jetson Thor support, SM100 MoE paths, TRTLLM GEN NVFP4 MoE, per vllm_project's platform update
- AMD: zentorch on Zen CPU, MORI EP, AITER MLA plus Eagle3, and RDNA 3.5 and 4 device IDs, per the same platform update
- Intel: torch 2.11 support, MXFP8 and MXFP4 quantization, and FP8 KV cache, per vllm_project's thread
TurboQuant and FA4
The core performance inventory is unusually dense, so it reads better as a parts list than prose:
- TurboQuant 2-bit KV cache: 4x KV capacity, with FA3 and FA4 prefill support, per vllm_project's engine update
- FA4 MLA prefill: re-enabled as the default backend for head-dim 512 plus paged-KV on SM90+, per the same engine update
- Model Runner V2: Eagle prefill full CUDA graph, multiple prompt logprobs, and a stale-token accuracy fix, per vllm_project's thread
- vLLM IR: initial skeleton plus an rms_norm op, which vllm_project describes as groundwork for future kernel work
- Fused RMS norm: batch-invariant path with a reported 2.1% end-to-end latency gain, per the release thread
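The 4x capacity claim is easy to sanity-check with back-of-the-envelope KV math. The sketch below is illustrative only: the model dimensions are invented, and it assumes the 4x figure is measured against an 8-bit (FP8) KV cache, which the release thread does not spell out.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_elem):
    # K and V each store num_kv_heads * head_dim elements per layer,
    # hence the factor of 2.
    elems = 2 * num_layers * num_kv_heads * head_dim
    return elems * bits_per_elem / 8

# Hypothetical 60-layer model with 8 KV heads of dim 128.
fp8_bytes = kv_bytes_per_token(60, 8, 128, 8)
q2_bytes = kv_bytes_per_token(60, 8, 128, 2)
print(fp8_bytes / q2_bytes)  # → 4.0
```

Against a 16-bit KV cache the same arithmetic gives 8x, which is why the baseline the 4x number is measured from matters.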
That KV-cache work connects to a separate technical note. In vllm_project's FP8 KV-cache post, vLLM says a two-level accumulation fix in FA3 pushed a 128k needle-in-a-haystack result from 13% to 89% while preserving FP8 decode speedups.
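Why a two-level accumulation fix can move a long-context score that much: with one low-precision running sum, small addends eventually fall below the accumulator's rounding step and vanish. The toy below demonstrates the failure mode with float16 (via the struct module) rather than vLLM's actual FA3/FP8 kernels, so it is an analogy for the mechanism, not the fix itself.

```python
import struct

def to_f16(x: float) -> float:
    # Round a Python float to the nearest IEEE half-precision value.
    return struct.unpack("e", struct.pack("e", x))[0]

def naive_sum(values):
    # One low-precision running accumulator: once the sum is large,
    # small addends fall below the rounding step and are silently lost.
    acc = 0.0
    for v in values:
        acc = to_f16(acc + v)
    return acc

def two_level_sum(values, block=128):
    # Level 1: short low-precision partial sums stay small, so each
    # addend is still representable relative to the partial.
    # Level 2: combine the partials in full float64 precision.
    partials = []
    for i in range(0, len(values), block):
        acc = 0.0
        for v in values[i:i + block]:
            acc = to_f16(acc + v)
        partials.append(acc)
    return sum(partials)

values = [0.01] * 10_000  # exact sum: 100.0
print(abs(naive_sum(values) - 100.0) > abs(two_level_sum(values) - 100.0))  # → True
```

The naive sum stalls once the accumulator's spacing exceeds the addend, while the blocked version stays close to the exact total; the same structure is what lets FP8 decode keep its speedup without losing long-context recall.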
DeepSeek V4 upgrades
The DeepSeek V4 portion is the clearest example of 0.20.0 acting like an upgrade path, not just a new tag. For teams already running the day-0 Docker image, vllm_project's DeepSeek V4 post lists one major new kernel path and five fixes that did not ship in that first image.
- MegaMoE via DeepGEMM MegaMoE: opt-in with --moe-backend deep_gemm_mega_moe, Blackwell only, per vllm_project's V4 update
- MTP > 1 crash on Hopper: fixed, per the same update
- top_k indexer correctness: fixed, per vllm_project's post
- Shared-experts SwiGLU clipping: fixed, which vllm_project says resolves an accuracy regression on reasoning requests
- Long-context engine hangs: fixed, per the DeepSeek V4 note
- Tool arguments in chat templates: now rendered correctly, per the same note
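As a deployment sketch: the --moe-backend flag value comes from vllm_project's post, but the model name and parallelism setting below are placeholders, not a tested recipe.

```
# Opt in to the DeepGEMM MegaMoE path (Blackwell only, per the V4 update).
# Model name and --tensor-parallel-size are illustrative.
vllm serve deepseek-ai/DeepSeek-V4 \
  --moe-backend deep_gemm_mega_moe \
  --tensor-parallel-size 8
```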
Community discussion around the model gives some extra context for why those fixes matter. The HN discussion summary highlights comments arguing that V4 Flash looked like the practical serving target because it was cheap and fast, while other commenters focused on coding-agent behavior, published weights, and deployment constraints across NVIDIA and Ascend references.
MiMo and hybrid-attention fixes
vLLM used the 0.20.0 rollout to show two more edges of the stack.
First, vllm_project's MiMo-V2.5 announcement added day-0 support for Xiaomi's MiMo-V2.5 and MiMo-V2.5-Pro. The attached recipe screenshot points to the kind of deployment profile vLLM is targeting here: a 1.02T total, 42B active MoE reasoning model, native FP8 weights, a 1,048,576-token context window, and speculative decoding configured through MTP.
Second, vllm_project's FP8 KV-cache note adds a new --kv-cache-dtype-skip-layers flag for hybrid-attention models such as gpt-oss. That is a small flag with a very specific job: keeping the FP8 KV-cache path usable on architectures where attention behavior is not uniform across layers.
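A hedged invocation sketch: the new flag name is from vllm_project's note, but its value format (comma-separated layer indices) and the model name are assumptions for illustration.

```
# FP8 KV cache overall, but skip quantization on the layers where the
# hybrid-attention pattern makes FP8 unsafe. Layer list is illustrative.
vllm serve openai/gpt-oss-120b \
  --kv-cache-dtype fp8 \
  --kv-cache-dtype-skip-layers 0,1,2
```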