
vLLM 0.20.0 releases TurboQuant 2-bit KV cache, CUDA 13 baseline, and DeepSeek V4 upgrades

vLLM 0.20.0 shipped a new CUDA 13 / PyTorch 2.11 / Transformers v5 baseline, TurboQuant 2-bit KV cache, FA4 MLA defaults, and deeper DeepSeek V4 support. The release resets serving baselines across NVIDIA, AMD, Intel, and ARM-CUDA setups, delivers 4x KV capacity, and gives teams already running V4 a clearer upgrade path.


TL;DR

  • vllm_project's release thread makes vLLM 0.20.0 a baseline-reset release, with CUDA 13.0.2, PyTorch 2.11, Python 3.14, and Transformers v5 all moving together as breaking defaults.
  • The biggest serving-side change in the release thread is TurboQuant 2-bit KV cache, which vLLM says delivers 4x KV capacity, alongside FA4 returning as the default MLA prefill backend on SM90+.
  • vllm_project's DeepSeek V4 update turns this into more than a compatibility bump, adding Blackwell-only MegaMoE plus fixes for Hopper crashes, long-context hangs, a reasoning-accuracy regression, and broken tool-argument rendering.
  • One practical infrastructure detail from vllm_project's PyTorch note is that PyTorch 2.11 is the first release with a default aarch64 CUDA wheel on PyPI, which removes the old extra-index setup for GB200 and Grace-Blackwell boxes.
  • The release also widens model and hardware coverage: vllm_project's MiMo-V2.5 post added day-0 MiMo-V2.5 support, while vllm_project's FP8 KV-cache post points to new FA3 accumulation work that lifted a 128k needle-in-a-haystack test from 13% to 89%.

You can jump straight to the official release notes, compare the updated DeepSeek V4 recipe, and read the linked AWS × Red Hat AI FP8 KV-cache writeup. Meanwhile, the HN discussion summary around DeepSeek V4 kept circling back to something operational, not theoretical: Flash looked like the practical serving target, while Pro still looked constrained in the wild.

Baseline shift

The release notes and thread both treat the new software stack as a breaking baseline, not a routine dependency refresh: CUDA 13.0.2, PyTorch 2.11, Python 3.14, and Transformers v5 now move together in vLLM 0.20.0 (official release notes).

According to vllm_project's PyTorch note, the PyTorch piece matters on its own because 2.11 is the first release with a default aarch64 CUDA wheel on PyPI. That means GB200 and Grace-Blackwell setups can get a working CUDA build from a plain pip install torch, without the extra package index that ARM-CUDA installs used to require.
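
If you want to confirm that on a Grace-Blackwell box, a quick post-install check is enough; the snippet below is a minimal sketch, and the exact version strings will depend on whichever wheel PyPI resolves.

```python
import platform

import torch

# Minimal sanity check after a plain `pip install torch` on an aarch64 host
# (no extra index URL). Version strings shown are expectations, not guarantees.
print(platform.machine())         # expect "aarch64" on GB200 / Grace-Blackwell
print(torch.__version__)          # expect a 2.11.x build per the release notes
print(torch.version.cuda)         # CUDA toolkit the wheel was built against
print(torch.cuda.is_available())  # True only if the default PyPI wheel is CUDA-enabled
```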

The hardware matrix widened at the same time, with the new baseline spanning NVIDIA, AMD, Intel, and ARM-CUDA setups.

TurboQuant and FA4

The core performance inventory is unusually dense, so it reads better as a parts list than prose:

  • TurboQuant 2-bit KV cache: 4x KV capacity, with FA3 and FA4 prefill support, per vllm_project's engine update (see the capacity sketch after this list)
  • FA4 MLA prefill: re-enabled as the default backend for head-dim 512 plus paged-KV on SM90+, per the same engine update
  • Model Runner V2: Eagle prefill full CUDA graph, multiple prompt logprobs, and a stale-token accuracy fix, per vllm_project's thread
  • vLLM IR: initial skeleton plus an rms_norm op, which vllm_project describes as groundwork for future kernel work
  • Fused RMS norm: batch-invariant path with a reported 2.1% end-to-end latency gain, per the release thread
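
The 4x figure in the first bullet is easy to sanity-check with back-of-the-envelope KV-cache math. The sketch below is illustrative only: the layer count, head count, and head dimension are made-up numbers, the 8-bit baseline is an assumption, and real 2-bit layouts carry scale and metadata overhead that is not modeled here.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bits: int) -> float:
    """Approximate KV-cache footprint per token, keys plus values, across all layers."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = K and V
    return values_per_token * bits / 8

# Hypothetical model shape; not DeepSeek V4's real dimensions.
layers, kv_heads, head_dim = 60, 8, 128

baseline_8bit = kv_bytes_per_token(layers, kv_heads, head_dim, bits=8)
turbo_2bit = kv_bytes_per_token(layers, kv_heads, head_dim, bits=2)

# Against an 8-bit KV baseline, a 2-bit cache fits ~4x as many tokens in the
# same memory budget, which is where a "4x KV capacity" claim comes from.
print(baseline_8bit / turbo_2bit)  # 4.0
```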

That KV-cache work connects to a separate technical note. In vllm_project's FP8 KV-cache post, vLLM says a two-level accumulation fix in FA3 pushed a 128k needle-in-a-haystack result from 13% to 89% while preserving FP8 decode speedups.
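
The accumulation part of that fix is easy to see in miniature. The toy below is not the FA3 kernel; it only illustrates the general two-level idea that summing a very long sequence in a narrow dtype degrades, while accumulating short blocks and combining the partial sums in a wider dtype does not.

```python
import numpy as np

# 128k values of 1.0: the true sum is 131072.
x = np.ones(131_072, dtype=np.float16)

# Single-level accumulation in float16: the running sum stalls once adding 1.0
# no longer changes the value (around 2048 for float16).
naive = np.float16(0.0)
for v in x:
    naive = np.float16(naive + v)

# Two-level accumulation: sum short blocks in float16, then combine the block
# partials in float32. Analogous in spirit to the two-level fix in the post.
partials = x.reshape(-1, 512).sum(axis=1, dtype=np.float16)
two_level = partials.astype(np.float32).sum()

print(naive)      # stalls far below the true sum
print(two_level)  # 131072.0
```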

DeepSeek V4 upgrades

Hacker News discussion: DeepSeek v4 (2.1k upvotes · 1.6k comments)

The DeepSeek V4 portion is the clearest example of 0.20.0 acting like an upgrade path, not just a new tag. For teams already running the day-0 Docker image, vllm_project's DeepSeek V4 post lists one major new kernel path and five fixes that did not ship in that first image.

Community discussion around the model gives some extra context for why those fixes matter. The HN discussion summary highlights comments arguing that V4 Flash looked like the practical serving target because it was cheap and fast, while other commenters focused on coding-agent behavior, published weights, and deployment constraints across NVIDIA and Ascend hardware.

MiMo and hybrid-attention fixes

vLLM used the 0.20.0 rollout to show two more edges of the stack.

First, vllm_project's MiMo-V2.5 announcement added day-0 support for Xiaomi's MiMo-V2.5 and MiMo-V2.5-Pro. The attached recipe screenshot points to the kind of deployment profile vLLM is targeting here: a 1.02T total, 42B active MoE reasoning model, native FP8 weights, a 1,048,576-token context window, and speculative decoding configured through MTP.
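
A launch sketch for that profile might look roughly like the following. Treat everything here as an assumption for illustration: the Hugging Face repo id, the tensor-parallel degree, and the speculative-config method string are not taken from the recipe screenshot, which remains the authoritative reference.

```python
from vllm import LLM, SamplingParams

# Hypothetical offline-engine sketch for a MiMo-V2.5-style recipe. The repo id,
# parallelism, and speculative method name below are placeholders, not the
# values from vLLM's published recipe.
llm = LLM(
    model="XiaomiMiMo/MiMo-V2.5",       # placeholder repo id
    tensor_parallel_size=8,             # assumption; size this to your GPUs
    max_model_len=1_048_576,            # the 1M-token context window from the recipe
    kv_cache_dtype="fp8",               # optional; pairs with the native FP8 weights
    speculative_config={                # MTP speculative decoding, per the recipe
        "method": "mtp",                # method string is an assumption
        "num_speculative_tokens": 1,
    },
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```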

Second, vllm_project's FP8 KV-cache note adds a new --kv-cache-dtype-skip-layers flag for hybrid-attention models such as gpt-oss. That is a small flag with a very specific job: keeping the FP8 KV-cache path usable on architectures where attention behavior is not uniform across layers.
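
For a rough picture of how that combines with the existing FP8 KV-cache option, a server launch might look like the sketch below. The flag name comes from the post, but its accepted value format is not documented there, so the placeholder value and the gpt-oss model id are assumptions.

```python
# Hypothetical `vllm serve` launch combining the existing FP8 KV-cache path with
# the new skip-layers flag. "<layers>" is a placeholder, not real syntax, and the
# model id is just an example of the gpt-oss family the post names.
cmd = [
    "vllm", "serve", "openai/gpt-oss-20b",
    "--kv-cache-dtype", "fp8",                   # existing FP8 KV-cache option
    "--kv-cache-dtype-skip-layers", "<layers>",  # new flag from the release
]
print(" ".join(cmd))
```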
