vLLM
Open-source serving engine for efficient LLM inference.
Stories
Google's new diffusion text model picked up same-day runtime support: vLLM added native diffusion-LM serving, Unsloth shipped GGUFs, and llama.cpp got local setup guidance. That shortens the path from release to local and hosted evaluation.
vLLM 0.22.0 shipped DeepSeek V4 hardening, a Rust frontend, batch-invariant Cutlass FP8 paths, and multi-tier KV cache offloading. The release also removes deprecated APIs, so some serving stacks will need upgrade work.
The vLLM team shipped more than 10 DeepSeek V4 fixes as developers kept posting V4 Pro and Flash results from coding harnesses and local servers. Use the update if serving bugs, cache behavior, or tool-call reliability are blocking cheaper long-context agent runs.
vLLM 0.20.0 shipped a new CUDA 13 / PyTorch 2.11 / Transformers v5 baseline, TurboQuant 2-bit KV cache, FA4 MLA defaults, and deeper DeepSeek V4 support. The release changes serving baselines across NVIDIA, AMD, Intel, and ARM-CUDA setups, brings 4x KV capacity, and gives teams already running V4 a clearer upgrade path.
Cohere released a 2B speech-to-text model with 14 languages and top Open ASR scores, and upstreamed encoder-decoder optimizations to vLLM in the same launch. It is a self-hosted ASR option, so test accuracy and throughput on your own speech workload.
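For the accuracy half of that test, a common metric is word error rate (WER). The sketch below is a minimal, dependency-free version and assumes you have already collected reference transcripts and model outputs as plain strings. It does not call vLLM or the Cohere model, and the function names `word_edit_distance` and `corpus_wer` are hypothetical helpers written for illustration. Text normalization here is only lowercasing and whitespace splitting, and real evaluations usually normalize punctuation and numerals as well.

```python
from typing import Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # ref word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total word edits divided by total reference words across a test set."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = 0
    total_words = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref_words = ref_text.lower().split()
        hyp_words = hyp_text.lower().split()
        total_edits += word_edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words
```

Summing edits over the whole set before dividing, rather than averaging per-utterance scores, keeps short utterances from dominating the result.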