KV Cache
Cache hit rate, offloading, routing, and cache-aware systems.
Stories
Filter storiesvLLM 0.22.0 shipped DeepSeek V4 hardening, a Rust frontend, batch-invariant Cutlass FP8 paths, and multi-tier KV cache offloading. The release also removes deprecated APIs, so some serving stacks will need upgrade work.
SGLang v0.5.12 added native DeepSeek V4 support with ShadowRadix prefix caching, HiSparse CPU-extended KV, MegaMoE kernels, and Blackwell MLA work. The release broadens hardware targets and improves long-context serving efficiency for open runtimes.
DeepSeek briefly published a paper and threads on point-and-bbox reasoning, about 90 KV entries per 800² image, and RL-trained vision experts, then removed the repo and related mentions. The technique looked like a low-token path to computer use and multimodal reasoning in V4-Flash, but availability and reproducibility are now unclear.
vLLM 0.20.0 shipped a new CUDA 13 / PyTorch 2.11 / Transformers v5 baseline, TurboQuant 2-bit KV cache, FA4 MLA defaults, and deeper DeepSeek V4 support. The release changes serving baselines across NVIDIA, AMD, Intel, and ARM-CUDA setups, including 4x KV capacity and a clearer upgrade path for teams already running V4.
Moonshot says its Prefill-as-a-Service setup makes prefill/decode disaggregation practical across datacenters and mixed hardware by shrinking KV cache with Kimi Linear. The paper reports 1.54x throughput and a 64% drop in P90 time-to-first-token, so benchmark the approach before planning production adoption.
New discussion around TurboQuant focuses on its 2.5-bit mixed-precision setup and working PyTorch and llama.cpp implementations. The technique is moving from a research claim into deployable KV-cache compression with concrete porting details.
Google Research said TurboQuant can shrink KV cache storage to 3 bits with roughly 6x less memory, and early implementations already surfaced in PyTorch, llama.cpp, and Atomic Chat. The work targets a core inference bottleneck for long-context serving on local and server hardware.
TurboQuant claims 6x KV-cache memory reduction and up to 8x faster attention on H100s without retraining or quality loss on long-context tasks. If those results hold in serving stacks, teams should revisit long-context cost, capacity, and vector-search design.
Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.
H Company launched Holotron-12B, an open multimodal model for computer-use agents built on a hybrid SSM-attention stack that targets KV-cache bottlenecks. Benchmark it if you need high-concurrency browser agents and want better throughput without giving up web-task accuracy.
oMLX now supports local Claude Code setups on Apple Silicon with tiered KV cache and an Anthropic Messages API-compatible endpoint, with one setup reporting roughly 10x faster performance than mlx_lm-style serving. If you want private on-device coding agents, point Claude Code at a local compatible endpoint and disable the attribution header to preserve cache reuse.