Release, May 16, 2026

SGLang 0.5.12 adds DeepSeek V4 serving with ShadowRadix and HiSparse

SGLang v0.5.12 adds native DeepSeek V4 support with ShadowRadix prefix caching, HiSparse CPU-extended KV, MegaMoE kernels, and a Blackwell MLA attention backend. The release broadens hardware targets and aims at more efficient long-context serving in an open runtime.

TL;DR

SGLang 0.5.12 folds DeepSeek V4 serving into mainline, and lmsysorg's launch thread says the initial merge shipped ShadowRadix prefix caching, HiSparse CPU-extended KV, MTP speculative decoding, and a new MegaMoE kernel stack.
Long-context serving is the headline engineering claim: according to lmsysorg's launch thread, HiSparse pushes sparse-attention KV onto CPU and can reach up to 3 times higher long-context throughput.
The release is also a hardware expansion: lmsysorg's hardware list names H100, H200, B200, B300, GB200, GB300, and MI35X support, and lmsysorg's follow-up adds Blackwell MLA work and Hopper-specific MXFP4 MoE paths.
Since the initial V4 merge, lmsysorg's follow-up says SGLang added HiCache under UnifiedRadixTree, W4A4 MegaMoE kernels, pipeline parallelism, TP16 on H100 and H20, and a single Docker image across supported Nvidia hardware.

You can jump straight to the full v0.5.12 release notes, skim lmsysorg's ship list, and compare it with what the Hacker News discussion around DeepSeek V4 cared about most: the two-model split, the 1M-token context window, and the practical gap between Flash and Pro. One useful detail buried in the thread is that the SGLang work is not just about model compatibility; it is a runtime-specific map of which kernels, cache layouts, and parallelism schemes DeepSeek V4 needs to run well.

ShadowRadix and HiSparse

SGLang led with two serving primitives tailored to DeepSeek V4's attention design. lmsysorg's launch thread describes ShadowRadix as native prefix caching for V4's hybrid attention, while the same post says HiSparse extends sparse-attention KV onto CPU.
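For readers new to prefix caching, the general idea can be sketched in a few lines. This is a toy illustration only: it uses a plain per-token trie rather than a compressed radix tree, it stores no KV tensors, and it says nothing about how ShadowRadix handles V4's hybrid attention. The names PrefixCache, TrieNode, match_prefix, and insert are hypothetical and are not SGLang APIs.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One cached token; children are keyed by the next token id."""
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy prefix cache (hypothetical): remembers token sequences already seen."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def match_prefix(self, tokens: list) -> int:
        """Return how many leading tokens of the request are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list) -> None:
        """Record a full token sequence so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = TrieNode()
            node = node.children[tok]


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]          # made-up token ids for a shared prompt
cache.insert(system_prompt + [5, 6])     # first request populates the cache
reused = cache.match_prefix(system_prompt + [9, 9, 9])
print(f"tokens reusable from cache: {reused}")
```

In a real runtime each matched node would point at stored KV entries, so the second request would only need to prefill the tokens after the shared prompt.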

The concrete performance claim is attached to HiSparse, not the release as a whole. In the launch list, SGLang says the CPU-extended KV path can deliver up to 3 times higher long-context throughput.

That lines up with what the HN discussion summary pulled from early DeepSeek V4 readers: manifold-constrained hyper-connections and hybrid attention were already the architecture features people fixated on, so SGLang's first-class support for caching and sparse KV handling looks like the runtime side of that same design.

MegaMoE kernels

The kernel inventory is unusually specific for a point release. Across the launch thread and the follow-up, SGLang breaks the stack into distinct pieces:

W4A8 MegaMoE at launch, per lmsysorg's launch thread
Flash Compressor and Lightning TopK at launch, per the launch list
W4A4 MegaMoE added after launch, per lmsysorg's follow-up
Marlin and FlashInfer MXFP4, labeled W4A16 MoE on Hopper, per the follow-up
Faster KV Compression V2 and a fused SiLU plus clamp plus FP8 quantization kernel, per the follow-up
An optimized mHC pipeline using DeepGemm, fused norm, and fused hc_head, per the follow-up

That last item is a nice tell. The HN discussion summary had already surfaced mHC as one of the architectural ideas behind DeepSeek V4, and SGLang is explicitly naming an mHC serving pipeline rather than treating the model as a generic MoE drop-in.

Parallelism and hardware

SGLang also shipped DeepSeek V4 with a broad parallelism menu instead of a single blessed topology. lmsysorg's launch thread lists tensor parallelism, expert parallelism, context parallelism, data-parallel attention, and prefill-decode disaggregation.

The hardware matrix is similarly broad for day one: H100, H200, B200, B300, GB200, GB300, and MI35X all appear in lmsysorg's launch thread. The later update in lmsysorg's follow-up adds pipeline parallelism, TP16 support on H100 and H20, and one Docker image for all supported Nvidia hardware.

That combination makes the release read less like a model adapter and more like an attempt to make DeepSeek V4 portable across very different cluster shapes.

Blackwell MLA and speculative decoding

The post-launch additions are not just cleanup work. lmsysorg's follow-up calls out a TokenSpeed MLA attention backend on Blackwell with FP8 KV cache for low-latency MLA serving, plus continued work on speculative decoding.

The speculative side now includes:

MTP speculative decoding with in-graph metadata preparation at launch, per lmsysorg's launch thread
Adaptive Spec V2, per lmsysorg's follow-up
EAGLE-3 SWA and newer drafters, per the follow-up
Kimi K2.5 EAGLE-3 MLA, plus EAGLE-3 for Gemma 3 and Gemma 4, per the follow-up

The same update also notes a CUDA 13 DeepEP migration, a swap to deepseek-ai/DeepEP@hybrid-ep, and a FlashInfer pin to 0.6.11.post1, which is exactly the kind of dependency-level detail operators end up needing when a new serving path lands.
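Operators who want to confirm that an environment matches a pin like this can check it with the standard library. The sketch below is a minimal example, not part of SGLang: the version string comes from the follow-up, while the distribution name flashinfer-python and the helper names installed_version and pin_status are assumptions made for illustration.

```python
from importlib import metadata
from typing import Optional

# Pin quoted in the follow-up post.
EXPECTED_FLASHINFER = "0.6.11.post1"
# Assumption: the installed distribution is named "flashinfer-python".
ASSUMED_DIST_NAME = "flashinfer-python"


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def pin_status(installed: Optional[str], expected: str) -> str:
    """Classify an installed version against an exact pin."""
    if installed is None:
        return "missing"
    return "ok" if installed == expected else "mismatch"


status = pin_status(installed_version(ASSUMED_DIST_NAME), EXPECTED_FLASHINFER)
print(f"{ASSUMED_DIST_NAME}: {status}")
```

The comparison is an exact string match, which suits an exact pin; a version range would need a proper version parser instead.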

Model roster

v0.5.12 is also a broader model release than the DeepSeek V4 headline suggests. According to lmsysorg's model-support post, the new support list includes Intern-S2-Preview, MiniCPM-V 4.6, Laguna-XS.2 from Poolside, Ring-2.6-1T from InclusionAI, Gemma 4 MTP, and Trinity-mini.

The same post adds diffusion support for HunyuanVideo ModelOpt FP8 and Qwen Image ModelOpt FP8, and links the full release notes. A separate follow-up also says 35 new contributors landed changes in this release, which helps explain why the changelog reads more like a serving platform sweep than a narrow DeepSeek patch.