
SGLang supports DeepSeek V4 with 199 tok/s on B200 and 240 tok/s at 900K context

SGLang and Miles published a technical breakdown of their DeepSeek V4 day-zero stack, including ShadowRadix caching, Flash Compressor, FP4 expert-weight handling, and measured B200/H200 throughput. That gives deployers concrete serving and training-path numbers for V4 beyond generic launch-day compatibility claims.


TL;DR

  • lmsysorg's launch thread says SGLang shipped Day-0 DeepSeek V4 support with measured throughput of 199 tok/s for V4 Pro on B200 and 266 tok/s for V4 Flash on H200 at 4K context.
  • The same launch thread says throughput stays high at 900K context, dropping to 180 tok/s on B200 and 240 tok/s on H200, which teortaxesTex's reaction post summarized as roughly a 10% loss.
  • According to lmsysorg's architecture summary, the long-context story hangs on a new ShadowRadix cache design, HiSparse CPU-extended KV, and a compressed-KV path built for hybrid sparse attention.
  • lmsysorg's breakdown also lists the kernel and serving stack explicitly: Flash Compressor, Lightning TopK, FlashMLA, FlashInfer TRTLLM-Gen MoE, DeepGEMM Mega MoE, DeepEP, and PD disaggregation.
  • On the training side, lmsysorg's thread and the LMSYS deep dive position Miles as a verified RL pipeline for V4 with DP, TP, SP, EP, PP, and CP parallelism plus FP8 training.

You can read the full LMSYS deep dive and skim lmsysorg's infographic post for the subsystem map; the weirdest number in the package is still the 900K-context throughput staying so close to the 4K baseline in the launch thread.

Throughput numbers

SGLang's post gives actual serving numbers, not just a compatibility badge.

The reported measurements are:

  • V4 Pro (1.6T parameters) on B200: 199 tok/s at 4K context, 180 tok/s at 900K context
  • V4 Flash (284B parameters) on H200: 266 tok/s at 4K context, 240 tok/s at 900K context

That makes the headline claim unusually concrete for a day-zero support post: near-million-token context with only a modest throughput drop.
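
As a sanity check on the "roughly 10% loss" framing in the reaction post, the relative drop falls straight out of the reported figures. A minimal sketch, using only the tok/s values quoted in the launch thread:

```python
# Reported decode throughput (tok/s) at 4K vs 900K context, per the launch thread.
reported = {
    "V4 Pro on B200": (199, 180),
    "V4 Flash on H200": (266, 240),
}

for config, (tok_4k, tok_900k) in reported.items():
    drop_pct = (tok_4k - tok_900k) / tok_4k * 100
    print(f"{config}: {tok_4k} -> {tok_900k} tok/s ({drop_pct:.1f}% drop)")

# V4 Pro on B200: 199 -> 180 tok/s (9.5% drop)
# V4 Flash on H200: 266 -> 240 tok/s (9.8% drop)
```

Both drops land just under 10%, which matches that characterization.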

ShadowRadix and HiSparse

The core inference reveal is not one trick, but a memory stack built around hybrid attention.

According to the architecture summary, the key pieces are:

  • ShadowRadix: native prefix caching for SWA and compressed KV pools
  • HiSparse: CPU memory extension for sparse-attention KV cache
  • Unified KV and paged compress pool: shared infrastructure for ShadowRadix, HiSparse, and FlashMLA
  • MTP speculative decoding: draft-decode metadata prepared inside the CUDA graph

The attached infographic in lmsysorg's post says HiSparse keeps throughput scaling past the point where a baseline would plateau, which is the missing detail behind the 900K-context numbers.
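
The post does not spell out ShadowRadix or HiSparse internals, but the general pattern they name, prefix-keyed KV reuse in GPU memory with overflow spilled to host RAM, can be sketched in a few lines. Everything below (the class, LRU eviction, the spill policy) is an illustrative assumption, not SGLang's implementation:

```python
from collections import OrderedDict

class ToyPrefixKVCache:
    """Illustrative only: prefix-keyed KV reuse with spill-over to CPU memory.
    Not ShadowRadix/HiSparse; just the general caching pattern they describe."""

    def __init__(self, gpu_capacity: int):
        self.gpu = OrderedDict()   # prefix tuple -> KV blob, kept in LRU order
        self.cpu = {}              # spilled entries, standing in for host-RAM extension
        self.gpu_capacity = gpu_capacity

    def lookup(self, tokens):
        # Longest cached prefix wins, so shared prompt heads are prefilled only once.
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self.gpu:
                self.gpu.move_to_end(key)
                return key, self.gpu[key]
            if key in self.cpu:
                return key, self._promote(key)
        return None, None

    def insert(self, tokens, kv_blob):
        self.gpu[tuple(tokens)] = kv_blob
        while len(self.gpu) > self.gpu_capacity:
            old_key, old_kv = self.gpu.popitem(last=False)
            self.cpu[old_key] = old_kv   # spill to CPU instead of discarding

    def _promote(self, key):
        kv = self.cpu.pop(key)
        self.insert(list(key), kv)
        return kv
```

A production radix cache stores KV at block granularity in a tree rather than as whole-prefix blobs, but the reuse-then-spill flow is the same shape.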

Kernel stack and serving path

LMSYS split the serving path into kernels, parallelism, and deployment plumbing.

The named components are worth pulling out as a reference list:

  • Flash Compressor: fused compressor kernel, described in the infographic post as 10x faster than naive implementations
  • Lightning TopK: top-k indexing, listed in the infographic post at 15 microseconds for 1M context
  • FlashMLA, FlashInfer TRTLLM-Gen MoE, DeepGEMM Mega MoE, TileLang mHC
  • Attention parallelism: DP, TP, CP
  • MoE parallelism: TP, EP, with DeepEP for large-scale EP
  • PD disaggregation: opaque-page transfer via shadow remap, per the infographic post

That is a much fuller serving map than the usual launch-day "works on our stack" claim.
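
One of those names is worth unpacking: "top-k indexing" in a sparse-attention decode path means each query picks the small set of KV blocks it will actually attend to, which in plain PyTorch is a single topk over per-block relevance scores. The sketch below is a rough illustration of that operation, not the Lightning TopK kernel itself, and the block size is an assumption:

```python
import torch

def select_sparse_blocks(query, block_summaries, k):
    """Pick the k most relevant KV blocks per head.

    query:            [heads, head_dim]      current decode-step query
    block_summaries:  [num_blocks, head_dim]  e.g. pooled keys, one vector per KV block
    Returns block indices of shape [heads, k]; only those blocks' KV get read.
    """
    scores = query @ block_summaries.T            # [heads, num_blocks]
    _, block_ids = torch.topk(scores, k, dim=-1)  # the selection step a fused kernel accelerates
    return block_ids

# At 128-token blocks, a 1M-token context is ~8K candidate blocks per head,
# and this selection runs every decode step, which is why a microsecond-level
# top-k matters for the 900K-context numbers above.
q = torch.randn(16, 128)
blocks = torch.randn(8192, 128)
print(select_sparse_blocks(q, blocks, k=64).shape)  # torch.Size([16, 64])
```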

Miles RL pipeline

The training side is almost a separate announcement.

LMSYS says Miles ships a verified RL pipeline for V4 with full-parallel training across DP, TP, SP, EP, PP, and CP, plus TileLang attention kernels, enhanced stability work, and FP8 training support. The infographic in lmsysorg's architecture summary adds one more concrete claim: tensor-level precision checks and an end-to-end run with a growing reward curve.
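
For anyone who has not juggled all six axes at once, the sketch below labels what each one splits. The grouping convention in the comments (SP riding on the TP group, EP carved out of data-parallel ranks for MoE layers) is standard Megatron-style practice and an assumption about Miles' layout, not something the post states; the degree values are made up:

```python
from dataclasses import dataclass

@dataclass
class ParallelPlan:
    dp: int = 1   # data parallel: replicate the model, split the batch
    tp: int = 1   # tensor parallel: split individual weight matrices across GPUs
    sp: int = 1   # sequence parallel: split activations along sequence, usually within the TP group
    ep: int = 1   # expert parallel: shard MoE experts, typically across data-parallel ranks
    pp: int = 1   # pipeline parallel: split layers into stages
    cp: int = 1   # context parallel: split long sequences across ranks for attention

    def world_size(self) -> int:
        # Megatron-style accounting: SP reuses the TP group and EP reuses DP ranks,
        # so only these four axes multiply into the total GPU count.
        return self.dp * self.tp * self.pp * self.cp

# A made-up layout, not Miles' defaults for V4.
plan = ParallelPlan(dp=8, tp=8, pp=2, cp=2, sp=8, ep=8)
print(plan.world_size())  # 256 GPUs
```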

That gives deployers both halves of the system map in one place, inference mechanics for long-context serving and a named RL stack for training at V4 scale.