Release · June 15, 2026

SGLang adds DFlash and Spec V2 with 4.3x Qwen3.5-397B-A17B throughput

LMSYS and Modal shipped DFlash plus Spec V2 in SGLang, claiming more than 4.3x baseline throughput and 1.5x native MTP throughput on Qwen3.5-397B-A17B. The release targets lower latency and serving cost for very large open models.

TL;DR

lmsysorg's launch thread says DFlash plus Spec V2 is now SGLang's default speculative decoding path, with more than 4.3x baseline throughput and 1.5x native MTP throughput on Qwen3.5-397B-A17B.
The draft model is doing more than standard next-token drafting: lmsysorg's launch thread describes block diffusion that generates a full token block in one forward pass, plus KV injection to raise acceptance.
Spec V2's overlap scheduler is a separate gain on top of the drafter: lmsysorg's launch thread credits it with about 33% more end-to-end throughput on Qwen3-8B at concurrency 32.
modal's Hugging Face reply says the collaborators published identical drafter weights and benchmark reproduction scripts, which makes this one of the more inspectable inference-speed launches.

You can follow modal's link to the official LMSYS writeup, inspect the Hugging Face drafter release, and see that the launch is really two changes bundled together: a new draft model design and a scheduler that hides host-device sync overhead. The Hugging Face release also exposes a tuning detail that did not fit in the tweets: block size 8 for higher concurrency and block size 16 for longer accept lengths.

DFlash throughput

The headline number is tied to one concrete setup: HumanEval at concurrency 1 on 8x B200s. In that configuration, modal's benchmark thread and lmsysorg's launch thread both put DFlash at more than 4.3x baseline throughput and 1.5x the throughput of native MTP.

The Hugging Face release linked in modal's reply fills in the wider shape of the results. It describes DFlash as a draft model paired with Qwen3.5-397B-A17B inside a speculative decoding server, not a standalone checkpoint, and reports gains across concurrency 1 through 32 on workloads including GSM8K, MATH500, HumanEval, MBPP, and MT-Bench.

Block diffusion and KV injection

LMSYS boiled the new drafter down to two mechanics:

Block diffusion: one forward pass drafts a full token block instead of a single next token, according to lmsysorg's launch thread.
KV injection: target-model features are fed into every draft layer's KV cache, again per lmsysorg's launch thread, to improve acceptance.

The official LMSYS blog, linked from modal's thread, adds the more interesting detail: KV injection is the reason this is not just another smaller drafter. The target model's hidden representations are projected into the draft path, so the drafter does less redundant context modeling and spends more of its budget on proposing the next block.
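To make the draft-then-verify loop concrete, here is a minimal sketch of how a drafted block could be checked against the target model. It assumes plain greedy verification; the sources quoted here do not spell out DFlash's actual acceptance rule. The names verify_block, toy_target_next, and toy_draft_block are hypothetical and are not SGLang APIs, and the toy drafter is not a diffusion model: it only mimics the "one call, one block" shape. A real server scores every drafted position in a single target forward pass; the per-position calls below are only for readability.

```python
from typing import Callable, List

Token = int


def verify_block(
    context: List[Token],
    draft_block: List[Token],
    target_next: Callable[[List[Token]], Token],
) -> List[Token]:
    """Greedy verification (an assumption, not DFlash's documented rule).

    Keep drafted tokens while they match the target's greedy choice. At the
    first mismatch, emit the target's token instead and stop. If the whole
    block matches, the target still contributes one extra token.
    """
    accepted: List[Token] = []
    for drafted in draft_block:
        expected = target_next(context + accepted)
        if drafted != expected:
            accepted.append(expected)  # correction token from the target
            return accepted
        accepted.append(drafted)
    accepted.append(target_next(context + accepted))  # bonus token
    return accepted


def toy_target_next(ctx: List[Token]) -> Token:
    """Hypothetical target model: always predicts last token + 1 (mod 100)."""
    return (ctx[-1] + 1) % 100


def toy_draft_block(ctx: List[Token], block_size: int) -> List[Token]:
    """Hypothetical block drafter: returns block_size tokens from one call.

    It deliberately gets the fourth token wrong so that verification has
    something to reject.
    """
    block = [(ctx[-1] + 1 + i) % 100 for i in range(block_size)]
    if block_size > 3:
        block[3] = (block[3] + 50) % 100
    return block


if __name__ == "__main__":
    context = [0]
    block = toy_draft_block(context, block_size=8)
    print(verify_block(context, block, toy_target_next))
```

In this toy run, three drafted tokens are accepted and the fourth is replaced by the target's choice, so one verification step yields four tokens. The better the drafter's proposals match the target, the more of each block survives, which is the role the article attributes to KV injection.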

Spec V2 overlap scheduler

DFlash is only half the ship. lmsysorg's launch thread also says SGLang's Spec V2 overlap scheduler contributed a 33% end-to-end throughput lift.

The LMSYS blog gives the cleaner explanation. Spec V2 reduces host-device synchronization points, then overlaps host-side work such as batch cleanup and KV allocation with GPU work from the adjacent batch. The example in the blog moves Qwen3-8B from roughly 11.4k tokens per second to 15.3k at concurrency 32 on one B200.
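The arithmetic behind that example can be sketched with a deliberately simple timing model. The assumption, which is mine and not the blog's, is that each step costs GPU time plus host time when run serially, and only the larger of the two when host work is fully overlapped. The function names are hypothetical; the two throughput constants are the rounded figures quoted above.

```python
def step_time_serial(gpu_ms: float, host_ms: float) -> float:
    """Host work and GPU work run back to back."""
    return gpu_ms + host_ms


def step_time_overlapped(gpu_ms: float, host_ms: float) -> float:
    """Idealized overlap: host work hides behind the adjacent batch's GPU work."""
    return max(gpu_ms, host_ms)


def relative_gain(before: float, after: float) -> float:
    """Fractional throughput improvement going from `before` to `after`."""
    return after / before - 1.0


# Rounded figures from the blog example (tokens per second).
BLOG_BEFORE_TPS = 11_400.0
BLOG_AFTER_TPS = 15_300.0

blog_gain = relative_gain(BLOG_BEFORE_TPS, BLOG_AFTER_TPS)

# Under the idealized model, with host time fully hidden and the GPU as the
# bottleneck, this is the share of each serial step spent on host work.
implied_host_share = 1.0 - BLOG_BEFORE_TPS / BLOG_AFTER_TPS

if __name__ == "__main__":
    print(f"gain from rounded figures: {blog_gain:.1%}")
    print(f"implied host share of a serial step: {implied_host_share:.1%}")
```

The rounded figures work out to a gain of roughly a third, in line with the quoted 33% once rounding is taken into account. Read through this idealized model, about a quarter of each serial step would have been host-side work, though the real scheduler is unlikely to hide all of it.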

That split matters because the launch bundles an algorithmic change with a systems change. One part drafts better, the other wastes less time around the draft loop.

Weights, scripts, and tuning knobs

The collaborators published identical drafter weights on Hugging Face, according to modal's Hugging Face reply, along with scripts that reproduce the benchmark against MTP. That is unusually useful for a serving-speed claim, where the missing details are usually the whole story.

The Hugging Face page linked in modal's Hugging Face reply also exposes the first practical tuning knob. It recommends block size 8 for higher concurrency, while block size 16 gets longer accept lengths and stronger concurrency-1 throughput. It also lists the benchmark stack: 8x NVIDIA B200s, tensor parallelism, continuous batching, and five runs per config.
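One plausible reading of that trade-off can be sketched with the standard speculative decoding estimate for tokens produced per verification step. The model below assumes every drafted token is accepted independently with the same probability, which is a simplification, and it treats block size as the number of drafted tokens per step, which the release may define differently. The function names are hypothetical, and the sources do not state why block size 8 wins at higher concurrency; the verification-cost explanation is an assumption.

```python
def expected_tokens_per_step(accept_prob: float, block_size: int) -> float:
    """Expected tokens emitted per verify step under i.i.d. acceptance.

    Accepted drafts form a run that stops at the first rejection, and the
    target always adds one token, giving (1 - p^(k+1)) / (1 - p).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    if accept_prob == 1.0:
        return float(block_size + 1)
    return (1.0 - accept_prob ** (block_size + 1)) / (1.0 - accept_prob)


def verified_positions(concurrency: int, block_size: int) -> int:
    """Target-side positions scored per step: grows with both knobs."""
    return concurrency * (block_size + 1)


if __name__ == "__main__":
    for block_size in (8, 16):
        tokens = expected_tokens_per_step(0.8, block_size)  # 0.8 is illustrative
        work = verified_positions(32, block_size)
        print(f"block={block_size}: ~{tokens:.2f} tokens/step, {work} positions verified")
```

Under these assumptions, a larger block raises the ceiling on accepted length, but each step also asks the target to score more positions. At concurrency 1 that extra work is cheap; at high concurrency it multiplies across the batch, which would be consistent with the recommendation to drop to block size 8.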

For SGLang users, lmsysorg's launch thread says DFlash plus Spec V2 is already the default speculative decoding engine. The release is not just a paper result; it shipped as the default path in the serving stack.