DFlash adds Qwen3-8B speculator with 82.2% first-token acceptance
Posts said Qwen3-8B now has a DFlash speculator with 82.2% first-token acceptance and 3.74 accepted tokens per step, alongside broader DFlash claims of over 6x lossless acceleration. It matters because the release turns a decoding paper into a concrete speculative-inference artifact engineers can test against existing Qwen stacks.

TL;DR
- QuixiAI's repost of Red Hat AI said Qwen3-8B now has a DFlash speculator, with 82.2% first-token acceptance on math reasoning and 3.74 accepted tokens per step.
- According to AlphaSignalAI's overview, DFlash replaces the usual autoregressive drafter with a lightweight block diffusion model that predicts multiple tokens in one forward pass.
- baseten's benchmark post put the single-B200 comparison at roughly 3x speedup for DFlash versus roughly 2x for EAGLE on Qwen3-8B, with DFlash drafting 8 to 16 tokens in parallel per pass.
- AlphaSignalAI's thread also claimed over 6x lossless acceleration overall, up to 2.5x faster than EAGLE-3, plus out-of-the-box support for Qwen, LLaMA, and Gemma backbones.
You can jump straight to the DFlash repo, check baseten's side-by-side speedup post for the single-GPU framing, and see QuixiAI's repost claim that DFlash is already running in a production inference stack.
Qwen3-8B speculator
The immediate ship here is not the whole DFlash paper story; it is a concrete speculator artifact for Qwen3-8B. The Red Hat AI repost via QuixiAI attached the two numbers engineers actually use to judge speculative decoding quality: 82.2% first-token acceptance on math reasoning, and 3.74 average accepted tokens per step.
Those numbers matter because speculative decoding only pays off when the target model keeps accepting drafted tokens. baseten's comparison gives the operational version of that claim: on a single B200 with Qwen3-8B, DFlash reached about 3x speedup while EAGLE sat around 2x.
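To see why those two numbers translate into wall-clock wins, here is a back-of-the-envelope sketch of the speedup they imply. The cost model is an assumption, not from the posts: it treats one verification pass per block as roughly one target-model pass and parameterizes the drafter's relative cost.

```python
# Rough speedup estimate from the reported acceptance numbers.
# Assumption (not from the posts): verifying a drafted block costs about
# one target-model forward pass, and the drafter costs `draft_cost` of one.

def estimated_speedup(accepted_per_step: float, draft_cost: float) -> float:
    """Tokens produced per unit of target-model compute, vs. a 1.0 baseline.

    Each step emits the accepted drafted tokens plus one token from the
    verification pass itself (drop the +1 if the reported figure already
    counts that bonus token), and costs one target pass plus drafting.
    """
    tokens_per_step = accepted_per_step + 1
    cost_per_step = 1 + draft_cost
    return tokens_per_step / cost_per_step

# Reported: 3.74 accepted tokens per step for the Qwen3-8B speculator.
for draft_cost in (0.1, 0.25, 0.5):
    print(f"draft_cost={draft_cost:.2f} -> ~{estimated_speedup(3.74, draft_cost):.2f}x")
```

With 3.74 accepted tokens and a drafter costing around a quarter of a target pass, this crude model lands near 3.8x; once real verification, scheduling, and memory overheads are counted, that is broadly consistent with the ~3x baseten reported on a single B200.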
Block diffusion drafter
AlphaSignalAI's summary says the core design change is swapping the usual sequential drafter for a block diffusion model that predicts multiple tokens in one pass, while conditioning that drafter on context pulled from the target model.
That lines up with baseten's implementation note, which described EAGLE as one forward pass per token and DFlash as 8 to 16 tokens predicted in parallel per pass. In practice, the pitch is simple: speculative decoding already parallelizes verification, and DFlash tries to parallelize the drafting step too.
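To make the pass-count difference concrete, here is a minimal greedy-acceptance sketch of one speculative step with a block drafter. `draft_block` and `target_greedy` are hypothetical stand-ins, not DFlash's API, and production systems verify with rejection sampling rather than exact-match greedy; but the shape of the loop is the same: one drafter pass and one target pass per block, versus one drafter pass per token for an EAGLE-style drafter.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_block: Callable[[List[int], int], List[int]],   # one pass -> k tokens
    target_greedy: Callable[[List[int]], List[int]],      # greedy next-token per position
    block_size: int = 16,
) -> List[int]:
    """One speculative step: draft a block, verify with the target, and
    return the accepted tokens plus one bonus token from the target.

    `target_greedy(tokens)[i]` is the target's greedy choice for the token
    after tokens[: i + 1]; a single target forward pass over the whole
    drafted block yields all of these positions at once.
    """
    n = len(prefix)
    draft = draft_block(prefix, block_size)    # ONE drafter pass for k tokens
    choices = target_greedy(prefix + draft)    # ONE target pass verifies all k
    accepted = 0
    while accepted < len(draft) and draft[accepted] == choices[n - 1 + accepted]:
        accepted += 1
    bonus = choices[n - 1 + accepted]          # target's own token at the cut
    return draft[:accepted] + [bonus]
```

The step always emits at least the bonus token, so a bad block never stalls decoding; the 82.2% first-token acceptance figure is what keeps `accepted` from collapsing to zero on most steps.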
Backbones and runtimes
The broader rollout is already wider than one checkpoint. AlphaSignalAI's thread said DFlash works with vLLM, SGLang, and MLX, supports Qwen, LLaMA, and Gemma backbones out of the box, and is open sourced in the GitHub repo.
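For a sense of what "works with vLLM" could look like in practice, here is a hypothetical wiring through vLLM's generic speculative-decoding config. The `method` string and the speculator checkpoint name are placeholders, and whether DFlash registers into this exact interface is an assumption; only the runtimes and backbones themselves come from the thread.

```python
# Hypothetical vLLM wiring for a DFlash speculator; the "method" value
# and checkpoint path are assumptions, not confirmed DFlash identifiers.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "dflash",                       # assumed registration name
        "model": "<dflash-qwen3-8b-speculator>",  # placeholder checkpoint
        "num_speculative_tokens": 8,              # posts cite 8-16 drafted per pass
    },
)

outputs = llm.generate(["Prove that 17 is prime."], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```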
Another QuixiAI repost added that DFlash is already running in a production inference stack, with more draft models coming soon. That turns this from a neat decoding result into something closer to an engine component teams can drop into existing Qwen-serving paths.