AI Primer

DeepSeek releases DSpark checkpoints for Qwen3 and Gemma-4

DeepSeek extended DSpark beyond V4 by publishing draft-model checkpoints for Qwen3 and Gemma-4 families and clarifying that DSpark targets higher-throughput serving by controlling verification cost. The release matters because speculative decoding is moving from papers into reusable open checkpoints.

TL;DR

  • DeepSeek published DeepSpec on GitHub, the DSpark paper PDF, and reusable draft-model checkpoints that [cedric_chee's repo roundup](src:0|cedric_chee's repo roundup) surfaced for Qwen3 and Gemma-4 families.
  • The release is about serving speed, not model quality: DeepSeek-V4-Pro-DSpark's model card, shown in [teortaxesTex's screenshot](src:42|teortaxesTex's release thread), says DSpark is the same V4 checkpoint with an extra speculative decoding module attached.
  • DSpark combines a parallel drafter with a tiny sequential head, then uses confidence scores to decide how much of the draft is worth verifying, as [rohanpaul_ai's walkthrough](src:27|rohanpaul_ai's architecture explainer) and [the paper screenshots in teortaxesTex's thread](src:53|teortaxesTex's paper thread) both show.
  • DeepSeek's headline gains came from live serving: according to [ZhihuFrontier's summary](src:2|ZhihuFrontier's release breakdown), V4-Flash saw 51% higher throughput at an 80 tok/s SLA, while [the production charts in teortaxesTex's post](src:7|teortaxesTex's production chart) also show much larger relative gains under stricter settings.
  • The interesting new part is portability. Beyond V4, [cedric_chee's screenshot of the repos](src:0|cedric_chee's repo roundup) and [teortaxesTex's follow-up](src:4|teortaxesTex's checkpoint reaction) show DSpark drafters for Qwen3 4B, 8B, 14B, plus Gemma4 12B, which turns a serving trick into an open checkpoint set.

You can browse the GitHub repo, read the full paper, and open the Hugging Face V4-Pro-DSpark card. The weirdly useful bit is that the public release is not only a paper or code drop: [cedric_chee's initial post](src:0|cedric_chee's repo roundup) caught actual drafter repos for external model families, while [haoailab's reply](src:6|haoailab's complementary note) frames DSpark as a high-concurrency serving design that could even pair with other low-latency drafters.

What shipped

The release has three distinct pieces:

  • DeepSpec on GitHub, which DeepSeek describes as a codebase for training and evaluating speculative decoding draft models, according to [teortaxesTex's release thread](src:42|teortaxesTex's release thread)
  • DSpark paper PDF, titled DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation, linked in [rohanpaul_ai's source post](src:29|rohanpaul_ai's source links)
  • Published drafter checkpoints, which [cedric_chee's screenshot](src:0|cedric_chee's repo roundup) and [teortaxesTex's screenshot](src:4|teortaxesTex's checkpoint reaction) show covering Qwen3 4B, 8B, and 14B, plus Gemma4 12B

The model-card caveat matters. In the DeepSeek-V4-Pro-DSpark model card, captured in [teortaxesTex's screenshot](src:42|teortaxesTex's release thread), DeepSeek says V4-Pro-DSpark is not a new model, but the same checkpoint with an additional speculative decoding module attached.

Semi-autoregressive drafting

DSpark sits between the two usual speculative-decoding tradeoffs that [ZhihuFrontier's comparison chart](src:3|ZhihuFrontier's tradeoff summary) spells out.

  • Eagle3 style drafting keeps token dependencies, but draft latency grows with every extra token.
  • DFlash style drafting predicts a whole block in parallel, but later positions drift because they do not know what earlier sampled tokens became.
  • DSpark combines the two: a heavy parallel block drafts quickly, then a lightweight sequential block repairs local token dependencies before verification.

The sequential block has two variants in the paper excerpts that [teortaxesTex posted](src:53|teortaxesTex's paper thread):

  • Markov head: conditions each position on the immediately previous token with a low-rank transition matrix.
  • RNN head: carries a small recurrent state across the draft block.
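The Markov variant can be pictured with a toy sketch. This is not DeepSeek's implementation: the function names, the additive-bias form, and the greedy pick are all assumptions made for illustration. It only shows the general shape of the idea, where every position gets logits from a parallel pass and a low-rank term keyed on the previous drafted token nudges each choice.

```python
def lowrank_bias(u_row, v):
    """Return u_row @ v, where v is a (rank x vocab) matrix stored as nested lists."""
    vocab = len(v[0])
    return [sum(u_row[r] * v[r][t] for r in range(len(u_row))) for t in range(vocab)]


def markov_refine(parallel_logits, first_prev_token, u, v):
    """Greedy draft: parallel logits plus a low-rank bias from the previous token.

    parallel_logits: one list of vocab-sized logits per draft position (hypothetical
                     output of the parallel drafter).
    u: (vocab x rank) factor, v: (rank x vocab) factor of an assumed transition matrix.
    """
    draft = []
    prev = first_prev_token
    for logits in parallel_logits:
        bias = lowrank_bias(u[prev], v)  # row `prev` of the product u @ v
        scores = [l + b for l, b in zip(logits, bias)]
        prev = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        draft.append(prev)
    return draft


# Toy setup: vocabulary of 3 tokens, rank-1 transition.
u = [[0.0], [1.0], [-1.0]]
v = [[0.0, 0.0, 5.0]]
logits = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

print(markov_refine(logits, 0, u, v))                      # [1, 2, 0]
print(markov_refine(logits, 0, [[0.0], [0.0], [0.0]], v))  # [1, 0, 0], parallel only
```

With the transition zeroed out, later positions ignore what was actually drafted before them. With it switched on, the second position changes its pick once it sees that token 1 came first, which is the kind of local repair the sequential block is described as doing.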

DeepSeek's default lightweight choice appears to be the Markov path. In the training section that [teortaxesTex screenshotted](src:37|teortaxesTex's scheduler and training screenshots), the released setup uses three MoE layers, sliding-window attention with a window of 128, maximum block size γ = 5, and the Markov head for sequential modeling.

Confidence-scheduled verification

The more important systems idea is the scheduler, not the drafter alone.

Instead of verifying a fixed-length draft every round, DSpark scores each drafted position with a confidence estimate, then keeps only the prefix that looks worth the target-model compute. [rohanpaul_ai's plain-English example](src:27|rohanpaul_ai's architecture explainer) describes the loop as keeping E, F, and G, dropping risky H, then letting the target model accept or reject the kept prefix.
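A minimal sketch of that keep-or-drop step, assuming a simple fixed threshold on cumulative acceptance probability. The threshold rule, the function name `select_prefix`, and the example probabilities are illustrative assumptions; the scheduler described in the paper is capacity-aware and more involved.

```python
def select_prefix(cond_accept_probs, min_cumulative=0.5):
    """Return how many leading draft tokens to send to the target model.

    cond_accept_probs[i] is the estimated probability that draft position i is
    accepted, given that every earlier position was accepted.
    """
    cumulative = 1.0
    keep = 0
    for p in cond_accept_probs:
        cumulative *= p  # chance that the whole prefix up to here survives
        if cumulative < min_cumulative:
            break
        keep += 1
    return keep


draft = ["E", "F", "G", "H"]
probs = [0.95, 0.90, 0.80, 0.30]  # made-up confidence scores
kept = draft[:select_prefix(probs)]
print(kept)  # ['E', 'F', 'G']
```

Here the cumulative estimate falls from about 0.68 after G to about 0.21 after H, so H is dropped before the target model spends any compute on it.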

The paper details in [teortaxesTex's screenshots](src:53|teortaxesTex's paper thread) add three implementation details that are easy to miss in summary threads:

  • the confidence head predicts conditional acceptance probability for each draft position
  • DeepSeek applies Sequential Temperature Scaling, or STS, so cumulative acceptance estimates are calibrated rather than just rank-ordered
  • the production scheduler is asynchronous, using a two-step-old capacity estimate so it can work with CUDA graph replay and Zero-Overhead Scheduling instead of stalling the GPU pipeline
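To see why calibration matters more than ranking, it helps to write out what the conditional probabilities imply. The notation here is an assumption for illustration, not the paper's: let $p_i$ be the predicted probability that draft position $i$ is accepted given that positions $1, \dots, i-1$ were accepted, and let $L$ be the number of draft tokens accepted out of a kept prefix of length $n$.

```latex
% Probability that the first k drafted tokens are all accepted
P(L \ge k) = \prod_{i=1}^{k} p_i

% Expected number of accepted draft tokens in a kept prefix of length n
\mathbb{E}[L] = \sum_{k=1}^{n} P(L \ge k) = \sum_{k=1}^{n} \prod_{i=1}^{k} p_i
```

Because the estimates are multiplied together, a small systematic bias in each $p_i$ compounds across the block. A confidence head that only orders positions correctly can still misjudge how long a prefix is worth verifying.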

That last point is catnip for inference nerds. The algorithm in the paper is one thing, but [the production section in teortaxesTex's screenshot](src:37|teortaxesTex's scheduler and training screenshots) says the deployed version was adapted around real serving constraints like jagged capacity curves and precomputed batch scheduling.

Production numbers

DeepSeek compared DSpark against its prior MTP-1 production baseline, not against plain non-speculative decoding.

The public numbers that [ZhihuFrontier summarized](src:2|ZhihuFrontier's release breakdown) and [the paper charts in teortaxesTex's screenshots](src:7|teortaxesTex's production chart) repeat are:

  • V4-Flash: 51% higher aggregate throughput at 80 tok/s per user
  • V4-Flash: up to 661% higher throughput under a stricter interactivity target where the baseline falls into a low-concurrency regime
  • V4-Flash: 60% to 85% faster per-user generation at matched system capacity
  • V4-Pro: 52% higher aggregate throughput at 35 tok/s per user
  • V4-Pro: up to 406% higher throughput under stricter SLA settings
  • V4-Pro: 57% to 78% faster per-user generation at matched system capacity
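"Percent higher" figures are easy to misread next to "x times" claims, so it can help to convert them. The snippet below is plain arithmetic on the throughput numbers listed above; the dictionary labels are shorthand of mine, not DeepSeek's terminology.

```python
def multiplier(percent_higher):
    """Convert 'N% higher' into a multiple of the baseline."""
    return 1 + percent_higher / 100


throughput_gains = {
    "V4-Flash @ 80 tok/s per user": 51,
    "V4-Flash, stricter target (upper bound)": 661,
    "V4-Pro @ 35 tok/s per user": 52,
    "V4-Pro, stricter SLA (upper bound)": 406,
}

for label, pct in throughput_gains.items():
    print(f"{label}: {multiplier(pct):.2f}x the MTP-1 baseline")
```

So 51% higher is about 1.5x the baseline, while "up to 661% higher" means up to roughly 7.6x, not 6.6x.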

Practitioner screenshots lined up with the direction, if not the paper's exact setup. In [cedric_chee's before-and-after test](src:26|cedric_chee's speed comparison), a coding prompt dropped from 214 seconds of thinking time to 116 seconds, and [cedric_chee later noted](src:25|cedric_chee's quality caveat) that the quick vibe check had not shown output-quality regressions so far.

One caveat came from [eliebakouch's read of the paper](src:55|eliebakouch's deep read): DeepSeek did not publish equivalent live production numbers for DFlash or Eagle baselines, only for MTP-1.

Qwen3 and Gemma-4 portability

The most durable part of this drop may be the checkpoints outside DeepSeek's own flagship models.

According to [cedric_chee's repo list](src:0|cedric_chee's repo roundup), DeepSeek shipped DSpark drafters for Qwen3 4B, 8B, and 14B, plus Gemma4 12B. [teortaxesTex's reaction](src:4|teortaxesTex's checkpoint reaction) immediately singled out the Gemma4 12B version as the interesting local-model target, and the [LocalLLaMA post](src:56|LocalLLaMA's link post) quickly turned the Hugging Face card and paper into a community reference point.

There are already signs people are treating the release as modular infrastructure, not a one-off DeepSeek artifact. In [haoailab's reply](src:5|haoailab's latency-vs-throughput distinction), JetSpec's authors contrasted DSpark's goal, high-throughput serving via verification-budget control, with JetSpec's focus on low-latency generation; then [haoailab added](src:6|haoailab's complementary note) that the two approaches look complementary if you swap in a JetSpec-style backbone and keep DSpark's sequential head for concurrency control.

That makes this more than a benchmark paper. It is an open checkpoint set for people experimenting with serving stacks on non-DeepSeek bases.

Training data and serving overhead

Two final details add texture to the release.

First, the training data was not hidden. [maximelabonne's note](src:23|maximelabonne's dataset note) matches the paper excerpt showing DSpark drafters were trained from prompts drawn from Open-PerfectBlend, a 1.3 million sample mix spanning math, code, chat, and instruction-following data.

Second, the speculative module is not free in memory. [teortaxesTex's size note](src:33|teortaxesTex's module-size note) says DSpark adds roughly 7 GB for Flash and 27 GB for Pro. That is a useful reminder that open speculative decoding is now a packaging problem too, not only an algorithm problem.
