AI Primer
release

DeepSeek V4-Pro benchmarks at ~90 tok/s after DSpark rollout

Independent measurements after DSpark put DeepSeek V4-Pro around 90 tok/s and cut one run from 214s to 116s. The gain matters because it lowers serving cost, though tuning details and memory overhead are still unclear.


TL;DR

You can read the DeepSpec GitHub repo, check the V4-Pro-DSpark model card, and trace one of the weirder buried details in High-Flyer's HAI-LLM training framework post. The paper also ties its public experiments to Open-PerfectBlend, which maximelabonne's note spotted almost immediately.

Benchmarks

The cleanest external datapoint is teortaxesTex's speed measurement. It shows V4-Pro at 3274 ms to first token and 89 tok/s, with about 21.9K total tokens in the run.

The same thread includes a Flash run at 1859 ms to first token and 131 tok/s (teortaxesTex's Flash measurement). Separately, cedric_chee's timing comparison shows a coding task whose thinking block dropped from 214 seconds to 116 seconds after the DSpark update.
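As a quick sanity check on that second datapoint, the arithmetic below converts the 214 s and 116 s timings into a speedup ratio and a fractional time saving. It is a minimal sketch: the function name is illustrative, and the inputs are simply the two numbers quoted above from a single uncontrolled run.

```python
def speedup_summary(before_s: float, after_s: float) -> tuple[float, float]:
    """Return (speedup ratio, fraction of time saved) for two wall-clock timings."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("timings must be positive")
    ratio = before_s / after_s
    saved = (before_s - after_s) / before_s
    return ratio, saved


# Timings quoted from cedric_chee's comparison (one run each).
ratio, saved = speedup_summary(214.0, 116.0)
print(f"speedup: {ratio:.2f}x, time saved: {saved:.0%}")
```

On those two numbers the run is roughly 1.8x faster, or a little under half the original wall-clock time saved.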


These are not controlled benchmarks, but they do line up with DeepSeek's own live-traffic framing. In eliebakouch's paper screenshot, the V4-Flash and V4-Pro frontier plots show DSpark improving both throughput and per-user generation speed relative to the MTP-1 baseline.

DSpark

DSpark has three moving parts, according to rohanpaul_ai's explainer and rohanpaul_ai's architecture sketch:

  1. A heavy parallel draft model proposes several next tokens at once.
  2. A lightweight sequential stage, usually a Markov head, adjusts each proposal using the previous sampled token.
  3. A confidence scheduler keeps only the prefix worth verifying, then drops the risky suffix.

That third step is the whole trick. The DeepSpec GitHub repo and the paper excerpt in teortaxesTex's calibration screenshot show the project spending real effort on confidence calibration, because fixed-length verification can waste GPU time checking tokens that are likely to be rejected anyway.
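To make the third step concrete, here is a minimal sketch of confidence-based prefix truncation. It assumes the scheduler has an estimated acceptance probability for each drafted token and keeps the longest prefix whose joint acceptance probability stays above a threshold. The function name, the 0.5 threshold, and the product rule are illustrative assumptions, not the rule described in the paper.

```python
from typing import Sequence


def confident_prefix_len(accept_probs: Sequence[float], min_joint: float = 0.5) -> int:
    """Length of the longest draft prefix whose joint acceptance probability
    (product of per-token estimates) is still >= min_joint.

    Hypothetical rule for illustration; the real scheduler may differ.
    """
    joint = 1.0
    keep = 0
    for p in accept_probs:
        joint *= p
        if joint < min_joint:
            break  # everything from here on is the "risky suffix"
        keep += 1
    return keep


# A made-up draft block of 5 tokens with made-up confidence estimates.
draft = ["def", " main", "(", "):", "\n"]
probs = [0.95, 0.90, 0.80, 0.40, 0.30]
k = confident_prefix_len(probs)
print(draft[:k])  # only these tokens would be sent for verification
```

With these made-up numbers the first three tokens are kept and the last two are dropped, so the target model spends no verification work on positions that were likely to be rejected.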

The paper's implementation details are unusually concrete. Screenshots in teortaxesTex's scheduler details describe a three-layer MoE parallel backbone, a maximum block size of 5, sliding-window attention of 128, and an asynchronous scheduler that uses confidence predictions from two steps earlier so it can coexist with CUDA graph replay and Zero-Overhead Scheduling.
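The two-step delay is easier to picture with a toy loop. The sketch below assumes the scheduler picks each step's verification length from a confidence estimate produced two steps earlier, and falls back to a default length until enough history exists. Only the block cap of 5 and the two-step lag come from the details above; the class name, the default length, and the linear confidence-to-length mapping are hypothetical.

```python
from collections import deque

MAX_BLOCK = 5  # maximum block size reported in the paper screenshots
LAG = 2        # scheduler reads confidence from two steps earlier


class LaggedScheduler:
    """Toy scheduler: the block length for step t depends on confidence from step t - LAG."""

    def __init__(self, default_len: int = 1) -> None:
        self.default_len = default_len
        self._pending: deque = deque()  # confidences not yet consumed

    def next_block_len(self) -> int:
        # Not enough history yet: use the fallback length.
        if len(self._pending) < LAG:
            return self.default_len
        conf = self._pending.popleft()
        # Hypothetical mapping: scale confidence to a length in [1, MAX_BLOCK].
        return max(1, min(MAX_BLOCK, int(conf * MAX_BLOCK)))

    def record(self, confidence: float) -> None:
        self._pending.append(confidence)


sched = LaggedScheduler()
lengths = []
for conf in [0.9, 0.3, 0.5, 0.7]:  # made-up per-step confidence estimates
    lengths.append(sched.next_block_len())
    sched.record(conf)
print(lengths)
```

The point of the lag in this toy version is that the length for a step is already known before that step's own confidence is computed, so the decision never has to wait on the current step's output.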

What shipped

DeepSeek shipped two public artifacts together: the DeepSpec GitHub repo and the V4-Pro-DSpark model card.

The model card matters because it answers an easy point of confusion. It explicitly says V4-Pro-DSpark is not a new model, only the same checkpoint with an additional speculative decoding module attached, which matches teortaxesTex's clarification.

DeepSeek's own headline numbers, echoed by Yuchenj_UW's post, teortaxesTex's release summary, and eliebakouch's paper screenshot, frame the gain as +51% to +400% throughput improvements depending on the operating point, with live-traffic plots for both Flash and Pro.

Serving economics

The strongest cost angle comes from comparing DeepSeek's newly disclosed paper figures with its older Open Source Week economics. teortaxesTex's economics post points out that the earlier public figure was about 14.8K output tok/s on an 8xH800 node at 20 to 22 tok/s user speed, while the new V4-Pro plots imply far higher per-GPU generation rates.

Using those paper plots plus SemiAnalysis spot-price screenshots, teortaxesTex's margin estimate sketches a back-of-the-envelope calculation: about 5 million tokens per hour per GPU, worth about $4.4 of hourly output at $0.87 per million tokens, against roughly $2.42 per hour for H100 spot pricing. The post is explicitly reductionist, but it explains why a decoding upgrade can move margins without any model-quality change.
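The arithmetic behind that estimate is short enough to reproduce. The sketch below plugs in only the three figures quoted above; the function name is illustrative, and the resulting margin is a derived number that ignores input tokens, utilization, and every other cost, so treat it as a rough illustration rather than a result from the post.

```python
def hourly_margin(tokens_per_hour: float,
                  price_per_million: float,
                  gpu_cost_per_hour: float) -> tuple[float, float]:
    """Return (hourly output value, value minus GPU rental) in dollars."""
    revenue = tokens_per_hour / 1_000_000 * price_per_million
    return revenue, revenue - gpu_cost_per_hour


# Figures quoted in the margin estimate: ~5M tokens/h/GPU, $0.87 per 1M tokens, ~$2.42/h spot.
revenue, margin = hourly_margin(5_000_000, 0.87, 2.42)
print(f"output value: ${revenue:.2f}/h, minus GPU rental: ${margin:.2f}/h")
```

Five million tokens at $0.87 per million comes to $4.35, which matches the post's "about $4.4", and leaves a little under $2 per GPU-hour over the quoted spot price.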

The official source is still better on the qualitative point than the exact margin math. DeepSeek's own live-traffic figure, captured in teortaxesTex's paper economics screenshot, shows DSpark shifting the throughput versus interactivity frontier, which is the direct systems claim behind all the cost speculation.

Training and deployment gotchas

Some of the most useful details are buried in side notes, not the headline chart.

First, DSpark is not free in memory. teortaxesTex's memory overhead note says the extra module adds about 7 GB for Flash and 27 GB for Pro.

Second, the public experiments were trained on prompt-only data regenerated by each target model. maximelabonne's note spotted that the paper uses the Open-PerfectBlend dataset, and the screenshot attached to that post breaks the mix down as 39.4% math, 38.9% code, 17.6% chat, and 4.1% instruction-following.

Third, the training framework is doing systems work that most open speculative-decoding demos skip. The scheduler and training screenshots in teortaxesTex's scheduler details mention hidden-state communication to avoid shipping full-vocabulary logits between workers, plus anchor-bounded sequence packing to keep drafter cost decoupled from target context length. Those choices point back to High-Flyer's HAI-LLM training framework post, which the paper cites directly.
