DeepSeek cuts V4-Pro API 75% to $0.43/$0.87 per 1M tokens through May 5
DeepSeek lowered V4-Pro API pricing and updated integration guidance for Claude Code, OpenCode, and OpenClaw a day after V4 launched. Early reads suggest V4-Flash is the easier deploy today, while Pro remains heavier and more rate-limited.

TL;DR
- DeepSeek's discount post cut DeepSeek-V4-Pro pricing by 75% through May 5, dropping cache-hit input to $0.03625, cache-miss input to $0.435, and output to $0.87 per 1M tokens.
- According to DeepSeek's integration note, the discounted Pro model also picked up explicit setup guidance for Claude Code, OpenCode, and OpenClaw, including a deepseek-v4-pro[1m] model string for Claude Code.
- Dillon Uzar's Context Arena run put V4 Pro close to GPT-5.4 medium at 128K context, but the same evaluation had Pro and Flash trailing GPT-5.4 and Gemini 3 Flash by 512K.
- The main HN thread and the fresh HN delta both described the practical split the launch created: V4-Flash looked attractive on speed and cost, while V4-Pro drew more complaints about rate limits and timeouts.
- LMSYS's day-zero support thread says the engineering push behind V4 was not just model scale, but a new long-context stack spanning hybrid sparse attention, CPU-extended KV, FP4 experts, and multi-hardware deployment.
Pricing and compatibility
You can browse the official pricing and model table, check DeepSeek's launch note for V4, and read LMSYS's systems deep dive. The odd bit is how much of the real rollout story lives in setup details: an Anthropic-format endpoint, renamed legacy model slugs, and explicit client version floors for agent tools.
The discount is straightforward. DeepSeek's post advertised a 75% cut until May 5, and the pricing table screenshot showed the discounted V4-Pro rates alongside the undiscounted yuan-denominated list price.
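As a back-of-envelope check, reversing the advertised 75% cut gives the implied post-promotion rates. This assumes the discount applies uniformly to all three line items, which the promo material does not spell out:

```python
# Reverse the advertised 75% discount to estimate the undiscounted USD rates.
# Assumes the cut applies uniformly to all three line items (an assumption).
DISCOUNT = 0.75

discounted_per_1m = {            # USD per 1M tokens, from the promo table
    "input_cache_hit": 0.03625,
    "input_cache_miss": 0.435,
    "output": 0.87,
}

for name, price in discounted_per_1m.items():
    implied_list = price / (1 - DISCOUNT)   # e.g. 0.87 / 0.25 = 3.48
    print(f"{name}: ${price} now, ~${implied_list:.3f} implied after May 5")
```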
The compatibility story is more interesting than the promo art. According to the model details table, V4 Pro is exposed through an Anthropic-format base URL at https://api.deepseek.com/anthropic, while DeepSeek's post called out three specific client updates:
- Claude Code: set model to deepseek-v4-pro[1m]
- OpenCode: update to v1.14.24+
- OpenClaw: update to v2026.4.24+
The same table also says both V4 models expose a 1M context window, while V4 Pro raises the maximum output length to 384K. The footnote matters for anyone with old integrations: deepseek-chat and deepseek-reasoner are being deprecated and mapped to V4 Flash's non-reasoning and reasoning modes for compatibility.
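The practical upshot of the Anthropic-format endpoint is that any Anthropic-compatible client can be pointed at DeepSeek's base URL. A minimal sketch with the anthropic Python SDK, where the base URL and model string come from the table above and everything else (key handling, prompt) is illustrative:

```python
# Minimal sketch: calling V4 Pro through the Anthropic-format endpoint.
# Base URL and model string come from DeepSeek's table; the rest is illustrative.
import os
from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.deepseek.com/anthropic",  # Anthropic-compatible endpoint
    api_key=os.environ["DEEPSEEK_API_KEY"],         # a DeepSeek key, not an Anthropic one
)

resp = client.messages.create(
    model="deepseek-v4-pro[1m]",  # same string Claude Code is told to use
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the build steps in this repo."}],
)
print(resp.content[0].text)
```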
Benchmarks and token economics
Dillon Uzar's benchmark thread is the cleanest external read on long-context behavior in the evidence pool. On GDM-MRCRv2 at 128K, V4 Pro reasoning scored 68.2% AUC versus GPT-5.4 medium at 69.2%, while V4 Flash reasoning landed at 62.6%.
That gap widened by 512K. In the same Context Arena run, V4 Pro dropped to 28.3% AUC and V4 Flash to 25.4%, behind GPT-5.4 medium at 36.9% and Gemini 3 Flash high at 35.8%.
The token bill is the catch. Hangsiin's KSST table showed V4 Pro averaging 71,789 tokens and V4 Flash 66,978 tokens on that test, and the cross-model comparison placed V4 Pro below several U.S. frontier systems while using far more tokens than Gemini 3 Flash high.
teortaxesTex's pricing comment framed the discounted $0.43 and $0.87 per 1M token rates as close to breakeven, which matches the broader pattern in the KSST results: the models can score well, but they do it with very heavy token consumption.
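To put rough numbers on that, the discounted rates against the KSST token averages land in the cents-per-run range. The KSST table does not break out input versus output tokens, so the sketch below brackets the cost between an all-input and an all-output assumption:

```python
# Rough per-run cost at the discounted rates. The input/output token split
# is not given in the KSST table, so bracket between the two extremes.
RATES = {"input_cache_miss": 0.435, "output": 0.87}  # USD per 1M tokens

def run_cost(total_tokens: int, output_share: float) -> float:
    """Estimate USD cost of one run under an assumed output-token share."""
    out = total_tokens * output_share
    inp = total_tokens - out
    return (inp * RATES["input_cache_miss"] + out * RATES["output"]) / 1_000_000

for model, tokens in {"V4 Pro": 71_789, "V4 Flash": 66_978}.items():
    low, high = run_cost(tokens, 0.0), run_cost(tokens, 1.0)
    print(f"{model}: ${low:.3f}-${high:.3f} per KSST run")
```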
Flash and Pro in real workflows
The community read split quickly into two products. According to the main HN thread, V4 looked stronger in coding and agent workflows than its headline benchmark positioning suggested, especially when people compared Flash's speed and price against Pro's slower, rate-limited behavior.
The most repeated points in the HN thread and the fresh delta were:
- Flash looked like the easier deploy for day-one use
- Pro was harder to evaluate because of rate limits and timeouts
- Some harnesses needed special prompting or a reasoning_content field (see the sketch after this list)
- Anecdotal coding performance looked better than simple reasoning tests implied
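On the reasoning_content point, here is a hedged sketch of what harness authors reportedly had to handle. DeepSeek's existing reasoner API returns the thinking trace in a separate reasoning_content field alongside the usual content in an OpenAI-compatible response; whether V4's reasoning modes keep exactly that shape is an assumption here:

```python
# Sketch: reading a response that carries a separate reasoning_content field.
# This mirrors DeepSeek's existing reasoner API shape; V4's exact field layout
# is an assumption, not confirmed by the launch material.
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key=os.environ["DEEPSEEK_API_KEY"])

resp = client.chat.completions.create(
    model="deepseek-reasoner",  # legacy slug, mapped to V4 Flash reasoning per the footnote
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

msg = resp.choices[0].message
thinking = getattr(msg, "reasoning_content", None)  # only present on reasoning models
if thinking:
    print("--- reasoning trace ---")
    print(thinking)
print("--- final answer ---")
print(msg.content)
```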
One small but concrete example came from Niall O'Higgins's OpenClaw post, which said V4 Pro one-shotted a custom VM management app. That lines up with the HN delta, where one commenter said V4 handled a difficult refactoring task better than Kimi 2.6 and GLM 5.1 in their harness, even as another pointed to failures on simpler arithmetic and counting tasks.
The 1M-context systems stack
LMSYS's thread is the best explanation for why DeepSeek can advertise 1M context without telling a pure scaling story. It listed a full systems stack behind V4 Pro and V4 Flash, then backed it with throughput numbers of 199 tok/s on B200 for Pro, 266 tok/s on H200 for Flash, and 180 tok/s and 240 tok/s respectively even at 900K context.
The LMSYS breakdown splits the stack into four layers:
- Caching and attention: ShadowRadix prefix cache, HiSparse CPU-extended KV, MTP speculative decoding, Flash Compressor, Lightning TopK, hierarchical multi-stream overlap.
- Kernels and deployment: FlashMLA, FlashInfer TRTLLM-Gen MoE, DeepGEMM Mega MoE, TileLang mHC, DP/TP/CP attention, DeepEP MoE, PD disaggregation.
- Training: DP, TP, SP, EP, PP, and CP parallelism, plus TileLang attention, stability work, and FP8 training.
- Hardware targets: NVIDIA Hopper, Blackwell, Grace Blackwell, AMD, and NPU support.
The charted model table added another useful detail. Its long-context graphs claimed V4 Pro cuts single-token FLOPS by 9.8x versus V3.2 at 1024K, while the accumulated KV cache shrinks by 13.7x, which is the kind of systems gain you need before a 1M window becomes more than brochure text.
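To make that concrete, a back-of-envelope KV-cache estimate. The per-token baseline below is entirely hypothetical; only the 13.7x ratio and the 1,024K window come from the charted table:

```python
# Back-of-envelope: why a ~13.7x KV-cache reduction matters at 1M context.
# The 64 KB/token baseline is hypothetical; only the ratio and window size
# come from the charted table.
CONTEXT_TOKENS = 1_024_000
BASELINE_KV_BYTES_PER_TOKEN = 64 * 1024   # hypothetical V3.2-class cache size
REDUCTION = 13.7

baseline_gb = CONTEXT_TOKENS * BASELINE_KV_BYTES_PER_TOKEN / 1024**3
reduced_gb = baseline_gb / REDUCTION
print(f"~{baseline_gb:.0f} GB of KV cache at 1M tokens shrinks to ~{reduced_gb:.1f} GB")
```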
Ascend and the China hardware angle
The most unusual part of the rollout is that V4 did not stay in the usual NVIDIA-only framing. The Ascend support thread claimed day-zero compatibility across Ascend A2, A3, and 950 hardware, plus a CPT adaptation path for V4 Flash on A3 64-card clusters.
Some of the numbers in that thread are aggressive enough to deserve cautious reading, but they are specific: about 20ms low-latency inference for V4 Pro on Ascend 950, about 10ms for V4 Flash, and 2000+ TPS single-card decode throughput for V4 Flash under 8K input on Ascend A3 SuperPod hardware.
teortaxesTex's follow-up pushed the hardware story further by noting that Huawei's Ascend 950DT, the variant described as optimized for decode and training, had volume production scheduled for late 2026. If that reading is right, DeepSeek is not only shipping another open model family, but helping surface one of the earliest real-world training and inference customers for a relatively modern general-purpose Chinese NPU stack.