DeepSeek releases V4-Pro and V4-Flash with 1M context and $0.14/M input
DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.

TL;DR
- deepseek_ai's launch thread introduced two MIT-licensed open-weight models, DeepSeek-V4-Pro at 1.6T total and 49B active parameters, plus DeepSeek-V4-Flash at 284B total and 13B active parameters, both with a 1M token context window.
- According to deepseek_ai's API update, DeepSeek kept the base URL, added deepseek-v4-pro and deepseek-v4-flash, exposed both through OpenAI ChatCompletions and Anthropic-compatible endpoints, and set a July 24, 2026 deprecation date for deepseek-chat and deepseek-reasoner.
- scaling01's pricing screenshot and the DeepSeek API pricing page both show Flash at $0.14 per 1M input tokens on cache miss and $0.28 output, while Pro lands at $1.74 input and $3.48 output.
- On early third-party evals, ValsAI's Vibe Code Benchmark post put DeepSeek V4 at the top of its open-weight coding board, while arena's leaderboard post placed V4 Pro at #3 among open models in Code Arena and #2 among open models in Text Arena.
- Day-one inference support showed up fast: lmsysorg's SGLang post shipped serving and RL tooling, while vllm_project's implementation thread published a first-principles breakdown of how V4's long-context attention fits into vLLM.
You can read the full technical report, check the updated API docs, browse the Hugging Face collection, and then go straight to the implementation notes from vLLM or the SGLang cookbook. AiBattle_'s early API sighting and rollback note also show that parts of this rollout leaked into production before the official announcement landed.
What shipped
DeepSeek shipped two preview models across web, app, API, and open weights. Pro is the large model, Flash is the cheap one, and both share the same headline feature set.
- DeepSeek-V4-Pro: 1.6T total parameters, 49B active, 1M context, Expert mode on web and app DeepSeek's launch thread
- DeepSeek-V4-Flash: 284B total parameters, 13B active, 1M context, Instant mode on web and app DeepSeek's launch thread
- Both support thinking and non-thinking modes through the API DeepSeek's API update
- Both expose JSON output, tool calls, chat prefix completion, and FIM completion in non-thinking mode, per teortaxesTex's docs screenshot
- The old deepseek-chat and deepseek-reasoner names now route to Flash compatibility modes and retire on July 24, 2026, according to DeepSeek's API update
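That deprecation window can be handled mechanically on the client side. Below is a hypothetical migration helper, a sketch only: the alias-to-replacement mapping and the cutoff date come from the announcement, but resolving names like this is my own assumption, not a DeepSeek-documented pattern.

```python
from datetime import date

# Hypothetical migration helper (not part of any DeepSeek SDK).
# Per the API update: deepseek-chat and deepseek-reasoner route to
# Flash compatibility modes until they retire on July 24, 2026.
LEGACY_ALIASES = {
    "deepseek-chat": "deepseek-v4-flash",      # non-thinking replacement
    "deepseek-reasoner": "deepseek-v4-flash",  # thinking replacement
}
RETIREMENT = date(2026, 7, 24)

def resolve_model(name: str, today: date) -> str:
    """Return the V4 replacement once a legacy alias has retired."""
    if name in LEGACY_ALIASES and today >= RETIREMENT:
        return LEGACY_ALIASES[name]
    return name

print(resolve_model("deepseek-reasoner", date(2026, 8, 1)))  # deepseek-v4-flash
print(resolve_model("deepseek-chat", date(2026, 1, 1)))      # deepseek-chat
```

Since the base URL is unchanged, an OpenAI-compatible client only needs the resolved model string swapped in.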
The release also landed under an MIT license on Hugging Face, which _akhaliq's Hugging Face screenshot and the collection page both surfaced within minutes.
Benchmarks and price shape
DeepSeek's own chart in the technical report positions V4-Pro-Max near Claude Opus 4.6 Max, GPT-5.4 xHigh, and Gemini 3.1 Pro High on a mixed slate of reasoning and agentic benchmarks. The outside signal is narrower but more useful.
- Arena put V4 Pro Thinking at #3 among open models in Code Arena, and V4 Flash Thinking at #10 among open models in Text Arena arena's leaderboard post
- arena's Pareto chart showed V4 Flash Thinking on the text cost-performance frontier at roughly $0.25 blended per 1M tokens
- ValsAI said V4 became the #1 open-weight model on Vibe Code Benchmark, with ValsAI's follow-up noting V3.2 had scored around 5% on the same benchmark versus just under 50% for V4
- Artificial Analysis' GDPval-AA post placed V4 Pro at the top of its open-weights leaderboard on agentic real-world work tasks, ahead of GLM-5.1, MiniMax-M2.7, and Kimi K2.6
Pricing is the part that forces the ecosystem to take this seriously.
- Flash: $0.028 cache-hit input, $0.14 cache-miss input, $0.28 output per 1M tokens scaling01's pricing screenshot
- Pro: $0.145 cache-hit input, $1.74 cache-miss input, $3.48 output per 1M tokens scaling01's pricing screenshot
- Both models list a 384K max output length in the API docs screenshot that teortaxesTex posted
That makes Flash the obvious volume model. Pro looks more like a frontier-adjacent reasoning SKU that DeepSeek priced aggressively but not carelessly.
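The gap is easiest to feel with arithmetic. A minimal cost sketch, where the per-1M rates come from scaling01's pricing screenshot and the workload mix is invented for illustration:

```python
# USD per 1M tokens: (cache-hit input, cache-miss input, output),
# from scaling01's pricing screenshot.
PRICES = {
    "deepseek-v4-flash": (0.028, 0.14, 0.28),
    "deepseek-v4-pro":   (0.145, 1.74, 3.48),
}

def cost_usd(model: str, hit_tokens: float, miss_tokens: float,
             out_tokens: float) -> float:
    """Total USD for a workload, with tokens given as raw counts."""
    hit, miss, out = PRICES[model]
    return (hit_tokens * hit + miss_tokens * miss + out_tokens * out) / 1e6

# Invented workload: 10M cached input, 5M uncached input, 2M output.
flash = cost_usd("deepseek-v4-flash", 10e6, 5e6, 2e6)
pro   = cost_usd("deepseek-v4-pro",   10e6, 5e6, 2e6)
print(f"Flash ${flash:.2f} vs Pro ${pro:.2f}")  # Pro is ~11x Flash on this mix
```

On that mix Flash comes out around $1.54 and Pro around $17.11, which is the shape that makes Flash the default volume choice.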
Long-context architecture
The architecture change is the real story. DeepSeek says V4 moves to a hybrid attention stack built from Compressed Sparse Attention and Heavily Compressed Attention, plus mHC and Muon, all tuned around making 1M context usable instead of merely claimable.
According to the technical report and scaling01's report screenshot, the key claims are:
- CSA and HCA compress KV state for long-range retrieval.
- mHC strengthens residual-style signal propagation.
- Muon remains the optimizer backbone.
- At 1M context, V4-Pro needs 27% of V3.2's single-token inference FLOPs and 10% of its KV cache.
vLLM's implementation writeup unpacked that into serving mechanics:
- shared K and V with inverse RoPE for about 2x memory savings vLLM's architecture thread
- c4a and c128a KV compression for 4x to 128x savings vLLM's architecture thread
- sparse attention over compressed tokens plus a short sliding window for locality vLLM's architecture thread
- a per-layer KV state that vLLM said was about 8.7x smaller than a V3.2-style stack at 1M context vLLM's architecture thread
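These levers multiply, which is easy to sanity-check with rough arithmetic. The sketch below assumes a toy baseline layout (128-dim heads, fp16, separate dense K and V streams); the head dimension and dtype are illustrative assumptions, not figures from the report or the vLLM thread.

```python
def kv_bytes_per_layer(tokens: int, head_dim: int = 128, dtype_bytes: int = 2,
                       shared_kv: bool = False, compression: int = 1) -> float:
    """Per-layer, per-head KV-cache bytes under the two claimed levers."""
    streams = 1 if shared_kv else 2   # shared K/V collapses two streams into one
    kept = tokens / compression       # c4a/c128a keep 1/4 .. 1/128 of positions
    return kept * head_dim * dtype_bytes * streams

baseline = kv_bytes_per_layer(1_000_000)                                 # dense K+V
v4ish    = kv_bytes_per_layer(1_000_000, shared_kv=True, compression=4)  # 2x * 4x
print(f"reduction: {baseline / v4ish:.1f}x")  # reduction: 8.0x
```

With c128a instead of c4a the same arithmetic gives 256x, so the claimed 2x and 4x-to-128x levers comfortably bracket the roughly 8.7x per-layer figure vLLM quoted.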
The practitioner thread from stochasticchasm's CSA post and the comparison diagrams from eliebakouch's CSA versus NSA post both read this as a new branch of DeepSeek's earlier sparse-attention work, not a cosmetic tweak.
Day-one ecosystem support
The rollout was unusually complete for an open model drop. DeepSeek released weights, updated the API, and got multiple inference stacks live on the same day.
- SGLang shipped native serving support, ShadowRadix prefix caching for the hybrid attention layout, HiSparse KV offload, and a verified Miles RL training pipeline lmsysorg's SGLang launch post
- vLLM published day-zero support plus recipes for Hopper and Blackwell deployments vLLM's architecture thread
- OpenRouter listed both Pro and Flash immediately, with model pages at Pro and Flash OpenRouter's availability post
- Ollama said it was working to bring Pro and Flash to its cloud offering ollama's cloud support post
- DeepSeek itself claimed official integration work with Claude Code, OpenClaw, and OpenCode DeepSeek's agent-optimization post
The SGLang diagram from lmsysorg's SGLang launch post is especially revealing because it turns the paper into an infra inventory: ShadowRadix, HiSparse, Flash Compressor, Lightning TopK, MegaMoE, multi-platform support, and a full DP/TP/CP plus EP stack are all already wired into the serving story.
Preview status and what is still constrained
DeepSeek keeps calling this a preview release, and the paper carries more caveats than the benchmark charts suggest.
- The official announcement labels the models as preview versions DeepSeek's official reminder thread
- DeepSeek says V4-Pro throughput is currently limited by high-end compute constraints, with teortaxesTex's small-print screenshot quoting a note that Pro pricing should drop after Ascend 950 super nodes reach mass release in H2 2026
- The paper's future-work section says DeepSeek plans to simplify the architecture, study training-stability tricks like Anticipatory Routing and SwiGLU Clamping more systematically, and add multimodal capability later AiBattle_'s multimodal excerpt
- teortaxesTex's conclusion screenshot highlighted the report's own admission that V4 kept many validated tricks, which made the architecture "relatively complex"
That last bit is the cleanest summary of the whole drop. DeepSeek did not ship an elegant open model. It shipped a very engineered one, with enough price, context, and ecosystem support to make the mess everybody else's problem too.