AI Primer
release

DeepSeek releases V4-Pro and V4-Flash with 1M context and $0.14/M input

DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.

6 min read

TL;DR

  • deepseek_ai's launch thread introduced two MIT-licensed open-weight models, DeepSeek-V4-Pro at 1.6T total and 49B active parameters, plus DeepSeek-V4-Flash at 284B total and 13B active parameters, both with a 1M token context window.
  • According to deepseek_ai's API update, DeepSeek kept the base URL, added deepseek-v4-pro and deepseek-v4-flash, exposed both through OpenAI ChatCompletions and Anthropic-compatible endpoints, and set a July 24 deprecation date for deepseek-chat and deepseek-reasoner.
  • scaling01's pricing screenshot and the DeepSeek API pricing page both show Flash at $0.14 per 1M input tokens on cache miss and $0.28 output, while Pro lands at $1.74 input and $3.48 output.
  • On early third-party evals, ValsAI's Vibe Code Benchmark post put DeepSeek V4 at the top of its open-weight coding board, while arena's leaderboard post placed V4 Pro at #3 among open models in Code Arena and #2 among open models in Text Arena.
  • Day-one inference support showed up fast: lmsysorg's SGLang post shipped serving and RL tooling, while vllm_project's implementation thread published a first-principles breakdown of how V4's long-context attention fits into vLLM.

You can read the full technical report, check the updated API docs, browse the Hugging Face collection, and then go straight to the implementation notes from vLLM or the SGLang cookbook. AiBattle_'s early API sighting and AiBattle_'s rollback note also show that parts of this rollout leaked into production before the official announcement landed.

What shipped

DeepSeek shipped two preview models across web, app, API, and open weights. Pro is the large model, Flash is the cheap one, and both share the same headline feature set.

  • DeepSeek-V4-Pro: 1.6T total parameters, 49B active, 1M context, Expert mode on web and app (DeepSeek's launch thread)
  • DeepSeek-V4-Flash: 284B total parameters, 13B active, 1M context, Instant mode on web and app (DeepSeek's launch thread)
  • Both support thinking and non-thinking modes through the API (DeepSeek's API update)
  • Both expose JSON output, tool calls, chat prefix completion, and FIM completion in non-thinking mode (teortaxesTex's docs screenshot)
  • The old deepseek-chat and deepseek-reasoner names now route to Flash compatibility modes and retire on July 24, 2026 (DeepSeek's API update)
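Because the new models keep the existing base URL and an OpenAI-compatible ChatCompletions surface, migrating is mostly a matter of swapping the model string. A minimal sketch of the request bodies, assuming standard OpenAI field names throughout; the exact thinking-mode toggle and the FIM endpoint path are not shown in the announcement, so treat those details as placeholders:

```python
# Sketch of request payloads for the updated DeepSeek API, based on the
# announcement's claim of OpenAI ChatCompletions compatibility. Field names
# beyond "model" and "messages" are assumptions; check the official docs.

def chat_payload(model: str, user_message: str) -> dict:
    """Build a minimal OpenAI-style ChatCompletions request body."""
    return {
        "model": model,  # "deepseek-v4-pro" or "deepseek-v4-flash"
        "messages": [{"role": "user", "content": user_message}],
    }

def fim_payload(model: str, prefix: str, suffix: str) -> dict:
    """Build a fill-in-the-middle request body (non-thinking mode only,
    per the docs screenshot). "prompt"/"suffix" follow the OpenAI
    completions convention and are an assumption here."""
    return {"model": model, "prompt": prefix, "suffix": suffix}

body = chat_payload("deepseek-v4-flash", "Summarize this diff.")
fim = fim_payload("deepseek-v4-flash", "def add(a, b):\n", "\n    return out")
```

The payloads would be POSTed to the unchanged base URL; per the API update, requests still addressed to deepseek-chat or deepseek-reasoner are silently served by Flash compatibility modes until the July 24, 2026 retirement.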

The release also landed under an MIT license on Hugging Face, which _akhaliq's Hugging Face screenshot and the collection page both surfaced within minutes.

Benchmarks and price shape

DeepSeek's own chart in the technical report positions V4-Pro-Max near Claude Opus 4.6 Max, GPT-5.4 xHigh, and Gemini 3.1 Pro High on a mixed slate of reasoning and agentic benchmarks. The outside signal is narrower but more useful.

  • Arena put V4 Pro Thinking at #3 among open models in Code Arena, and V4 Flash Thinking at #10 among open models in Text Arena (arena's leaderboard post)
  • arena's Pareto chart showed V4 Flash Thinking on the text cost-performance frontier at roughly $0.25 blended per 1M tokens
  • ValsAI said V4 became the #1 open-weight model on Vibe Code Benchmark, with ValsAI's follow-up noting that V3.2 had scored around 5% on the same benchmark versus just under 50% for V4
  • Artificial Analysis' GDPval-AA post placed V4 Pro at the top of its open-weights leaderboard on agentic real-world work tasks, ahead of GLM-5.1, MiniMax-M2.7, and Kimi K2.6

Pricing is the part that forces the ecosystem to take this seriously: Flash costs $0.14 per 1M input tokens on cache miss and $0.28 per 1M output, while Pro lands at $1.74 input and $3.48 output.

That makes Flash the obvious volume model. Pro looks more like a frontier-adjacent reasoning SKU that DeepSeek priced aggressively but not carelessly.
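Those list prices make the gap easy to quantify. A back-of-the-envelope calculator using the cache-miss rates above; real bills will differ once cache hits and any discounts apply:

```python
# Cache-miss list prices in USD per 1M tokens, from the pricing screenshots.
PRICES = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
}

def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost for a given token volume at cache-miss list prices."""
    p = PRICES[model]
    return input_tokens / 1e6 * p["input"] + output_tokens / 1e6 * p["output"]

# Example workload: 1B input tokens + 200M output tokens.
flash = workload_cost("deepseek-v4-flash", 1_000_000_000, 200_000_000)  # $196.00
pro = workload_cost("deepseek-v4-pro", 1_000_000_000, 200_000_000)      # $2436.00
```

At list prices, the same traffic costs roughly 12x more on Pro than on Flash, which is the arithmetic behind treating Flash as the volume model.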

Long-context architecture

The architecture change is the real story. DeepSeek says V4 moves to a hybrid attention stack built from Compressed Sparse Attention and Heavily Compressed Attention, plus mHC and Muon, all tuned around making 1M context usable instead of merely claimable.

According to the technical report and scaling01's report screenshot, the key claims are:

  1. CSA and HCA compress KV state for long-range retrieval.
  2. mHC strengthens residual-style signal propagation.
  3. Muon remains the optimizer backbone.
  4. At 1M context, V4-Pro needs 27% of V3.2's single-token inference FLOPs and 10% of its KV cache.
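The report does not publish pseudocode, but the intuition behind the KV-cache claim can be shown with a toy: compress the cached key/value history into pooled blocks so that long-range attention reads a much smaller cache. This is an illustrative stand-in using static mean-pooling, not DeepSeek's actual CSA/HCA design, which is presumably learned and far more selective:

```python
import math

def mean_pool(vectors, block):
    """Compress a list of d-dim vectors by averaging each run of `block` vectors."""
    pooled = []
    for i in range(0, len(vectors), block):
        chunk = vectors[i:i + block]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

def attention(query, keys, values):
    """Scaled dot-product attention for one query (pure-Python toy)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(d)]

# 1,024 cached positions compressed 8x: long-range attention now reads 128
# entries, so both KV memory and per-token attention FLOPs shrink by ~8x.
d, seq, block = 4, 1024, 8
keys = [[(i * j) % 7 / 7.0 for j in range(d)] for i in range(seq)]
values = [[(i + j) % 5 / 5.0 for j in range(d)] for i in range(seq)]
ck, cv = mean_pool(keys, block), mean_pool(values, block)
out = attention([0.1, 0.2, 0.3, 0.4], ck, cv)
```

The point is only mechanical: once the cache is compressed, memory and attention compute fall together, which is the shape of the "10% of V3.2's KV cache" and "27% of its FLOPs" claims at 1M context.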

vLLM's implementation writeup unpacked those claims into serving mechanics.

The practitioner thread from stochasticchasm's CSA post and the comparison diagrams from eliebakouch's CSA versus NSA post both read this as a new branch of DeepSeek's earlier sparse-attention work, not a cosmetic tweak.

Day-one ecosystem support

The rollout was unusually complete for an open model drop. DeepSeek released weights, updated the API, and got multiple inference stacks live on the same day.

The SGLang diagram from lmsysorg's SGLang launch post is especially revealing because it turns the paper into an infra inventory: ShadowRadix, HiSparse, Flash Compressor, Lightning TopK, MegaMoE, multi-platform support, and a full data-, tensor-, and context-parallel plus expert-parallel (DP/TP/CP + EP) stack are all already wired into the serving story.

Preview status and what is still constrained

DeepSeek keeps calling this a preview release, and the paper leaves more caveats than the benchmark charts suggest.

  • The official announcement labels the models as preview versions (DeepSeek's official reminder thread)
  • DeepSeek says V4-Pro throughput is currently limited by high-end compute constraints, with teortaxesTex's small-print screenshot quoting a note that Pro pricing should drop after Ascend 950 super nodes reach mass release in H2 2026
  • The paper's future-work section says DeepSeek plans to simplify the architecture, study training-stability tricks like Anticipatory Routing and SwiGLU Clamping more systematically, and add multimodal capability later (AiBattle_'s multimodal excerpt)
  • teortaxesTex's conclusion screenshot highlighted the report's own admission that V4 kept many validated tricks, which made the architecture "relatively complex"

That last bit is the cleanest summary of the whole drop. DeepSeek did not ship an elegant open model. It shipped a very engineered one, with enough price, context, and ecosystem support to make the mess everybody else's problem too.
