AI Primer
update

DeepSeek cuts input cache-hit price 90% to $0.003625 per 1M tokens

DeepSeek said cache-hit pricing across its API series is now one-tenth of launch levels, on top of the temporary V4-Pro discount through May 5. The cut lowers costs for cache-heavy long-context and agent workloads, so teams should recheck spend assumptions.

5 min read

TL;DR

You can check DeepSeek's announcement page, skim the live Hacker News thread, and jump straight to practitioner comments on benchmark cost/performance, Flash vs Pro reliability, and active parameter efficiency.

Input cache pricing

The concrete change is simple: cache-hit input pricing across the series drops to 10 percent of launch levels, according to deepseek_ai's post. AiBattle_'s comparison graphic carries the same footnote in product terms: the older deepseek-chat and deepseek-reasoner names will eventually map to the non-thinking and thinking modes of deepseek-v4-flash.

For the currently advertised V4 models, the table in DeepSeek's graphic breaks out like this:

  • DeepSeek-V4-Pro, input cache hit: $0.003625 per 1M tokens during the promo, down from $0.0145 and far below the $0.145 launch reference.
  • DeepSeek-V4-Pro, input cache miss: $0.435 per 1M tokens during the promo, down from $1.74.
  • DeepSeek-V4-Pro, output: $0.87 per 1M tokens during the promo, down from $3.48.
  • DeepSeek-V4-Flash, input cache hit: $0.0028 per 1M tokens, down from $0.028.
  • DeepSeek-V4-Flash, input cache miss: $0.14 per 1M tokens.
  • DeepSeek-V4-Flash, output: $0.28 per 1M tokens.
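To see what the table means for a single request, here is a back-of-envelope cost sketch using the promo and pre-promo V4-Pro rates quoted above. The 200k/4k/1k token split is an illustrative assumption, not from the announcement:

```python
# USD per 1M tokens, from DeepSeek's pricing graphic.
PRO_PROMO = {"cache_hit": 0.003625, "cache_miss": 0.435, "output": 0.87}
PRO_PREV  = {"cache_hit": 0.0145,   "cache_miss": 1.74,  "output": 3.48}

def request_cost(rates, cache_hit_tok, cache_miss_tok, output_tok):
    """Cost in USD for one request, given per-1M-token rates."""
    return (cache_hit_tok * rates["cache_hit"]
            + cache_miss_tok * rates["cache_miss"]
            + output_tok * rates["output"]) / 1_000_000

# A hypothetical long-context agent step: 200k cached prefix,
# 4k fresh input, 1k output.
promo = request_cost(PRO_PROMO, 200_000, 4_000, 1_000)
prev  = request_cost(PRO_PREV,  200_000, 4_000, 1_000)
print(f"promo: ${promo:.6f}  previous: ${prev:.6f}")
```

Even with most tokens coming from cache, the miss and output lines still dominate this particular split; the cache-hit cut matters most when the cached prefix is far larger than the fresh tokens.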

The temporary part is the V4-Pro 75 percent promotion. The cache-hit cut is the permanent part, as victor207755822's post spells out.

Cache-heavy agent runs

The interesting bit is not the headline number. It is how quickly cache-hit tokens dominate long runs once an agent keeps reusing a swollen context.

In the usage snapshot from teortaxesTex, 6,199,808 of 6,612,473 total tokens were cache hits. That post prices the cache-hit portion at roughly $0.0224 after the cut, versus about $0.22 under the prior cache-hit price, and says the old bill would have been more than half the total run cost.
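The snapshot's arithmetic is easy to re-run. Using the token counts from the post and the promo cache-hit rate quoted above:

```python
# Token counts from teortaxesTex's usage snapshot.
total_tokens = 6_612_473
cache_hit_tokens = 6_199_808

hit_share = cache_hit_tokens / total_tokens          # fraction of run served from cache
hit_cost = cache_hit_tokens * 0.003625 / 1_000_000   # USD at the promo V4-Pro rate

print(f"cache-hit share: {hit_share:.1%}")
print(f"cache-hit cost:  ${hit_cost:.4f}")
```

Roughly 93.8 percent of the run's tokens were cache hits, and at the new rate that entire portion costs about two cents.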

That lines up with the broader reaction in scaling01's post, which fixates on the new $0.003625 figure rather than the already-discounted miss and output rates. For agent workloads that repeatedly resend history, the cheap part of the bill just got much cheaper.

Flash and Pro in practice

Hacker News · DeepSeek v4 · 2.1k upvotes · 1.6k comments

The strongest community signal is that V4-Flash looks like the easy model to love, while V4-Pro still carries serving friction. The summary in the main HN thread says commenters are reporting strong cost-performance and fast agentic behavior from Flash, while Pro is harder to serve and test because of rate limits and timeouts.

The thread's linked comments add the concrete examples:

  • In cmitsakis's benchmark comment, V4-Flash is described as scoring 90.2 percent on a customer-support benchmark, ahead of Qwen3.5-27B and roughly level with Gemini-3-Flash-Preview at much lower cost.
  • In gertlabs's hands-on comment, V4-Flash is called cheap, effective, and really fast, while V4-Pro is described as slow, unreliable, and too rate-limited to be very useful.
  • In wolttam's coding comment, a messy refactoring run reportedly ended with a significantly nicer codebase.

Hacker News · Fresh discussion on DeepSeek v4 · 2.1k upvotes · 1.6k comments

The fresher roundup in impact_sy's delta report says the new comments mostly reinforce that same split. Flash is emerging as the practical deployment choice, while Pro is still the model people want to benchmark more than the one they want to sit behind a harness all day.

Cost structure behind the cut

One reason the price move looks less like a one-off promo is the infrastructure claim attached to it. In the interview excerpt captured by teortaxesTex, Liang Wenfeng says the company's pricing principle is not subsidy and not excessive margin, adding that the lower API price keeps a slight profit margin on top of cost.

The second screenshot in the same post ties that to V4's serving profile. It says DeepSeek-V4-Pro reaches 27 percent of V3.2's single-token FLOPs and 10 percent of its KV-cache size, while V4-Flash drops to 10 percent of the FLOPs and 7 percent of the KV-cache size.
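Those two ratios can be folded into a rough per-token serving index relative to V3.2. The 50/50 weighting of FLOPs and KV-cache is my own assumption; the real cost split depends on hardware and batch size:

```python
# FLOPs and KV-cache ratios vs V3.2 (= 1.0), from the screenshot.
pro   = {"flops": 0.27, "kv_cache": 0.10}
flash = {"flops": 0.10, "kv_cache": 0.07}

# Assumed 50/50 blend of compute and memory into one serving index.
pro_index   = 0.5 * pro["flops"]   + 0.5 * pro["kv_cache"]
flash_index = 0.5 * flash["flops"] + 0.5 * flash["kv_cache"]

print(f"V4-Pro:   ~{pro_index:.3f} of V3.2's per-token serving index")
print(f"V4-Flash: ~{flash_index:.3f} of V3.2's per-token serving index")
```

Under that crude blend, Pro serves a token for under a fifth of V3.2's index and Flash for under a tenth, which is the shape of argument behind "the memory bill fell first."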

That is also why the HN thread keeps returning to active parameters and served efficiency. latentframe's comment on active parameters argues that the headline 1.6T parameter count matters less than how few parameters are active in practice. The cheaper cache-hit line item is the visible pricing change, but the more durable story is DeepSeek's argument that the underlying memory bill fell first.
