AI Primer

DeepSeek's family of language models, as exposed through the company's official API and documentation.

Pricing

Official site · May 14, 2026, 6:35 AM
Input / 1M: $0.14
Output / 1M: $0.28
Cached input / 1M: $0.07

The official API pricing page also lists DeepSeek-R1 separately. Batch/API discounts, if any, are not included here.

DeepSeek’s official pricing page publishes token-based API pricing in USD per 1M tokens. The page lists separate rates per model; this record captures pricing for the primary text chat model, DeepSeek-V3 (deepseek-chat).
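
As a sanity check on these figures, here is a minimal sketch of the per-request arithmetic; the rates are the ones in the record above, and the token counts are invented examples:

```python
# Per-1M-token rates for deepseek-chat, as listed in the record above (USD).
RATE_INPUT = 0.14
RATE_OUTPUT = 0.28
RATE_CACHED_INPUT = 0.07

def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """USD cost of one request; cached_tokens is the cache-hit share of the input."""
    fresh = input_tokens - cached_tokens
    return (fresh * RATE_INPUT
            + cached_tokens * RATE_CACHED_INPUT
            + output_tokens * RATE_OUTPUT) / 1_000_000

# Example: a 100K-token prompt with 80% served from cache, plus a 4K-token completion.
print(f"${request_cost(100_000, 4_000, cached_tokens=80_000):.4f}")  # -> $0.0095
```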


Model Intelligence

Arena ranking: 28
Benchmarkable: No
Model level: family
Intelligence Index: 28.1
Coding Index: 28.4
Math Index: 49.7
MMLU Pro: 0.83
GPQA: 0.74
HLE: 0.06
LiveCodeBench: 0.58
SciCode: 0.37
AIME 2025: 0.5
IFBench: 0.38
LCR: 0.45
TerminalBench Hard: 0.24
TAU2: 0.35

Recent stories

15 linked stories
news · PRIMARY · 2026-05-10
Local users report DeepSeek V4 Flash, Qwen 3.6, and Gemma 4 at 40-200 tok/s on Macs and 3090s

Developers posted new local-model measurements for DS4, Qwen 3.6, and Gemma 4: about 40 tok/s on an M3 Ultra, 70+ tok/s on MacBooks with MPS, and 120-200 tok/s for Qwen3.6-27B on a single RTX 3090. The numbers suggest coding-capable local runs are moving from demos toward regular use.
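
For readers who want to reproduce numbers like these, a rough throughput-measurement sketch using the ollama Python client's streaming mode; the model tag is a placeholder, chunk count only approximates token count, and the elapsed time includes prompt processing:

```python
import time
import ollama  # pip install ollama; assumes a local Ollama server is already running

MODEL = "qwen3.6:27b"  # placeholder tag; substitute whatever model you actually pulled

start = time.perf_counter()
chunks = 0
for chunk in ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    stream=True,
):
    chunks += 1  # each streamed chunk is roughly one token for most models
elapsed = time.perf_counter() - start
print(f"~{chunks / elapsed:.1f} tok/s across {chunks} chunks")
```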

release · SECONDARY · 2026-05-03
vLLM 0.20.1 fixes DeepSeek V4 TopK deadlocks and tool-call errors

The vLLM team shipped more than 10 DeepSeek V4 fixes as developers kept posting V4 Pro and Flash results from coding harnesses and local servers. Use the update if serving bugs, cache behavior, or tool-call reliability are blocking cheaper long-context agent runs.

news · PRIMARY · 2026-05-02
Developers report DeepSeek V4 Flash handles 32M-token coding runs for $0.25

Users reported moving long coding sessions from Claude to DeepSeek V4 Flash and seeing tens of millions of tokens cost only cents. Hacker News discussion also leaned toward Flash over Pro for day-to-day use, so teams should test whether the low published prices hold in their own workflows.

release · PRIMARY · 2026-04-30
DeepSeek removes visual-primitives repo after publishing 90-KV vision details

DeepSeek briefly published a paper and threads on point-and-bbox reasoning, about 90 KV entries per 800² image, and RL-trained vision experts, then removed the repo and related mentions. The technique looked like a low-token path to computer use and multimodal reasoning in V4-Flash, but availability and reproducibility are now unclear.

release · PRIMARY · 2026-04-29
DeepSeek releases Vision beta for image understanding in DeepSeek Chat

DeepSeek began rolling out Vision beta as a new image-understanding mode in Chat, and early testers reported fast OCR and strong object recognition. The rollout appears limited or staggered, so watch for broader access and formal docs before relying on it.

release · SECONDARY · 2026-04-27
vLLM 0.20.0 releases TurboQuant 2-bit KV cache, CUDA 13 baseline, and DeepSeek V4 upgrades

vLLM 0.20.0 shipped a new CUDA 13 / PyTorch 2.11 / Transformers v5 baseline, TurboQuant 2-bit KV cache, FA4 MLA defaults, and deeper DeepSeek V4 support. The release changes serving baselines across NVIDIA, AMD, Intel, and ARM-CUDA setups, including 4x KV capacity and a clearer upgrade path for teams already running V4.
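
For teams upgrading, here is a minimal offline-serving sketch using vLLM's stable LLM API; the checkpoint ID is a placeholder, and kv_cache_dtype="fp8" stands in because this summary does not document the exact setting for the TurboQuant 2-bit mode:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4",  # hypothetical repo ID; use the real checkpoint name
    kv_cache_dtype="fp8",             # an existing quantized-KV option; the TurboQuant
                                      # 2-bit mode would presumably be selected similarly
    tensor_parallel_size=8,           # size to your GPU count
)
params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Explain KV-cache quantization in two sentences."], params)
print(outputs[0].outputs[0].text)
```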

workflow · PRIMARY · 2026-04-26
DeepSeek V4 supports Anthropic-compatible routing into Claude Code and Cowork for ~90% lower cost

Independent guides showed DeepSeek V4 running inside Claude Cowork and Claude Code via Anthropic-compatible endpoints, and Ollama added launch commands for Claude-style wrappers. The workflow matters because teams can keep Claude-centered agent UX while sharply lowering model spend, with provider compatibility and setup still the main caveats.
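
A minimal sketch of what such routing looks like with the official anthropic Python SDK pointed at a DeepSeek base URL; the endpoint URL and model name here are assumptions following the pattern the guides describe, not verified values:

```python
import os
from anthropic import Anthropic  # pip install anthropic

client = Anthropic(
    base_url="https://api.deepseek.com/anthropic",  # hypothetical; check DeepSeek's docs
    api_key=os.environ["DEEPSEEK_API_KEY"],
)
msg = client.messages.create(
    model="deepseek-chat",  # assumed name; use whatever the endpoint actually exposes
    max_tokens=512,
    messages=[{"role": "user", "content": "Refactor this function for readability: ..."}],
)
print(msg.content[0].text)
```

For Claude Code itself, the guides reportedly wire this up through environment variables such as ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN rather than SDK code, so the agent UX stays unchanged.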

news · PRIMARY · 2026-04-26
DeepSeek cuts input cache-hit price 90% to $0.003625 per 1M tokens

DeepSeek said cache-hit pricing across its API series is now one-tenth of launch levels, on top of the temporary V4-Pro discount through May 5. The cut lowers costs for cache-heavy long-context and agent workloads, so teams should recheck spend assumptions.
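
How much this saves depends almost entirely on your cache-hit fraction; a small sketch of the blended input rate, using the $0.14/1M miss rate from the record above and the announced hit price (the two figures may apply to different models in the lineup, so treat this as illustrative):

```python
def blended_input_rate(hit_fraction: float,
                       miss_rate: float = 0.14,          # $/1M cache-miss input (record above)
                       hit_rate: float = 0.003625) -> float:  # $/1M cache-hit input (announced)
    """Effective $/1M input tokens given the share of tokens served from cache."""
    return hit_fraction * hit_rate + (1 - hit_fraction) * miss_rate

for f in (0.0, 0.5, 0.9):
    print(f"{f:.0%} cache hits -> ${blended_input_rate(f):.4f} per 1M input tokens")
```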

news · PRIMARY · 2026-04-25
SGLang supports DeepSeek V4 with 199 tok/s on B200 and 240 tok/s at 900K context

SGLang and Miles published a technical breakdown of their DeepSeek V4 day-zero stack, including ShadowRadix caching, Flash Compressor, FP4 expert-weight handling, and measured B200/H200 throughput. That gives deployers concrete serving and training-path numbers for V4 beyond generic launch-day compatibility claims.

release · SECONDARY · 2026-04-25
OpenClaw 2026.4.24 adds voice-call handoff and browser recovery

OpenClaw shipped a release that routes realtime voice queries to the full agent, defaults new users to V4 Flash, and adds coordinate clicks plus stale-lock recovery for browser automation. It also fixes Telegram, Slack, MCP session, and TTS issues, so update if those flows matter to your setup.

news · PRIMARY · 2026-04-25
DeepSeek cuts V4-Pro API 75% to $0.43/$0.87 per 1M tokens through May 5

DeepSeek lowered V4-Pro API pricing and updated integration guidance for Claude Code, OpenCode, and OpenClaw a day after V4 launched. Check whether V4-Flash is the easier deployment today, since Pro remains heavier and more rate-limited.

release · PRIMARY · 2026-04-24
DeepSeek V4 reports CSA/HCA attention and 10% KV cache at 1M context

Engineers unpacked DeepSeek V4's hybrid CSA/HCA attention a day after launch; it claims 27% of V3.2 FLOPs and 10% of its KV cache at 1M tokens. External tests pushed V4 Pro near the top of open-model indexes, but users also reported rate limits and mixed third-party results.
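
To put the KV-cache figure in perspective, a back-of-the-envelope sketch; every dimension here (layer count, KV heads, head size, 1-byte storage) is an assumed round number, not a published V4 spec:

```python
def kv_cache_gib(tokens: int, layers: int = 60, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 1) -> float:
    """Rough KV-cache size in GiB: 2 (K and V) * layers * heads * head_dim * bytes * tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

full = kv_cache_gib(1_000_000)  # a conventional dense cache at 1M tokens, under these assumptions
print(f"baseline: {full:.0f} GiB; at 10%: {full * 0.10:.1f} GiB")  # ~114 GiB vs ~11.4 GiB
```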

news · PRIMARY · 2026-04-24
DeepSeek V4 adds day-1 support from vLLM, SGLang, Ollama, OpenCode, Venice, and Together

Within a day of launch, vLLM, SGLang, Ollama cloud, OpenCode, Venice, Together, and Baseten added support or hosted access for DeepSeek V4. That makes Flash and Pro easier to test across local, routed, and managed agent stacks.

release · PRIMARY · 2026-04-23
DeepSeek releases V4-Pro and V4-Flash with 1M context and $0.14/M input

DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.

release · SECONDARY · 2026-04-23
DeepSeek releases Tile Kernels with Engram, mHC, and FP4/FP8 ops for SM90 and SM100 GPUs

DeepSeek published Tile Kernels, an open-source TileLang repo covering Engram, mHC, MoE routing, and FP4/FP8 kernels, with claims that some are already used in internal training and inference. That matters because it exposes reusable low-level performance work behind DeepSeek’s stack instead of keeping the kernels fully private.

AI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.