DeepSeek V4 reports CSA/HCA attention and 10% KV cache at 1M context
Engineers unpacked DeepSeek V4's hybrid CSA/HCA attention a day after launch; it claims 27% of V3.2 FLOPs and 10% of its KV cache at 1M tokens. External tests pushed V4 Pro near the top of open-model indexes, but users also reported rate limits and mixed third-party results.

TL;DR
- DeepSeek shipped two MIT-licensed open-weight models, V4 Pro at 1.6T total and 49B active parameters, and V4 Flash at 284B total and 13B active, with 1M-token context on both, according to iScienceLuvr's release post and ben_burtenshaw's headline numbers.
- The architecture change is the real story: DeepSeek says V4 Pro needs 27% of V3.2's single-token inference FLOPs and 10% of its KV cache at 1M context, while ben_burtenshaw's attention summary and nrehiew_'s architecture read unpack that as a CSA plus HCA hybrid attention stack.
- Early external scorecards were strong but not cleanly dominant: Artificial Analysis placed V4 Pro second among open weights on its Intelligence Index and first on GDPval-AA, while bridgemindai's Code Arena screenshot showed only a #14 debut on LMArena Code.
- Flash drew the most enthusiasm from practitioners, because the main HN thread highlighted reports that it was cheap, fast, and more usable than Pro under launch-day limits, and arena's Pareto chart put Flash Thinking on the text price-performance frontier.
- Day-one ecosystem support was unusually broad, with vLLM's launch thread detailing a custom hybrid KV implementation, SGLang's day-zero post shipping native serving and RL support, and badlogicgames' compatibility report surfacing an API edge case around a missing reasoning_content field.
You can read the official paper PDF, browse the model collection on Hugging Face, check DeepSeek's launch announcement and API notes, skim Simon Willison's early hands-on writeup, and compare the day-one serving work in the vLLM thread and SGLang cookbook. The weird bit is that launch-day discourse split fast: Artificial Analysis liked Pro's open-weight standing, bridgemindai did not, and Hacker News commenters kept steering people back to Flash.
What shipped
DeepSeek's preview release is a two-model lineup, not a single flagship, per iScienceLuvr's release post.
The official package, visible in the Hugging Face collection and echoed by thursdai_pod's launch thread, breaks down like this:
- V4 Pro: 1.6T total params, 49B active, 1M context
- V4 Flash: 284B total params, 13B active, 1M context
- License: MIT
- Modes: thinking and non-thinking variants
- Pricing: Pro at $1.74 input and $3.48 output per 1M tokens, Flash at $0.14 input and $0.28 output per 1M tokens, per thursdai_pod's pricing screenshot
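Those launch prices make the Pro/Flash gap easy to quantify. A minimal cost helper using only the per-token rates from thursdai_pod's pricing screenshot (the function and dict names are illustrative, not part of any DeepSeek SDK):

```python
# Launch-day prices quoted above; check DeepSeek's API page for current rates.
PRICES_PER_1M = {            # (input, output) USD per 1M tokens
    "v4-pro":   (1.74, 3.48),
    "v4-flash": (0.14, 0.28),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted launch prices."""
    inp, out = PRICES_PER_1M[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A 100k-token prompt with a 2k-token answer, Pro versus Flash:
pro = request_cost("v4-pro", 100_000, 2_000)      # ~$0.181
flash = request_cost("v4-flash", 100_000, 2_000)  # ~$0.015
```

At those rates a long-context request is roughly 12.4x cheaper on Flash, which lines up with the rough 12x launch-price gap arena's Pro versus Flash comparison cites.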
A smaller but important product change sits in thursdai_pod's follow-up: deepseek-chat and deepseek-reasoner are being deprecated and mapped to V4 Flash non-thinking and thinking for compatibility. That makes Flash look like the default traffic path, not a sidecar model.
CSA and HCA
The paper frames V4 as an answer to long-context cost, and the two new attention modes are the mechanism, per iScienceLuvr's paper screenshot.
Across ben_burtenshaw's summary, nrehiew_'s CSA walkthrough, and nrehiew_'s HCA note, the stack resolves into a few concrete pieces:
- CSA, Compressed Sparse Attention: compresses KV blocks at 4x, then runs top-k sparse selection over the compressed stream.
- HCA, Heavily Compressed Attention: compresses much more aggressively, at 128x, and skips the selector.
- Interleaving across layers: V4 alternates less sparse and more sparse attention instead of pretending every layer wants the same memory pattern.
- Sliding-window locality: recent tokens still get direct local treatment so the compressor does not erase nearby detail.
- MQA over compressed KV: several readers, including stochasticchasm's MQA note and eliebakouch's MQA note, flagged this as one of the stranger but more elegant parts of the design.
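The CSA pieces above compose into something small enough to sketch. A toy, single-head NumPy illustration of the flow, using mean-pooling as a stand-in for DeepSeek's learned compressor; every name and the pooling choice are assumptions for illustration, not the paper's kernels:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def csa_attend(q, K, V, block=4, top_k=2, window=8):
    """Toy CSA: 4x block compression, top-k selection, local window."""
    T, d = K.shape
    nb = T // block
    # 1) 4x compression: one pooled key/value per block (mean-pool stand-in)
    Kc = K[:nb * block].reshape(nb, block, d).mean(axis=1)
    Vc = V[:nb * block].reshape(nb, block, d).mean(axis=1)
    # 2) top-k sparse selection over the compressed stream
    scores = Kc @ q / np.sqrt(d)
    keep = np.argsort(scores)[-top_k:]
    # 3) sliding-window locality: recent tokens stay uncompressed
    keys = np.concatenate([Kc[keep], K[-window:]])
    vals = np.concatenate([Vc[keep], V[-window:]])
    w = softmax(keys @ q / np.sqrt(d))
    return w @ vals

rng = np.random.default_rng(0)
T, d = 64, 16
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
out = csa_attend(q, K, V)
# Attention touches 2 compressed blocks + 8 local tokens = 10 entries vs 64 raw
```

HCA, in this framing, would drop step 2 entirely and compress at 128x instead of 4x, which is why it needs no selector.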
The payoff is the chart DeepSeek wanted everyone to see. At 1M context, ben_burtenshaw's efficiency post says V4 Pro falls to 27% of V3.2's single-token FLOPs and 10% of its KV cache, while V4 Flash drops to 10% of FLOPs and 7% of KV cache.
Serving the cache
The architecture change only matters if somebody can actually serve it. That is why the day-zero infra posts were almost as interesting as the launch itself.
According to vLLM's implementation notes, supporting V4 in vLLM required:
- a unified hybrid KV cache across multiple compression rates
- page-size bucketing for a five-way cache stack
- fused kernels for compression, RoPE, quantization, and cache insert
- multistream overlap between indexer work, main-KV compression, and sliding-window insertion
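The first two bullets hint at an allocator problem: caches compressed at different rates want different page sizes. A hedged sketch of one way page-size bucketing could work, where each cache kind's page size divides a common bucket so all five kinds tile one pool; the kinds and sizes below are invented for illustration, not vLLM's actual constants:

```python
from math import lcm

# Hypothetical five-way cache stack: tokens of context covered per page.
PAGE_TOKENS = {
    "full":       16,    # uncompressed KV
    "window":     16,    # sliding-window KV
    "csa_4x":     64,    # 4x-compressed blocks span more context per page
    "hca_128x": 2048,    # 128x-compressed blocks
    "indexer":   128,    # top-k indexer metadata
}

BUCKET = lcm(*PAGE_TOKENS.values())  # smallest bucket every page size tiles

def pages_needed(kind: str, context_tokens: int) -> int:
    """Whole pages of one cache kind needed to cover a context length."""
    size = PAGE_TOKENS[kind]
    return -(-context_tokens // size)  # ceiling division

# At 1M context, the 128x stream needs two orders of magnitude fewer pages:
full_pages = pages_needed("full", 1_000_000)
hca_pages = pages_needed("hca_128x", 1_000_000)
```

The design payoff is that freeing one bucket can always be recycled for any cache kind, which is what makes a unified pool across compression rates tractable.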
SGLang took a similar angle but emphasized serving features around the architecture: its launch post called out ShadowRadix prefix caching for compressed KV pools, a Flash Compressor kernel, a Lightning TopK indexer, HiSparse for CPU-backed sparse-attention cache extension, and a verified Miles RL pipeline.
That breadth helps explain why some launch-day reactions treated V4 as an infra paper wearing a model release badge. ben_burtenshaw's infra note called the deterministic kernel work alone paper-worthy, and stochasticchasm's final impressions said the standout was how much engineering got compressed into a few report pages.
External scorecards
Vendor charts were flattering, but the outside picture was more useful.
The cleanest third-party snapshot came from Artificial Analysis, which reported that V4 Pro scored 52 on its Intelligence Index, up from 42 for V3.2, and landed second among open-weight reasoning models behind Kimi K2.6. The same post put V4 Pro at 1554 on GDPval-AA, ahead of Kimi K2.6, GLM-5.1, GLM-5, and MiniMax-M2.7.
Artificial Analysis also added two caveats that got less attention in the hype cycle:
- V4 Pro cost $1,071 to run its index, versus $71 for V3.2, largely because Artificial Analysis' cost breakdown measured 190M output tokens for Pro and 240M for Flash.
- V4 Pro and Flash both posted very high hallucination rates, 94% and 96% respectively, in Artificial Analysis' AA-Omniscience chart.
Other evaluators leaned more positive on coding. Vals AI said V4 became its top open-weight model on Vibe Code Benchmark, and Vals AI's follow-up claimed #1 positions on SWE Bench, LiveCodeBench, IOI, and Vibe Code Bench inside its own stack.
Mixed reality on Pro versus Flash
The most consistent launch-day argument was not whether V4 was good. It was which V4 mattered.
In the main HN thread on DeepSeek v4 (1.9k upvotes, 1.5k comments), one of the most upvoted practitioner comments called Flash the model to watch because it was cheap, effective, and fast, while Pro was described as slow, unreliable, and too rate-limited to be very useful. Another comment in the same thread questioned whether the official benchmarks matched what third-party testers were seeing.
That split showed up elsewhere. bridgemindai's Code Arena screenshot placed DeepSeek V4 Pro at #14 on LMArena Code, and bridgemindai's BridgeBench repost said V4 Pro ranked last on BridgeBench. On the other side, arena's Pareto post argued that V4 Flash Thinking shifted the text price-performance frontier, while arena's Pro versus Flash comparison said Pro ranked about 30 places higher than Flash variants at roughly 12x the launch price.
The strongest day-one reading is not that the benchmarks were wrong. It is that Flash had a clearer product story. Pro arrived as a showcase model with obvious throughput constraints, and Flash arrived as a budget agent model people could actually picture using.
Ecosystem rollout
DeepSeek did not wait for the ecosystem to catch up. The ecosystem was part of the launch.
By the end of day one, the evidence pool showed support across hosted inference, local runtimes, and developer tooling:
- AskVenice put Pro and Flash live on Venice.
- Together AI shipped V4 Pro with its own pricing and a 99.9% SLA pitch.
- opencode added both models in Go, with a note that capacity and usage limits were still being worked out.
- ollama's repost said deepseek-v4-flash was available on Ollama Cloud.
- TheZachMueller's repost showed Flash running on a 256GB Mac through an MLX-LM pull request, at about 4.2 tokens per second and roughly 183GB peak memory.
One more rollout signal came from lmsysorg's follow-up, which cited 250 tok/s decode in a single-user production setup on SGLang. That number arrived after the initial launch wave and underlined the same theme as the paper: DeepSeek is trying to move the bottleneck from memory to engineering.
Compatibility footnotes
One of the most practical bugs surfaced after the launch party.
In badlogicgames' post, a missing reasoning_content field on some assistant messages caused 400 errors when deepseek-v4-flash was routed through OpenRouter with thinking enabled. The attached issue report said the session became unrecoverable after the first failure.
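Until routers and harnesses catch up, the defensive fix sits on the client side. A hedged sketch of a replay shim that backfills the field badlogicgames' report names before resending a conversation; whether an empty string actually satisfies the API is an assumption to verify against DeepSeek's docs:

```python
def backfill_reasoning(messages: list[dict]) -> list[dict]:
    """Return a copy of the chat history safe to replay with thinking on.

    Assistant turns that lack reasoning_content get an empty one, so a
    router that forwards the history verbatim does not trigger a 400.
    """
    patched = []
    for msg in messages:
        if msg.get("role") == "assistant" and "reasoning_content" not in msg:
            msg = {**msg, "reasoning_content": ""}  # original dict untouched
        patched.append(msg)
    return patched

history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},  # missing reasoning_content
]
safe = backfill_reasoning(history)
```

The point of copying rather than mutating is that the unpatched history stays usable for providers that reject unknown fields instead of requiring them.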
That is a small protocol detail, but it matters because DeepSeek explicitly shipped V4 into a router-heavy ecosystem. The same day-one record that showed quick adoption also showed the tax that new reasoning formats impose on harnesses, proxies, and compatibility layers.
The other footnote is more strategic. teortaxesTex's quote from the small print highlighted a line saying V4 Pro throughput was currently very limited because of high-end compute constraints, with lower prices expected after Ascend 950 super nodes are mass released in H2 2026. For all the talk about attention design, that one sentence keeps the hardware story in the frame.