AI Primer

DeepSeek V4 reports CSA/HCA attention and 10% KV cache at 1M context

Engineers unpacked DeepSeek V4's hybrid CSA/HCA attention a day after launch; the release claims 27% of V3.2's FLOPs and 10% of its KV cache at 1M tokens. External tests pushed V4 Pro near the top of open-model indexes, but users also reported rate limits and mixed third-party results.


TL;DR

You can read the official paper PDF, browse the model collection on Hugging Face, check DeepSeek's launch announcement and API notes, skim Simon Willison's early hands-on writeup, and compare the day-one serving work in the vLLM thread and SGLang cookbook. The weird bit is that launch-day discourse split fast: Artificial Analysis liked Pro's open-weight standing, bridgemindai did not, and Hacker News commenters kept steering people back to Flash.

What shipped

DeepSeek's preview release is a two-model lineup, not a single flagship, per iScienceLuvr's release post.

The official package, visible in the Hugging Face collection and echoed by thursdai_pod's launch thread, breaks down like this:

  • V4 Pro: 1.6T total params, 49B active, 1M context
  • V4 Flash: 284B total params, 13B active, 1M context
  • License: MIT
  • Modes: thinking and non-thinking variants
  • Pricing: Pro at $1.74 input and $3.48 output per 1M tokens, Flash at $0.14 input and $0.28 output per 1M tokens, per thursdai_pod's pricing screenshot
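
As a sanity check on the price gap, per-request cost at those list rates works out as below. The rates come from the list above; the token counts and model keys are made up for illustration.

```python
# Launch list prices in USD per 1M tokens, from the lineup above.
PRICES = {
    "v4-pro":   {"input": 1.74, "output": 3.48},
    "v4-flash": {"input": 0.14, "output": 0.28},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at launch list prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A hypothetical long-context call: 800k tokens in, 4k tokens out.
pro_cost = request_cost("v4-pro", 800_000, 4_000)      # ≈ $1.41
flash_cost = request_cost("v4-flash", 800_000, 4_000)  # ≈ $0.11
```

At these rates the same call comes out roughly 12x cheaper on Flash, which is the gap the Pro-versus-Flash debate keeps circling back to.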

A smaller but important product change sits in thursdai_pod's follow-up: deepseek-chat and deepseek-reasoner are being deprecated and mapped to V4 Flash non-thinking and thinking for compatibility. That makes Flash look like the default traffic path, not a sidecar model.
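The deprecation note amounts to a compatibility shim: old endpoint names resolve to V4 Flash modes. A minimal sketch of that mapping, with the identifier strings and the thinking-flag shape as illustrative assumptions rather than DeepSeek's official API schema:

```python
# Old endpoints map onto V4 Flash modes per the deprecation note.
# Exact identifier strings and the flag shape are assumptions.
LEGACY_ALIASES = {
    "deepseek-chat":     ("deepseek-v4-flash", False),  # non-thinking
    "deepseek-reasoner": ("deepseek-v4-flash", True),   # thinking
}

def resolve_model(requested: str):
    """Return (model, thinking) for a possibly-deprecated model name."""
    return LEGACY_ALIASES.get(requested, (requested, None))
```

Anything not in the alias table passes through unchanged, which is why the mapping reads as a default-traffic decision rather than a breaking change.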

CSA and HCA

The paper frames V4 as an answer to long-context cost, and the two new attention modes are the mechanism, per iScienceLuvr's paper screenshot.

Across ben_burtenshaw's summary, nrehiew_'s CSA walkthrough, and nrehiew_'s HCA note, the stack resolves into a few concrete pieces:

  1. CSA, Compressed Sparse Attention: compresses KV blocks at 4x, then runs top-k sparse selection over the compressed stream.
  2. HCA, Heavily Compressed Attention: compresses much more aggressively, at 128x, and skips the selector.
  3. Interleaving across layers: V4 alternates less sparse and more sparse attention instead of pretending every layer wants the same memory pattern.
  4. Sliding-window locality: recent tokens still get direct local treatment so the compressor does not erase nearby detail.
  5. MQA over compressed KV: several readers, including stochasticchasm's MQA note and eliebakouch's MQA note, flagged this as one of the stranger but more elegant parts of the design.
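
The pieces above can be compressed into a toy sketch of the CSA path: mean-pool KV into blocks (standing in for the 4x compressor), run top-k selection over the compressed stream, then attend over the surviving tokens plus a local window. Everything here, from the pooling compressor to the shapes, is an illustrative assumption, not DeepSeek's kernels.

```python
import numpy as np

def csa_toy(q, k, v, block=4, topk=2, window=8):
    """Toy Compressed Sparse Attention for a single query vector."""
    T, d = k.shape
    n_blocks = T // block
    # "4x compression": mean-pool each block of keys into one summary key.
    k_c = k[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    # Top-k sparse selection over the compressed stream.
    keep = np.argsort(k_c @ q)[-topk:]
    # Selected raw tokens, plus sliding-window locality for recent tokens.
    idx = {i for b in keep for i in range(b * block, (b + 1) * block)}
    idx |= set(range(max(0, T - window), T))
    idx = np.array(sorted(idx))
    # Ordinary softmax attention over the surviving tokens only.
    w = np.exp(k[idx] @ q / np.sqrt(d))
    w /= w.sum()
    return w @ v[idx]

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal(8), rng.standard_normal((32, 8)), rng.standard_normal((32, 8))
out = csa_toy(q, k, v)  # one attended output vector of size 8
```

HCA, in this framing, would push the compression factor to 128 and drop the top-k step entirely; the layer interleaving and MQA-over-compressed-KV details sit above this sketch.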

The payoff is the chart DeepSeek wanted everyone to see. At 1M context, ben_burtenshaw's efficiency post says V4 Pro falls to 27% of V3.2's single-token FLOPs and 10% of its KV cache, while V4 Flash drops to 10% of FLOPs and 7% of KV cache.

Serving the cache

The architecture change only matters if somebody can actually serve it. That is why the day-zero infra posts were almost as interesting as the launch itself.

According to vLLM's implementation notes, supporting V4 in vLLM required:

  • a unified hybrid KV cache across multiple compression rates
  • page-size bucketing for a five-way cache stack
  • fused kernels for compression, RoPE, quantization, and cache insert
  • multistream overlap between indexer work, main-KV compression, and sliding-window insertion
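
The "page-size bucketing" item can be pictured with a toy allocator: pools compressed at different rates pack different numbers of tokens into a fixed-size page, so page counts must be computed per pool. Three of the (reportedly five) pools are shown, and the 64-token base page size is an assumption, not vLLM's actual configuration.

```python
from dataclasses import dataclass

PAGE_TOKENS = 64  # tokens per page in the uncompressed pool (assumed)

@dataclass
class CachePool:
    name: str
    compression: int  # 1 = raw KV, 4 = CSA-style, 128 = HCA-style

    def pages_needed(self, context_tokens: int) -> int:
        # A compressed page holds `compression`x more tokens.
        per_page = PAGE_TOKENS * self.compression
        return -(-context_tokens // per_page)  # ceiling division

pools = [CachePool("sliding-window", 1), CachePool("csa", 4), CachePool("hca", 128)]
needed = {p.name: p.pages_needed(1_000_000) for p in pools}
# {'sliding-window': 15625, 'csa': 3907, 'hca': 123}
```

The order-of-magnitude spread between pools is the whole problem: one paging scheme has to serve all of them at once, which is what the bucketing and the unified hybrid cache are for.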

SGLang took a similar angle but emphasized serving features around the architecture: its launch post called out ShadowRadix prefix caching for compressed KV pools, a Flash Compressor kernel, a Lightning TopK indexer, HiSparse for CPU-backed sparse-attention cache extension, and a verified Miles RL pipeline.

That breadth helps explain why some launch-day reactions treated V4 as an infra paper wearing a model release badge. ben_burtenshaw's infra note called the deterministic kernel work alone paper-worthy, and stochasticchasm's final impressions said the standout was how much engineering got compressed into a few report pages.

External scorecards

Vendor charts were flattering, but the outside picture was more useful.

The cleanest third-party snapshot came from Artificial Analysis, which reported that V4 Pro scored 52 on its Intelligence Index, up from 42 for V3.2, and landed second among open-weight reasoning models behind Kimi K2.6. The same post put V4 Pro at 1554 on GDPval-AA, ahead of Kimi K2.6, GLM-5.1, GLM-5, and MiniMax-M2.7.

Artificial Analysis attached caveats to those numbers as well, though they got far less attention in the hype cycle than the headline scores.

Other evaluators leaned more positive on coding. Vals AI said V4 became its top open-weight model on Vibe Code Benchmark, and Vals AI's follow-up claimed #1 positions on SWE Bench, LiveCodeBench, IOI, and Vibe Code Bench inside its own stack.

Mixed reality on Pro versus Flash

The most consistent launch-day argument was not whether V4 was good. It was which V4 mattered.

The main Hacker News thread on DeepSeek V4 drew 1.9k upvotes and 1.5k comments.

In the main HN thread, one of the most upvoted practitioner comments called Flash the model to watch because it was cheap, effective, and fast, while Pro was described as slow, unreliable, and too rate-limited to be very useful. Another HN comment in the same thread questioned whether the official benchmarks matched what third-party testers were seeing.

That split showed up elsewhere. bridgemindai's Code Arena screenshot placed DeepSeek V4 Pro at #14 on LMArena Code, and bridgemindai's BridgeBench repost said V4 Pro ranked last on BridgeBench. On the other side, arena's Pareto post argued that V4 Flash Thinking shifted the text price-performance frontier, while arena's Pro versus Flash comparison said Pro ranked about 30 places higher than Flash variants at roughly 12x the launch price.

The strongest day-one reading is not that the benchmarks were wrong. It is that Flash had a clearer product story. Pro arrived as a showcase model with obvious throughput constraints, and Flash arrived as a budget agent model people could actually picture using.

Ecosystem rollout

DeepSeek did not wait for the ecosystem to catch up. The ecosystem was part of the launch.

By the end of day one, the evidence pool showed support across hosted inference, local runtimes, and developer tooling:

  • AskVenice put Pro and Flash live on Venice.
  • Together AI shipped V4 Pro with its own pricing and a 99.9% SLA pitch.
  • opencode added both models in Go, with a note that capacity and usage limits were still being worked out.
  • ollama's repost said deepseek-v4-flash was available on Ollama Cloud.
  • TheZachMueller's repost showed Flash running on a 256GB Mac through an MLX-LM pull request, at about 4.2 tokens per second and roughly 183GB peak memory.

One more rollout signal came from lmsysorg's follow-up, which cited 250 tok/s decode in a single-user production setup on SGLang. That number arrived after the initial launch wave and underlined the same theme as the paper: DeepSeek is trying to move the bottleneck from memory to engineering.

Compatibility footnotes

One of the most practical bugs surfaced after the launch party.

In badlogicgames' post, a missing reasoning_content field on some assistant messages caused 400 errors when deepseek-v4-flash was routed through OpenRouter with thinking enabled. The attached issue report said the session became unrecoverable after the first failure.
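
One defensive workaround, sketched from the bug description, is to re-add the missing field to assistant turns before resending history through the router. The field name comes from the report; whether an empty string is actually accepted by the endpoint is an assumption.

```python
def patch_history(messages):
    """Ensure every assistant turn carries a `reasoning_content` field.

    Client-side workaround sketch for the 400s described above: some
    routers drop the field from assistant messages, and the
    thinking-enabled endpoint then rejects the whole conversation on
    the next turn.
    """
    patched = []
    for m in messages:
        m = dict(m)  # copy; don't mutate the caller's history
        if m.get("role") == "assistant" and "reasoning_content" not in m:
            m["reasoning_content"] = ""
        patched.append(m)
    return patched
```

Applying this once before every resend keeps the session recoverable even after the first failed turn.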

That is a small protocol detail, but it matters because DeepSeek explicitly shipped V4 into a router-heavy ecosystem. The same day-one record that showed quick adoption also showed the tax that new reasoning formats impose on harnesses, proxies, and compatibility layers.

The other footnote is more strategic. teortaxesTex's quote from the small print highlighted a line saying V4 Pro throughput was currently very limited because of high-end compute constraints, with lower prices expected after Ascend 950 super nodes are mass released in H2 2026. For all the talk about attention design, that one sentence keeps the hardware story in the frame.
