AI Primer

MiniMax claims M3 sparse attention cuts 1M-token prefill 9.7x and decode 15.6x

MiniMax started winding down its M2 series while previewing M3 and a new sparse-attention design with large long-context speedup claims. The teaser points to a fresh open-model race around block selection, GQA, and million-token serving efficiency.


TL;DR

MiniMax opened with a tiny teaser, but kimmonismus pulled out the benchmark claims fast. eliebakouch then mapped the design against DeepSeek's recent sparse-attention variants, and MiniMax_AI's repost of that comparison effectively boosted the same framing from its official account. The odd part is the timing: andrew_n_carr's repost points back to a fresh M2 paper just as MiniMax_AI declares the M2 series over.

The 1M-token speed claims

The headline number is about serving at extreme context length, not a general performance bump. According to kimmonismus' summary of the teaser, the chart claims 9.7x faster prefilling and 15.6x faster decoding at 1M tokens versus M2.

That is a specific bet on long-context inference economics. testingcatalog's recap kept the same framing, pointing to sparse attention as the defining change behind M3 rather than a vague model-family refresh.

Block selection over real KV blocks

The clearest technical read came from eliebakouch's comparison thread, which placed the design relative to DeepSeek-style sparse attention in three points, two of them differences and one a similarity:

  • based on GQA, not MLA
  • block-level selection, similar to CSA
  • attention computed on the real KV cache, not only in a compressed dimension

That makes M3 look like a block-sparse retrieval scheme layered onto a more standard attention stack. teortaxesTex called it a streamlined version of the broader sparse-attention design space, while cedric_chee reduced the same point to a simpler label: block-based sparse attention.
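To make the three bullets concrete, here is a minimal sketch of a single decode step, assuming the blocks to attend to have already been chosen. It is not MiniMax's implementation, which has not been published. The function name `gqa_block_sparse_decode`, the list-of-lists layout, and the parameters `block_size` and `group_size` are all illustrative assumptions. The point it shows is that query heads share a KV head (GQA) and that softmax attention runs over the uncompressed K/V rows of the selected blocks only.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a plain list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gqa_block_sparse_decode(queries, keys, values, selected_blocks,
                            block_size, group_size):
    """One decode step of block-sparse attention with grouped query heads.

    queries:         [n_q_heads][d]            one vector per query head
    keys, values:    [n_kv_heads][seq_len][d]  uncompressed KV cache
    selected_blocks: [n_kv_heads][...]         block indices to attend to
    All names and shapes are assumptions made for illustration.
    """
    outputs = []
    for h, q in enumerate(queries):
        kv = h // group_size  # GQA: consecutive query heads share one KV head
        seq_len = len(keys[kv])
        # Gather positions covered by the chosen blocks only.
        positions = [
            p
            for b in sorted(selected_blocks[kv])
            for p in range(b * block_size, min((b + 1) * block_size, seq_len))
        ]
        scale = 1.0 / math.sqrt(len(q))
        # Scores are computed against the real keys, not a compressed latent.
        weights = softmax([dot(q, keys[kv][p]) * scale for p in positions])
        d_v = len(values[kv][0])
        outputs.append([
            sum(w * values[kv][p][j] for w, p in zip(weights, positions))
            for j in range(d_v)
        ])
    return outputs
```

In this sketch the per-step cost scales with the number of selected blocks times `block_size` rather than with the full sequence length, which is the kind of saving a long-context speedup claim would rest on.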

M2's full-attention detour makes the teaser more notable

MiniMax did not arrive here by steadily pushing sparse attention from one release to the next. kimmonismus noted that MiniMax had deliberately gone back to full attention for M2 because earlier efficient-attention approaches were not production-ready.

That history changes the read on the teaser. M3 is being presented as a return to efficiency work after M2's retreat, with a two-stage path that kimmonismus' summary describes as a lightweight indexing branch for block selection followed by sparse attention on the chosen KV blocks.
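The first stage of that path, the indexing branch, can be sketched as a cheap scorer that ranks blocks before any full attention runs. The teaser does not say how M3 scores blocks, so the mean-pooled key summary and dot-product score below are stand-in assumptions, and `mean_pool_blocks`, `select_blocks`, and `attended_positions` are hypothetical helper names.

```python
def mean_pool_blocks(keys, block_size):
    """Summarize each block of keys by its mean vector.

    Mean pooling is an assumed indexer feature, chosen only because it is
    simple; the actual M3 indexing branch is not described in the teaser.
    """
    summaries = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        d = len(block[0])
        summaries.append([sum(k[j] for k in block) / len(block)
                          for j in range(d)])
    return summaries


def select_blocks(query, summaries, top_k):
    """Return the indices of the top_k highest-scoring blocks, in order."""
    scores = [sum(q * s for q, s in zip(query, summary))
              for summary in summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def attended_positions(blocks, block_size, seq_len):
    """Expand block indices into the token positions stage two would read."""
    return [
        p
        for b in blocks
        for p in range(b * block_size, min((b + 1) * block_size, seq_len))
    ]


# Toy cache of six keys in three blocks of two.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
summaries = mean_pool_blocks(keys, block_size=2)
chosen = select_blocks([0.0, 1.0], summaries, top_k=1)               # [1]
positions = attended_positions(chosen, block_size=2, seq_len=len(keys))  # [2, 3]
```

Stage two would then run ordinary attention over just `positions`. In the toy case that is 2 of 6 cached tokens; at million-token context the same ratio is governed by `top_k` and `block_size`, both of which are unknown for M3.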

The M2 handoff

The rollout sequence is unusually compressed. andrew_n_carr's repost says MiniMax had just consolidated the work behind M2 into a new arXiv paper, and hours later MiniMax_AI posted, "This marks the end of the M2 series, and MiniMax-M3 is coming."

One more signal comes from the community side. teortaxesTex argued that M2 was originally meant to be a "Mini" model but became strong enough to anchor a product generation, which is why the jump to a full-size M3 is already being read as more than a routine version bump.
