Breaking · May 26, 2026

MiniMax claims M3 sparse attention cuts 1M-token prefill 9.7x and decode 15.6x

MiniMax started winding down its M2 series while previewing M3 and a new sparse-attention design with large long-context speedup claims. The teaser points to a fresh open-model race around block selection, GQA, and million-token serving efficiency.


TL;DR

MiniMax started winding down the M2 line while MiniMax_AI's follow-up post said MiniMax-M3 is next.
According to kimmonismus' summary of the teaser, MiniMax is claiming 1M-token serving gains of 9.7x on prefill and 15.6x on decoding versus M2.
eliebakouch's architecture comparison says the new design is GQA-based, uses block-level selection, and runs attention on real KV blocks rather than a compressed dimension.
MiniMax_AI's original teaser and testingcatalog's recap both frame M3 as an open-source model, which puts the speedup claims directly into the open-model long-context race.

MiniMax opened with a tiny teaser, but kimmonismus pulled out the benchmark claims fast. eliebakouch then mapped the design against DeepSeek's recent sparse-attention variants, and MiniMax_AI's repost of that comparison effectively boosted the same framing from its official account. The odd part is the timing: andrew_n_carr's repost points back to a fresh M2 paper just as MiniMax_AI declares the M2 series over.

The 1M-token speed claims

The headline number is about serving at extreme context length, not a general performance bump. According to kimmonismus' summary of the teaser, the chart claims 9.7x faster prefilling and 15.6x faster decoding at 1M tokens versus M2.

That is a specific bet on long-context inference economics. testingcatalog's recap kept the same framing, pointing to sparse attention as the defining change behind M3 rather than a vague model-family refresh.

Block selection over real KV blocks

The clearest technical read came from eliebakouch's comparison thread, which described two concrete differences from DeepSeek-style sparse designs:

based on GQA, not MLA
block-level selection, similar to CSA
attention computed on the real KV cache, not only in a compressed dimension

That makes M3 look like a block-sparse retrieval scheme layered onto a more standard attention stack. teortaxesTex called it a streamlined take on the broader sparse-attention design space, while cedric_chee reduced the same point to a simpler label: block-based sparse attention.
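To make those three properties concrete, here is a minimal pure-Python sketch of one decode step in that style. It is an illustration under stated assumptions, not MiniMax's implementation: the function names, the mean-key block summary, and the choice to share one block selection across a GQA group are all hypothetical, since the teaser does not publish those details.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_blocks(query, keys, block_size, top_k):
    """Score each KV block and return the indices of the top_k blocks.

    Assumption: a block is summarized by the mean of its keys. The real
    scoring function used by M3 is not described in the source.
    """
    n_blocks = math.ceil(len(keys) / block_size)
    scored = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), b))
    scored.sort(reverse=True)
    return sorted(b for _, b in scored[:top_k])

def gqa_block_sparse_decode(group_queries, keys, values, block_size, top_k):
    """One decode step for a group of query heads sharing one KV head (GQA)."""
    dim = len(keys[0])
    # Assumption: the whole group shares one block choice, scored with the
    # mean of the group's query vectors.
    mean_query = [sum(q[d] for q in group_queries) / len(group_queries)
                  for d in range(dim)]
    chosen = select_blocks(mean_query, keys, block_size, top_k)
    # Gather token positions inside the chosen blocks.
    idx = [i for b in chosen
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in group_queries:
        # Attention runs on the real, uncompressed keys and values.
        weights = softmax([dot(q, keys[i]) * scale for i in idx])
        outputs.append([sum(w * values[i][d] for w, i in zip(weights, idx))
                        for d in range(len(values[0]))])
    return outputs, chosen
```

The point of the sketch is the shape of the computation: selection works on cheap per-block summaries, while the softmax itself only ever sees full-precision keys and values from the chosen blocks.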

M2's full-attention detour makes the teaser more notable

MiniMax did not arrive here by steadily pushing sparse attention from one release to the next. kimmonismus noted that MiniMax had deliberately gone back to full attention for M2 because earlier efficient-attention approaches were not production-ready.

That history changes the read on the teaser. M3 is being presented as a return to efficiency work after M2's retreat, with a two-stage path that kimmonismus' summary describes as a lightweight indexing branch for block selection followed by sparse attention on the chosen KV blocks.
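A rough way to see why that two-stage path matters at long context is to count how many KV entries a single decode step has to touch. The sketch below is back-of-envelope accounting only. The function name is hypothetical, and the block size and number of selected blocks are illustrative assumptions, not figures from MiniMax. It counts attention reads alone, so its ratio should not be compared with the 9.7x and 15.6x end-to-end claims.

```python
import math

def kv_entries_touched(context_len, block_size, top_k):
    """Count KV entries read per decode step, dense versus two-stage sparse.

    Stage 1 (indexing): one summary per block is scored.
    Stage 2 (attention): only tokens in the top_k chosen blocks are read.
    """
    n_blocks = math.ceil(context_len / block_size)
    index_reads = n_blocks
    attend_reads = min(min(top_k, n_blocks) * block_size, context_len)
    return {
        "dense": context_len,
        "index": index_reads,
        "attend": attend_reads,
        "sparse_total": index_reads + attend_reads,
    }

# Hypothetical settings: 1M-token context, 128-token blocks, 64 blocks kept.
counts = kv_entries_touched(1_000_000, block_size=128, top_k=64)
print(counts)
print("dense / sparse reads:", counts["dense"] / counts["sparse_total"])
```

Under these assumed settings the indexing stage and the attention stage cost roughly the same number of reads, which is why the indexing branch has to stay lightweight for the scheme to pay off.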

The M2 handoff

The rollout sequence is unusually compressed. andrew_n_carr's repost says MiniMax had just consolidated the work behind M2 into a new arXiv paper, and hours later MiniMax_AI posted, "This marks the end of the M2 series, and MiniMax-M3 is coming."

The community side offered one more read on the handoff. teortaxesTex argued M2 was originally meant to be "Mini" but became strong enough to anchor a product generation, which is why the jump to a full-size M3 is already being read as more than a routine version bump.
