AI Primer
release

Mamba-3 updates its inference path with MIMO decode and new state updates

New write-ups on Mamba-3 add more detail on its MIMO decode path, discretization changes, and complex-valued state updates. That gives infra teams a clearer basis for testing state-space models as inference-efficient alternatives in long-sequence or agent-heavy systems.


TL;DR

  • Cartesia's launch post positions Mamba-3 as an "inference-first" state-space model, a shift from training-optimized linear models toward architectures tuned for decode-heavy production workloads.
  • The most concrete implementation change in the paper summary is MIMO decode: replacing the vector outer-product in the recurrence with a matrix multiplication to raise hardware utilization; the summary claims up to 4x more decode FLOPs "without increasing latency."
  • The same summary table highlights two other architectural changes: an exponential-trapezoidal discretization rule that replaces simpler updates from Mamba-2, and complex-valued state updates via data-dependent RoPE for stronger state tracking.
  • Together's thread context adds the deployment angle: agent workloads and inference-heavy RL rollouts are making decode speed more important, and it claims Mamba-3 is fastest on combined prefill+decode at the 1.5B-parameter scale against Mamba-2, Gated DeltaNet, and Llama-3.2-1B.

What changed in Mamba-3?

Cartesia's launch post frames Mamba-3 as a redesign for the part of the stack that now dominates cost and latency: inference. The linked write-up says earlier SSM advances helped efficiency, but Mamba-3 changes the model around "a world where AI workloads are increasingly dominated by inference," not just training throughput.

The clearest architectural deltas come from the paper summary. It describes a new exponential-trapezoidal discretization with a three-term recurrence that is "more expressive" than Mamba-2's exponential-Euler update, plus complex-valued state updates through data-dependent RoPE. In the summary's wording, that enables "rotational state dynamics" and improves tasks that require persistent state tracking, including parity-style problems that weaker linear dynamics struggle with.
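The intuition behind rotational state dynamics can be sketched with a toy example. This is not Mamba-3's parameterization (the function name, the fixed pi rotation angle, and the readout are all illustrative assumptions): it only shows why a complex-valued state that rotates, rather than merely decays, can track parity, the kind of task the summary says weaker linear dynamics struggle with.

```python
# Illustrative toy, NOT Mamba-3's actual update rule: a unit complex
# state that rotates by pi for every 1-bit and stays put for every
# 0-bit can carry running parity indefinitely, whereas a purely
# decaying real-valued linear state forgets it.
import numpy as np

def parity_via_rotation(bits):
    """Hypothetical rotational recurrence: angle is data-dependent."""
    h = 1 + 0j  # unit state on the complex circle
    for b in bits:
        # rotate by pi when b == 1, by 0 when b == 0
        h *= np.exp(1j * np.pi * b)
    # real part is +1 for even parity, -1 for odd parity
    return 0 if h.real > 0 else 1

bits = [1, 0, 1, 1, 0]
assert parity_via_rotation(bits) == sum(bits) % 2  # odd parity -> 1
```

A real-valued recurrence with eigenvalues in (0, 1) can only shrink its state toward zero, so no linear readout of it can keep flipping with each 1-bit; a rotation preserves magnitude and encodes the flip in phase.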

Why this matters for serving and evals

Together's thread context ties the research to a familiar infra problem: linear models can look efficient in FLOPs while still being memory-bound during decode. Its description of the MIMO path is practical: swapping the recurrence from a vector outer-product to a matrix multiply lets the model do more useful compute per decode step in roughly the same wall-clock time, because when GPU utilization is memory-bound the extra FLOPs ride along with state reads and writes that happen anyway.
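The trade can be sketched in a few lines of numpy. The shapes, names, and rank value below are illustrative assumptions, not Mamba-3's kernel interface; the point is only the arithmetic-intensity difference between a rank-1 outer-product update and a rank-r matmul update over the same state matrix.

```python
# Hedged sketch of the SISO-vs-MIMO decode trade; dimensions are made up.
import numpy as np

rng = np.random.default_rng(0)
d_state, d_head, rank = 16, 64, 4
H = np.zeros((d_state, d_head))  # recurrent state: read + written either way

# SISO-style decode step: rank-1 outer product,
# ~2 * d_state * d_head FLOPs per state read/write
b = rng.standard_normal(d_state)
x = rng.standard_normal(d_head)
H_siso = H + np.outer(b, x)

# MIMO-style decode step: rank-r matmul,
# ~2 * d_state * rank * d_head FLOPs over the same state traffic,
# i.e. roughly rank-times more compute per byte moved
B = rng.standard_normal((d_state, rank))
X = rng.standard_normal((rank, d_head))
H_mimo = H + B @ X
```

If decode is bound by moving `H` through memory rather than by arithmetic, the rank-r update raises FLOPs per step without a proportional latency increase, which matches the summary's "up to 4x more decode FLOPs" framing.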

That same thread context claims Mamba-3 delivers the fastest prefill+decode at 1.5B and beats Mamba-2, Gated DeltaNet, and Llama-3.2-1B at that scale. The paper summary adds a smaller but concrete quality delta, saying the MIMO variant improved accuracy by 1.2 points over a comparable baseline. Together also says kernels are open-sourced in the thread, which makes this more testable than a pure benchmark claim.
