Release · March 18, 2026

Mamba-3 updates its inference path with MIMO decode and new state updates

New write-ups on Mamba-3 add more detail on its MIMO decode path, discretization changes, and complex-valued state updates. That gives infra teams a clearer basis for testing state-space models as inference-efficient alternatives in long-sequence or agent-heavy systems.

LLM Serving · Inference Optimization

2 min read

TL;DR

Cartesia's launch post positions Mamba-3 as an "inference-first" state-space model, reflecting a shift from training-optimized linear models toward architectures tuned for decode-heavy production workloads.
The most concrete implementation change in the paper summary is MIMO decode: replacing the recurrence's vector outer-product with matrix multiplication to raise hardware utilization, with the summary claiming up to 4x more decode FLOPs "without increasing latency."
The same summary table highlights two other architectural changes: an exponential-trapezoidal discretization rule that replaces simpler updates from Mamba-2, and complex-valued state updates via data-dependent RoPE for stronger state tracking.
Together's thread adds the deployment angle: agent workloads and inference-heavy RL rollouts are making decode speed more important. The thread claims Mamba-3 is the fastest at the 1.5B scale on combined prefill+decode, ahead of Mamba-2, Gated DeltaNet, and Llama-3.2-1B.

What changed in Mamba-3?

Cartesia's launch post frames Mamba-3 as a redesign for the part of the stack that now dominates cost and latency: inference. The linked write-up says earlier SSM advances helped efficiency, but Mamba-3 shapes the model for "a world where AI workloads are increasingly dominated by inference," not just for training throughput.

The clearest architectural deltas come from the paper summary. It describes a new exponential-trapezoidal discretization with a three-term recurrence that is "more expressive" than Mamba-2's exponential-Euler update, plus complex-valued state updates through data-dependent RoPE. In the summary's wording, the complex-valued updates enable "rotational state dynamics" and improve performance on tasks that require persistent state tracking, including parity-style problems that weaker linear dynamics struggle with.
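To make the two ideas above concrete, here is a minimal scalar sketch in Python. It is an illustration under stated assumptions, not the paper's formulation: the coefficient choices in `trapezoid_step`, the mixing weight `lam`, and all function names are hypothetical. The first pair of functions contrasts a two-term Euler-style update with a three-term update that also uses the previous input. The last function shows why a rotating (complex) state can track parity: each 1-bit rotates the state by pi, so the sign of the real part encodes whether an even or odd number of 1s has been seen.

```python
import cmath
import math


def euler_step(h, x, a, dt):
    """Two-term update: decay the old state, then add the current input."""
    return math.exp(dt * a) * h + dt * x


def trapezoid_step(h, x_prev, x, a, dt, lam):
    """Three-term update: decayed state + decayed previous input + current input.

    lam in [0, 1] is a hypothetical mixing weight (an assumption of this
    sketch); lam = 1 drops the previous-input term and recovers euler_step.
    """
    decay = math.exp(dt * a)
    return decay * h + (1.0 - lam) * dt * decay * x_prev + lam * dt * x


def parity_by_rotation(bits):
    """Track parity with a unit-modulus complex state.

    A 0-bit leaves the state unchanged; a 1-bit rotates it by pi (sign flip).
    """
    h = 1 + 0j
    for b in bits:
        h *= cmath.exp(1j * math.pi * b)
    return 0 if h.real > 0 else 1
```

A state that only decays, with real multipliers between 0 and 1, cannot flip sign this way, which is one intuition for why rotational dynamics help on parity-style tasks.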

Why this matters for serving and evals

Together's thread ties the research to a familiar infra problem: linear models can look efficient in FLOPs while still being memory-bound during decode. Its description of the MIMO path is practical: swapping the recurrence from a vector outer product to a matrix multiply lets the model do more useful compute per decode step at the same speed. That is exactly the kind of trade that matters when decode is limited by memory bandwidth and the GPU's compute units sit underused.
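The outer-product versus matrix-multiply trade can be sketched with a rough cost model. The Python below is a back-of-the-envelope illustration, not the released kernels: the function names, the state shape (128 x 64), the rank of 4, and the 2-byte element size are all assumptions chosen for the example. It counts only the multiply-adds of the input term and the bytes needed to read and write the state plus the operands.

```python
def rank_r_update(state, b, x, decay):
    """Return decay * state + b @ x^T for an n x p state (lists of rows).

    b is n x r and x is p x r. With r = 1 the product is a vector outer
    product; with r > 1 it is a small matrix multiply.
    """
    r = len(b[0])
    return [
        [decay * state[i][j] + sum(b[i][k] * x[j][k] for k in range(r))
         for j in range(len(x))]
        for i in range(len(b))
    ]


def decode_step_cost(n, p, r, bytes_per_elem=2):
    """Rough per-token cost: (FLOPs of the input term, bytes moved)."""
    flops = 2 * n * p * r  # one multiply-add per state entry per rank
    # State is read and written once; operands b (n x r) and x (p x r) are read.
    bytes_moved = (2 * n * p + (n + p) * r) * bytes_per_elem
    return flops, bytes_moved


# Hypothetical shapes: rank 1 (outer product) versus rank 4 (matrix multiply).
flops_1, bytes_1 = decode_step_cost(128, 64, 1)
flops_4, bytes_4 = decode_step_cost(128, 64, 4)
print(flops_4 / flops_1, bytes_4 / bytes_1)
```

Under these assumptions the rank-4 update does four times the arithmetic while moving only slightly more data, because the state traffic dominates. If a decode step is memory-bound, its time tracks bytes moved rather than FLOPs, which is the intuition behind doing more compute "without increasing latency"; the sketch does not measure latency itself.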

The same thread claims Mamba-3 delivers the fastest prefill+decode at 1.5B, beating Mamba-2, Gated DeltaNet, and Llama-3.2-1B at that scale. The paper summary adds a smaller but concrete quality delta, saying the MIMO variant improved accuracy by 1.2 points over a comparable baseline. Together also says in the thread that the kernels are open-sourced, which makes this more testable than a pure benchmark claim.

🧾 More sources

What changed in Mamba-3?

Cartesia

@cartesia

Mamba-3 is out! 🐍 SSMs marked a major advance for the efficiency of modern LLMs. Mamba-3 takes the next step, shaping SSMs for a world where AI workloads are increasingly dominated by inference. Read about it on the Cartesia blog: blog.cartesia.ai/p/mamba-3

6:39 PM · Mar 18, 2026

Why this matters for serving and evals

Together AI

@togethercompute