Together introduced Mamba-3 and open-sourced kernels for a new MIMO state-space variant that targets decode efficiency and beats Mamba-2, GDN, and Llama 3.2 1B at 1.5B scale. Test it when deployment speed matters more than chasing another generic Transformer baseline.

Mamba-3 is the next Mamba release, but this one is explicitly tuned for deployment-time inference rather than training speed. Together's launch thread frames the problem as decode becoming memory-bound in agentic workloads and inference-heavy RL rollouts, while the linked blog post says Mamba-2 had focused more on training efficiency.
The main architectural change is MIMO, short for multi-input, multi-output. According to Together's paper and repo post, the model swaps the recurrence from a vector outer-product to a matrix multiply, aiming for a "stronger model at the same decode speed." The same post says kernels are open-sourced, with implementations using Triton, TileLang, and CuTe DSL in the public Mamba repository.
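The SISO-to-MIMO change can be sketched numerically. This is a minimal illustration, not Together's implementation: it assumes a simplified decayed-state recurrence of the form H_t = a·H_{t-1} + (input term), where the Mamba-2-style update adds a rank-1 outer product and the MIMO-style update adds a rank-r matrix product. All shapes and variable names are hypothetical.

```python
import numpy as np

# Illustrative dimensions (assumed, not from the paper):
# N = state dimension, P = head dimension, r = MIMO rank.
N, P, r = 16, 8, 4
rng = np.random.default_rng(0)

def siso_step(H, a, B, x):
    # Mamba-2-style update: a rank-1 outer product of B (N,) and x (P,)
    # is added to the decayed state.
    return a * H + np.outer(B, x)  # (N, P)

def mimo_step(H, a, B, X):
    # MIMO-style update: a rank-r matrix product B (N, r) @ X (r, P)
    # replaces the outer product, injecting more information per step
    # while the state H keeps the same size.
    return a * H + B @ X  # (N, P)

H0 = np.zeros((N, P))
a = 0.9
H_siso = siso_step(H0, a, rng.standard_normal(N), rng.standard_normal(P))
H_mimo = mimo_step(H0, a, rng.standard_normal((N, r)), rng.standard_normal((r, P)))
assert H_siso.shape == H_mimo.shape == (N, P)
```

The point of the sketch is the cost/quality trade: both updates touch the same (N, P) state per decode step, so memory traffic is comparable, but the MIMO update writes a rank-r increment instead of rank-1, which is the mechanism behind the "stronger model at the same decode speed" framing.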
The release claims are strongest at the 1.5B scale. Together's launch thread says Mamba-3 has the fastest combined prefill and decode there and outperforms Mamba-2, GDN, and Llama-3.2-1B. The linked paper summary adds a concrete delta: versus Gated DeltaNet at 1.5B, Mamba-3 gains 0.6 points in downstream accuracy, and the MIMO variant adds another 1.2 points.
The shared table in the results screenshot breaks that out: Mamba-3-SISO-1.5B posts 56.4 average accuracy, while Mamba-3-MIMO-1.5B reaches 57.6, alongside stronger scores on LAMBADA accuracy, HellaSwag, PIQA, and ARC-C. That supports the release's core pitch: not just a faster linear model, but a higher-quality one that keeps decode speed intact.
A practitioner reaction from Cedric Chee's post sums up the engineering angle: the story looks less like replacing Transformers everywhere and more like trying to "win the deployment bottleneck."
Introducing Mamba-3. Inference speeds are more important than ever, driven by the rise in agents and inference-heavy RL.
The newest model in the Mamba series is finally here. Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities.