Together releases Mamba-3 with MIMO decoding and the fastest prefill-plus-decode at 1.5B

Together introduced Mamba-3 and open-sourced kernels for a new MIMO state-space variant that targets decode efficiency and beats Mamba-2, GDN (Gated DeltaNet), and Llama 3.2 1B at the 1.5B scale. Test it when deployment speed matters more than chasing another generic Transformer baseline.


TL;DR

  • Together's launch thread introduced Mamba-3 as an inference-first state-space model family, with a new MIMO variant that "fixes" decode bottlenecks by replacing the recurrence's vector outer-product with a matrix multiply.
  • Together's launch thread says that at 1.5B parameters, Mamba-3 delivers the fastest prefill-plus-decode and beats Mamba-2, GDN, and Llama 3.2 1B on the reported comparisons.
  • The linked paper and repo post say the MIMO version improves quality without increasing decoding latency, and Together has open-sourced the kernels alongside the release.
  • A benchmark table shared in the results screenshot shows that the biggest quality lift at 1.5B comes from Mamba-3-MIMO, which reaches 57.6 average accuracy versus 56.4 for Mamba-3-SISO.

What shipped

Mamba-3 is the next Mamba release, but this one is explicitly tuned for deployment-time inference rather than training speed. Together's launch thread frames the problem as decode becoming memory-bound in agentic workloads and inference-heavy RL rollouts, while the linked blog post says Mamba-2 had focused more on training efficiency.
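To see why decode gets memory-bound, a back-of-envelope sketch helps. The shapes and precision below are our illustrative assumptions, not figures from the release: each decoded token must read and write the full recurrent state, so a rank-1 update does very few FLOPs per byte of state traffic, while a matmul-style update raises arithmetic intensity over the same traffic.

```python
# Back-of-envelope arithmetic intensity of one recurrent state update per
# decoded token. All shapes and the fp16 choice are illustrative
# assumptions, not numbers from Together's release.

N, P = 128, 64   # hypothetical state dimensions (state is N x P)
BYTES = 2        # fp16

def intensity(rank: int) -> float:
    # A rank-`rank` update H = a*H + B @ X does ~2*rank*N*P FLOPs, while
    # reading and writing the N x P state moves ~2*N*P*BYTES bytes.
    # (Traffic for the much smaller B and X factors is ignored.)
    flops = 2 * rank * N * P
    state_traffic = 2 * N * P * BYTES
    return flops / state_traffic

print(f"rank-1 (outer product): {intensity(1):.2f} FLOPs/byte")  # 0.50
print(f"rank-8 (matmul):        {intensity(8):.2f} FLOPs/byte")  # 4.00
```

Under these assumptions, a higher-rank update packs more useful compute under the same state read/write, which is consistent with the release's pitch of a stronger model at the same decode speed.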

The main architectural change is MIMO, short for multi-input, multi-output. According to Together's paper and repo post, the model swaps the recurrence from a vector outer-product to a matrix multiply, aiming for a "stronger model at the same decode speed." The same post says kernels are open-sourced, with implementations using Triton, TileLang, and CuTe DSL in the public Mamba repository.
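The paper's exact parameterization isn't reproduced here, but a minimal NumPy sketch of the general idea, with assumed shapes and a placeholder scalar decay `a`, shows how the state size stays fixed while the update's rank changes:

```python
import numpy as np

N, P, R = 128, 64, 8   # assumed: state rows, state cols, MIMO rank

def siso_step(H, a, B, x, C):
    """SISO-style decode step: rank-1 outer-product state update.
    H: (N, P) state, B: (N,), x: (P,), C: (N,)."""
    H = a * H + np.outer(B, x)   # rank-1 update
    return H, C @ H              # output y: (P,)

def mimo_step(H, a, B, X, C):
    """MIMO-style decode step: rank-R matrix-multiply state update.
    H: (N, P) state, B: (N, R), X: (R, P), C: (N,)."""
    H = a * H + B @ X            # rank-R update, same state size
    return H, C @ H              # output y: (P,)

rng = np.random.default_rng(0)
H = np.zeros((N, P))
_, y_siso = siso_step(H, 0.9, rng.standard_normal(N),
                      rng.standard_normal(P), rng.standard_normal(N))
_, y_mimo = mimo_step(H, 0.9, rng.standard_normal((N, R)),
                      rng.standard_normal((R, P)), rng.standard_normal(N))
assert y_siso.shape == y_mimo.shape == (P,)
```

Per token, both variants read and write the same N x P state; the MIMO step just fits a higher-rank update into that traffic, which is how the release can claim quality gains without added decode latency.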

What the benchmarks show

The release claims are strongest at the 1.5B scale. Together's launch thread says Mamba-3 has the fastest prefill-plus-decode there and outperforms Mamba-2, GDN, and Llama 3.2 1B. The linked paper summary adds a concrete delta: versus Gated DeltaNet at 1.5B, Mamba-3 gains 0.6 points in downstream accuracy, and the MIMO variant adds another 1.2 points.

The shared table in the results screenshot breaks that out: Mamba-3-SISO-1.5B posts a 56.4 average accuracy, while Mamba-3-MIMO-1.5B reaches 57.6, alongside stronger scores on LAMBADA, HellaSwag, PIQA, and ARC-C. That supports the release's core pitch: not just a faster linear model, but a higher-quality one that keeps decode speed intact.

A practitioner reaction from Cedric Chee's post sums up the engineering angle: the story looks less like replacing Transformers everywhere and more like trying to "win the deployment bottleneck."

