Release: March 18, 2026

Mamba-3 updates its inference path with MIMO decode and new state updates

New write-ups on Mamba-3 add more detail on its MIMO decode path, discretization changes, and complex-valued state updates. That gives infra teams a clearer basis for testing state-space models as inference-efficient alternatives in long-sequence or agent-heavy systems.

TL;DR

Cartesia's launch post positions Mamba-3 as an "inference-first" state-space model, reflecting a shift from training-optimized linear models toward architectures tuned for decode-heavy production workloads.
The most concrete implementation change in the paper summary is MIMO (multi-input, multi-output) decode: the recurrence's vector outer product is replaced with a matrix multiplication to raise hardware utilization, and the summary claims up to 4x more decode FLOPs "without increasing latency."
The same summary table highlights two other architectural changes: an exponential-trapezoidal discretization rule that replaces Mamba-2's simpler update, and complex-valued state updates via data-dependent RoPE for stronger state tracking.
Together's thread adds the deployment angle: agent workloads and inference-heavy RL rollouts are making decode speed more important. It also claims that, at the 1.5B scale, Mamba-3 is the fastest on combined prefill+decode against Mamba-2, Gated DeltaNet, and Llama-3.2-1B.
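To make the MIMO point concrete, here is a minimal pure-Python sketch of a single decode step, contrasting a rank-1 (outer-product) state update with a rank-R (matrix-multiply) one. The shapes, the scalar decay `a`, and the function names are assumptions for illustration, not Mamba-3's actual kernels or parameterization. The idea it shows is that the state stays N x P in both cases, while the work done per state entry scales with R, which is roughly why a decode step that is limited by memory traffic could absorb extra FLOPs.

```python
def siso_step(h, a, b, x):
    """Rank-1 update: h <- a*h + outer(b, x). h is N x P, b has N entries, x has P."""
    return [[a * h[i][j] + b[i] * x[j] for j in range(len(x))]
            for i in range(len(b))]


def mimo_step(h, a, B, X):
    """Rank-R update: h <- a*h + B @ X^T. B is N x R, X is P x R (hypothetical shapes)."""
    r = len(B[0])
    return [[a * h[i][j] + sum(B[i][k] * X[j][k] for k in range(r))
             for j in range(len(X))]
            for i in range(len(B))]


def input_term_macs(n, p, r):
    """Multiply-adds in the input term of one step; the state size n*p does not depend on r."""
    return n * p * r
```

With R = 1 the two updates coincide. With R = 4 (an illustrative value, chosen only to mirror the "up to 4x" figure) the input term does four times the multiply-adds over the same N x P state.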

What changed in Mamba-3?

Cartesia

@cartesia

Mamba-3 is out! 🐍 SSMs marked a major advance for the efficiency of modern LLMs. Mamba-3 takes the next step, shaping SSMs for a world where AI workloads are increasingly dominated by inference. Read about it on the Cartesia blog: blog.cartesia.ai/p/mamba-3

6:39 PM · Mar 18, 2026

Cartesia's launch post frames Mamba-3 as a redesign for the part of the stack that now dominates cost and latency: inference. The linked write-up credits earlier SSM advances with improving efficiency, but says Mamba-3 shapes the model for "a world where AI workloads are increasingly dominated by inference," not just for training throughput.

The clearest architectural deltas come from the paper summary. It describes a new exponential-trapezoidal discretization with a three-term recurrence that is "more expressive" than Mamba-2's exponential-Euler update, plus complex-valued state updates through data-dependent RoPE. In the summary's wording, that enables "rotational state dynamics" and improves tasks that require persistent state tracking, including parity-style problems that weaker linear dynamics struggle with.
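The two ideas in this paragraph can be sketched on a scalar state. The coefficient form below is an assumption: it is the generic trapezoid-style approximation of the input integral in h' = a*h + b*x, with a hypothetical mixing weight `lam`, and the source only says the recurrence has three terms. The parity function is likewise a toy, not Mamba-3's data-dependent RoPE. It shows only why a rotating complex state can track a flip-flop quantity that a state with positive real decay cannot.

```python
import cmath
import math


def euler_step(h, x, a, b, dt):
    """Exponential-Euler style update: decay the state, add the current input."""
    return math.exp(dt * a) * h + dt * b * x


def trapezoid_step(h, x_prev, x, a, b_prev, b, dt, lam=0.5):
    """Three-term update: decayed state + weighted previous input + weighted current input."""
    alpha = math.exp(dt * a)
    beta = (1.0 - lam) * dt * alpha  # weight on the previous step's input
    gamma = lam * dt                 # weight on the current input
    return alpha * h + beta * b_prev * x_prev + gamma * b * x


def parity_by_rotation(bits):
    """Track parity with a unit-modulus complex state rotated by pi for every 1-bit."""
    h = 1 + 0j
    for bit in bits:
        h *= cmath.exp(1j * math.pi * bit)  # input-dependent rotation angle
    return 0 if h.real > 0 else 1
```

Setting `lam=1.0` drops the previous-input term and recovers the two-term Euler-style update, which is one way to see the three-term rule as a strict generalization. In the parity toy, each 1-bit flips the sign of the state's real part, so reading the sign at the end gives the parity of the sequence.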

Why this matters for serving and evals

Together AI

@togethercompute

Day 2 of #NVIDIAGTC brought the heat — literally 📷 Hot wings, a lightning talk from 5C, Tokens After Hours with @Metronome Webhook, and our team met Jensen. Not a bad Tuesday. Day 3 kicks off soon — Together Trivia, cool prizes, Booth #1213. Come ready to booth #1213. 📷📷