Together introduced Mamba-3 and open-sourced kernels for a new MIMO state-space variant that targets decode efficiency and beats Mamba-2, GDN, and Llama 3.2 1B at 1.5B scale. Test it when deployment speed matters more than chasing another generic Transformer baseline.

Mamba-3 is the next Mamba release, but this one is explicitly tuned for deployment-time inference rather than training speed. Together's launch thread frames the problem as decode becoming memory-bound in agentic workloads and inference-heavy RL rollouts, while the linked blog post says Mamba-2 had focused more on training efficiency.
The main architectural change is MIMO, short for multi-input, multi-output. According to Together's paper and repo post, the model swaps the recurrence from a vector outer-product to a matrix multiply, aiming for a "stronger model at the same decode speed." The same post says kernels are open-sourced, with implementations using Triton, TileLang, and CuTe DSL in the public Mamba repository.
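To make the SISO-to-MIMO distinction concrete, here is a minimal numpy sketch of the two state-update shapes. All names, dimensions, and the rank parameter `r` are illustrative assumptions for exposition, not Together's actual kernel code.

```python
import numpy as np

# Assumed shapes for illustration: state (d_state, d_head), MIMO rank r.
d_state, d_head, r = 16, 8, 4
rng = np.random.default_rng(0)

# SISO-style step (Mamba-2 flavor): a rank-1 outer-product update.
def siso_step(H, a, B, x):
    # H: (d_state, d_head), B: (d_state,), x: (d_head,)
    return a * H + np.outer(B, x)

# MIMO-style step (Mamba-3 flavor): a rank-r update via a matrix multiply,
# writing r input channels into the state in one step.
def mimo_step(H, a, B, X):
    # H: (d_state, d_head), B: (d_state, r), X: (r, d_head)
    return a * H + B @ X

H = np.zeros((d_state, d_head))
a = 0.9
H1 = siso_step(H, a, rng.standard_normal(d_state), rng.standard_normal(d_head))
H2 = mimo_step(H, a, rng.standard_normal((d_state, r)),
               rng.standard_normal((r, d_head)))
# The outer product is exactly the r == 1 case of the matrix multiply,
# so the MIMO update strictly generalizes SISO at the same state size.
```

The state that must be held per decode step has the same shape either way, which is the intuition behind the "stronger model at the same decode speed" framing.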
The release claims are strongest at the 1.5B scale. Together's launch thread says Mamba-3 has the fastest prefill plus decode there and outperforms Mamba-2, GDN, and Llama-3.2-1B. The linked paper summary adds a concrete delta: versus Gated DeltaNet at 1.5B, Mamba-3 gains 0.6 points in downstream accuracy, and the MIMO variant adds another 1.2 points.
The shared table in the results screenshot breaks that out: Mamba-3-SISO-1.5B posts 56.4 average accuracy, while Mamba-3-MIMO-1.5B reaches 57.6, alongside stronger scores on Lambada accuracy, HellaSwag, PIQA, and ARC-C. That supports the release's core pitch: not just a faster linear model, but a higher-quality one that keeps decode speed intact.
A practitioner reaction from Cedric Chee's post sums up the engineering angle: the story looks less like replacing Transformers everywhere and more like trying to "win the deployment bottleneck."
Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.
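The memory-tiering idea behind SSD-streamed experts can be sketched as a small least-recently-used cache in front of a storage loader. This is a toy illustration of the general technique, assuming an LRU policy; the cache policy, the `load_fn` interface, and the capacity are assumptions, not Flash-MoE's actual implementation.

```python
from collections import OrderedDict
import numpy as np

class ExpertCache:
    """Keep a few expert weight tensors resident in RAM; stream the rest."""

    def __init__(self, load_fn, capacity=4):
        self.load_fn = load_fn    # hypothetical loader, e.g. np.load from SSD
        self.capacity = capacity  # experts kept resident at once
        self.cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as recently used
            return self.cache[expert_id]
        weights = self.load_fn(expert_id)      # miss: stream from storage
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return weights

# Only the router's top-k experts per token need to be resident, so a small
# cache can serve a model far larger than device RAM (at SSD-bound speed).
loads = []
def fake_load(eid):
    loads.append(eid)
    return np.zeros(8) + eid

cache = ExpertCache(fake_load, capacity=2)
for eid in [0, 1, 0, 2, 1]:
    cache.get(eid)
print(loads)  # [0, 1, 2, 1] — the second access to expert 0 hit the cache
```

The 0.6 tokens per second figure is consistent with this picture: each decode step pays for whichever selected experts miss the cache and must come off the SSD.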
Release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
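The core trick behind n-gram candidate filtering can be shown in a few lines: index every trigram of each file, intersect posting lists to prune files, and run the real match only on survivors. This sketch is an assumption-laden simplification (literal queries only, no Bloom filters, invented class and method names), not Cursor's implementation, which also extracts required literals from full regex syntax.

```python
import re
from collections import defaultdict

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of file ids
        self.files = {}                   # file id -> contents

    def add(self, file_id, text):
        self.files[file_id] = text
        for g in trigrams(text):
            self.postings[g].add(file_id)

    def search(self, literal):
        # A matching file must contain every trigram of the query, so
        # intersecting posting lists prunes most files before any scan.
        grams = trigrams(literal)
        if not grams:
            candidates = set(self.files)
        else:
            candidates = set.intersection(*(self.postings[g] for g in grams))
        # Verify with a real match only on the surviving candidates.
        pattern = re.compile(re.escape(literal))
        return sorted(f for f in candidates if pattern.search(self.files[f]))

idx = TrigramIndex()
idx.add("a.py", "def instant_grep(query): ...")
idx.add("b.py", "print('hello world')")
print(idx.search("instant_grep"))  # prints ['a.py']
```

The speedup comes from replacing a full scan of every file with set intersections over small posting lists; Bloom filters serve the same pruning role with a fixed memory budget.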
Breaking: ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
Breaking: Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
Introducing Mamba-3 🐍 Inference speeds are more important than ever, driven by the rise in agents and inference-heavy RL
Mamba-3 just got released

The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities