A pure C and Metal engine streams 209GB of MoE weights from SSD and reports tool-calling support in 4-bit mode on a laptop-class Mac. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.

Posted by mft_
The project page describes Flash-MoE as a pure C and Metal inference engine for Qwen3.5-397B-A17B, with the headline result of "4.4+ tokens/second" on a MacBook Pro M3 Max with 48GB RAM while streaming 209GB of weights from SSD. The repo also claims tool-calling support in 4-bit mode and includes build instructions, a paper with technical details, and performance tables.
The engineering interest is the memory strategy. As the Hacker News summary frames it, this is a concrete test of SSD-backed expert streaming, OS page-cache behavior, and how far MoE offload can be pushed before bandwidth becomes the bottleneck.
Thread discussion highlights:
- tarruda on alternative Qwen3.5-397B quants: "excellent ~2.5 BPW quants available that make it viable for 128G devices... great success (~20 t/s) running it on a M1 Ultra... included lm-evaluation-harness results"
- mkw on a follow-on implementation: "I took a stab at leveraging Dan's work and making it more practical: https://github.com/matt-k-wong/mlx-flash ... supports 4bit quantization, hybrid streaming (Disk + ram), arbitrary model compatibility"
- daemonologist on offload controls in existing engines: "llama.cpp ... vllm ... sglang ... have extensive support for doing this and controlling exactly which weights end up where ... Even with a MoE model ... you do end up quite bandwidth constrained"
The discussion thread adds more useful signal than cheerleading. One commenter reports "excellent ~2.5 BPW quants" that make the model viable on 128GB machines and claims "~20 t/s" on an M1 Ultra with lm-eval results, while another follow-on implementation adds "4bit quantization," "hybrid streaming (Disk + ram)," and broader model compatibility.
The same thread also pushes back on the benchmark's limits. According to the Hacker News summary, one critic says the setup used "2-bit quantization" and reduced experts per token from 10 to 4, calling that "particularly misleading" and arguing that 5-6 tok/s is "very slow." Another commenter notes that llama.cpp, vLLM, and sglang already expose detailed offload controls, and that even with MoE routing you still become "quite bandwidth constrained." The result is a useful benchmark for local-serving experiments, but not evidence that consumer laptops have escaped the usual quality-throughput tradeoffs.
Relevant as a case study in pushing large MoE inference onto limited-memory hardware. The useful takeaways are around quantization quality, expert streaming, bandwidth constraints, mmap/page behavior, and how this compares with existing offload support in llama.cpp, vLLM, and sglang.
Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.
release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
breaking: ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
breaking: Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.