Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.

Posted by anemll
Anemll's Twitter post announces running a 400B parameter model on an iPhone at 0.6 tokens per second, crediting @danveloper, @alexintosh, @danpacary, and @anemll. The post includes a demonstration video and links to a GitHub repo (github.com/Anemll/flash-moe) for the implementation, which uses techniques like a giant KV cache and SSD streaming.
Anemll’s iPhone demo claims “Running 400B model on iPhone! 0.6 t/s” and ties it back to the Flash-MoE codebase. The post credits multiple collaborators and says the implementation uses “giant KV cache and SSD streaming,” extending the earlier laptop work rather than introducing a separate mobile stack.
The earlier Flash-MoE repo is more specific about the model family and engine: Qwen3.5-397B-A17B, a pure C/Metal inference path, and SSD-backed expert streaming on Apple hardware. Simon Willison’s thread summary captured the engineering leap cleanly: you can run “enormous Mixture-of-Experts” models without fitting the full model in RAM by streaming only the subset of experts needed for each generated token.
Posted by mft_
Flash-MoE is a pure C/Metal inference engine that runs the Qwen3.5-397B-A17B Mixture-of-Experts model (397B parameters, 209GB at 4-bit) on a MacBook Pro M3 Max with 48GB RAM at over 4.4 tokens/second, including tool calling. It streams expert weights from SSD on demand using parallel pread, with non-expert weights mmap'd (5.5GB). Built in 24 hours with hand-tuned Metal shaders, no Python or frameworks. Includes interactive chat, benchmarks, and optimizations inspired by Apple's LLM in a Flash.
The core trick is explicit storage-to-memory tiering. The project repo says the 397B Qwen3.5 MoE is 209GB at 4-bit, but Flash-MoE avoids keeping all of that resident: non-expert weights are mmap’d at 5.5GB, while expert weights are fetched from SSD with parallel pread as tokens are generated. The same writeup says the laptop version was built in “24 hours” with hand-tuned Metal shaders and no Python frameworks.
That framing also explains why direct throughput comparisons are messy. The HN discussion notes this is “not an ordinary LLM benchmark” because the system is streaming weights from storage, so comparing it to fully resident local models can be misleading. Simon Willison’s earlier Mac hardware thread adds a useful MoE lens here: even trillion-parameter models can become plausible on Apple hardware when the active parameter count is much smaller than the total model size.
Posted by mft_
Today’s new discussion mostly zooms in on performance interpretation rather than introducing new project facts. One commenter highlights a 38% speedup from removing a 9.8GB Metal LRU cache and asks whether the remaining gap to SSD-bandwidth limits is compute-bound or still I/O-bound; others debate whether the model is actually usable at 4–6 tok/s versus the much higher thresholds they consider practical for real workflows. There’s also a small but important framing correction: one commenter notes this is not a normal local-model benchmark, because the engine is streaming expert weights from storage, so comparisons against fully resident local LLMs are not apples-to-apples.
The most concrete new performance datapoint comes from the latest HN discussion: one commenter highlighted “removing the 9.8 GB Metal LRU cache” for a 38% speedup, then asked why the system still lands around 5.7 tok/s against an 18.6 tok/s theoretical ceiling. That shifts the conversation from novelty to bottleneck analysis: cache overhead, compute saturation, and I/O scheduling all remain unresolved.
The usability debate is even harsher on the phone. The HN discussion quotes one commenter arguing that “under 20t/s” is “unusable in any real workflow,” and another saying that at 6 tok/s a mistake can cost “20-30 minutes.” The iPhone thread keeps returning to “the heat problem,” limited RAM for “any reasonable amount of context,” and the fact that Apple’s unified memory helps but does not remove those constraints. The engineering result is real; the deployment envelope still looks narrow.
Posted by anemll
The useful angle is that this is a proof-of-concept for extreme on-device inference: a 400B model, SSD streaming, and large KV-cache tricks on a phone. The thread’s main engineering takeaways are the hard limits—RAM bandwidth, context capacity, heat, and battery—plus the reminder that demos can be technically valid without being deployable.
Posted by anemll
Thread discussion highlights:
- firstbabylonian on the Apple LLM-in-a-flash precedent: “Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash'?”
- Aurornis on the unified-memory and model-size tradeoff: “Apple’s unified memory architecture plays a huge part in this... There is already a smaller model in this series that fits nicely into the iPhone... The smaller the model, the less accurate and capable it is.”
- johnwhitman on thermals and internal tooling: “The heat problem is going to be the real constraint here... even those make my MacBook sound like a jet engine after twenty minutes.”
Posted by mft_
The key takeaway is a memory-tiering and inference-engine experiment: Flash-MoE streams experts from SSD, and today’s discussion focuses on whether the bottleneck is cache overhead, GPU compute, or I/O scheduling. The thread also surfaces a practical engineering question: how meaningful is a benchmark when it relies on nonstandard streaming and lands at ~4–6 tok/s?
A pure C and Metal engine streams 209GB of MoE weights from SSD and reports tool-calling support in 4-bit mode on a laptop-class Mac. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.
Release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
Breaking: ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
Breaking: Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.