Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.

Posted by anemll
Anemll's Twitter post announces running a 400B parameter model on an iPhone at 0.6 tokens per second, crediting @danveloper, @alexintosh, @danpacary, and @anemll. The post includes a demonstration video and links to a GitHub repo (github.com/Anemll/flash-moe) for the implementation, which uses techniques like a giant KV cache and SSD streaming.
Anemll’s iPhone demo claims “Running 400B model on iPhone! 0.6 t/s” and ties it back to the Flash-MoE codebase. The post credits multiple collaborators and says the implementation uses “giant KV cache and SSD streaming,” extending the earlier laptop work rather than introducing a separate mobile stack.
The earlier Flash-MoE repo is more specific about the model family and engine: Qwen3.5-397B-A17B, a pure C/Metal inference path, and SSD-backed expert streaming on Apple hardware. Simon Willison’s thread summary captured the engineering leap cleanly: you can run “enormous Mixture-of-Experts” models without fitting the full model in RAM by streaming only the subset of experts needed for each generated token.
Posted by mft_
Flash-MoE is a pure C/Metal inference engine that runs the Qwen3.5-397B-A17B Mixture-of-Experts model (397B parameters, 209GB at 4-bit) on a MacBook Pro M3 Max with 48GB RAM at over 4.4 tokens/second, including tool calling. It streams expert weights from SSD on demand using parallel pread, with non-expert weights mmap'd (5.5GB). Built in 24 hours with hand-tuned Metal shaders, no Python or frameworks. Includes interactive chat, benchmarks, and optimizations inspired by Apple's LLM in a Flash.
The core trick is explicit storage-to-memory tiering. The project repo says the 397B Qwen3.5 MoE is 209GB at 4-bit, but Flash-MoE avoids keeping all of that resident: non-expert weights are mmap’d at 5.5GB, while expert weights are fetched from SSD with parallel pread as tokens are generated. The same writeup says the laptop version was built in “24 hours” with hand-tuned Metal shaders and no Python frameworks.
That framing also explains why direct throughput comparisons are messy. The HN discussion notes this is “not an ordinary LLM benchmark” because the system is streaming weights from storage, so comparing it to fully resident local models can be misleading. Simon Willison’s earlier Mac hardware thread adds a useful MoE lens here: even trillion-parameter models can become plausible on Apple hardware when the active parameter count is much smaller than the total model size.
Posted by mft_
Today’s new discussion mostly zooms in on performance interpretation rather than introducing new project facts. One commenter highlights a 38% speedup from removing a 9.8GB Metal LRU cache and asks whether the remaining gap to SSD-bandwidth limits is compute-bound or still I/O-bound; others debate whether the model is actually usable at 4–6 tok/s versus the much higher thresholds they consider practical for real workflows. There’s also a small but important framing correction: one commenter notes this is not a normal local-model benchmark, because the engine is streaming expert weights from storage, so comparisons against fully resident local LLMs are not apples-to-apples.
The most concrete new performance datapoint comes from the latest HN discussion: one commenter highlighted “removing the 9.8 GB Metal LRU cache” for a 38% speedup, then asked why the system still lands around 5.7 tok/s against an 18.6 tok/s theoretical ceiling. That shifts the conversation from novelty to bottleneck analysis: cache overhead, compute saturation, and I/O scheduling all remain unresolved.
The usability debate is even harsher on the phone. The HN discussion quotes one commenter arguing that “under 20t/s” is “unusable in any real workflow,” and another saying that at 6 tok/s a mistake can cost “20-30 minutes.” The iPhone thread keeps returning to “the heat problem,” limited RAM for “any reasonable amount of context,” and the fact that Apple’s unified memory helps but does not remove those constraints. The engineering result is real; the deployment envelope still looks narrow.
Posted by anemll
The useful angle is that this is a proof-of-concept for extreme on-device inference: a 400B model, SSD streaming, and large KV-cache tricks on a phone. The thread’s main engineering takeaways are the hard limits—RAM bandwidth, context capacity, heat, and battery—plus the reminder that demos can be technically valid without being deployable.
Posted by anemll
Thread discussion highlights:
- firstbabylonian on the Apple LLM-in-a-flash precedent: “Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash'?”
- Aurornis on the unified-memory and model-size tradeoff: “Apple’s unified memory architecture plays a huge part in this... There is already a smaller model in this series that fits nicely into the iPhone... The smaller the model, the less accurate and capable it is.”
- johnwhitman on thermals and internal tooling: “The heat problem is going to be the real constraint here... even those make my MacBook sound like a jet engine after twenty minutes.”
Posted by mft_
The key takeaway is a memory-tiering and inference-engine experiment: Flash-MoE streams experts from SSD, and today’s discussion focuses on whether the bottleneck is cache overhead, GPU compute, or I/O scheduling. The thread also surfaces a practical engineering question: how meaningful is a benchmark when it relies on nonstandard streaming and lands at ~4–6 tok/s?
A pure C and Metal engine streams 209GB of MoE weights from SSD and reports tool-calling support in 4-bit mode on a laptop-class Mac. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.
Release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
Breaking: ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
Breaking: Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.