A pure C and Metal engine streams 209GB of MoE weights from SSD and reports tool-calling support in 4-bit mode on a laptop-class Mac. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.

Posted by mft_
The project page describes Flash-MoE as a pure C and Metal inference engine for Qwen3.5-397B-A17B, with the headline result of "4.4+ tokens/second" on a MacBook Pro M3 Max with 48GB RAM while streaming 209GB of weights from SSD. The repo also claims tool-calling support in 4-bit mode and includes build instructions, a paper with technical details, and performance tables.
The engineering interest is the memory strategy. As the Hacker News summary frames it, this is a concrete test of SSD-backed expert streaming, OS page-cache behavior, and how far MoE offload can be pushed before bandwidth becomes the bottleneck.
Thread discussion highlights:
- tarruda on alternative Qwen3.5-397B quants: "excellent ~2.5 BPW quants available that make it viable for 128G devices... great success (~20 t/s) running it on a M1 Ultra... included lm-evaluation-harness results"
- mkw on a follow-on implementation: "I took a stab at leveraging Dan's work and making it more practical: https://github.com/matt-k-wong/mlx-flash ... supports 4bit quantization, hybrid streaming (Disk + ram), arbitrary model compatibility"
- daemonologist on offload controls in existing engines: "llama.cpp ... vllm ... sglang ... have extensive support for doing this and controlling exactly which weights end up where ... Even with a MoE model ... you do end up quite bandwidth constrained"
The discussion thread adds more useful signal than cheerleading. One commenter reports "excellent ~2.5 BPW quants" that make the model viable on 128GB machines and claims "~20 t/s" on an M1 Ultra with lm-eval results, while another follow-on implementation adds "4bit quantization," "hybrid streaming (Disk + ram)," and broader model compatibility.
The same thread also pushes back on the benchmark's limits. According to the Hacker News summary, one critic says the setup used "2-bit quantization" and reduced experts per token from 10 to 4, calling that "particularly misleading" and arguing that 5-6 tok/s is "very slow." Another commenter notes that llama.cpp, vLLM, and sglang already expose detailed offload controls, and that even with MoE routing you still become "quite bandwidth constrained." The result is a useful benchmark for local-serving experiments, but not evidence that consumer laptops have escaped the usual quality-throughput tradeoffs.
Relevant as a case study in pushing large MoE inference onto limited-memory hardware. The useful takeaways are around quantization quality, expert streaming, bandwidth constraints, mmap/page behavior, and how this compares with existing offload support in llama.cpp, vLLM, and sglang.
Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.
release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
breaking: ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
breaking: Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.