Flash-MoE claims Qwen3.5-397B runs on iPhone at 0.6 tokens/sec via SSD streaming
Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target: throughput is low, and thermal and context headroom are tight.

TL;DR
- Flash-MoE’s latest demo pushes the same SSD-streaming idea from laptop to phone: Anemll’s iPhone demo says a roughly 400B model runs on an iPhone at 0.6 tokens/sec, while the earlier Flash-MoE repo documented Qwen3.5-397B-A17B running on an M3 Max MacBook Pro.
- The implementation detail that matters is memory tiering, not mobile UX: the project repo says expert weights are streamed from SSD on demand, with only non-expert weights resident, and the HN summary frames the phone run as an “extreme on-device inference” proof of concept.
- Fresh performance discussion added one concrete optimization result: commenters report that removing a 9.8GB Metal LRU cache produced a 38% speedup, yet a large gap remains between observed throughput and the theoretical ceiling.
- The practical limits did not shrink with the demo: the iPhone discussion and the HN summary both center on RAM bandwidth, context headroom, thermals, and battery, which is why the current result reads more like a systems prototype than a deployable mobile serving target.
What actually ran on the iPhone?
Running 400B model on iPhone! 0.6 t/s
546 upvotes · 253 comments
Anemll’s iPhone demo claims “Running 400B model on iPhone! 0.6 t/s” and ties it back to the Flash-MoE codebase. The post credits multiple collaborators and says the implementation uses “giant KV cache and SSD streaming,” extending the earlier laptop work rather than introducing a separate mobile stack.
The earlier Flash-MoE repo is more specific about the model family and engine: Qwen3.5-397B-A17B, a pure C/Metal inference path, and SSD-backed expert streaming on Apple hardware. Simon Willison’s thread summary captured the engineering leap cleanly: you can run “enormous Mixture-of-Experts” models without fitting the full model in RAM by streaming only the subset of experts needed for each generated token.
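To make “streaming only the subset of experts needed for each generated token” concrete, here is a minimal C sketch of the idea. The expert count, block size, file name, and the trivial lazy cache are all hypothetical, not taken from Flash-MoE or Anemll: the router’s scores pick a handful of experts, and only the ones not already in RAM are read off the SSD.

```c
/* Hypothetical sketch of per-token expert streaming (not Flash-MoE's code).
 * For each generated token, only the experts the router selects are read
 * from the model file on SSD; everything else stays on disk. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum { N_EXPERTS = 128,             /* illustrative experts per MoE layer     */
       TOP_K = 8,                   /* illustrative active experts per token  */
       EXPERT_BYTES = 40 << 20 };   /* illustrative ~40 MB per 4-bit expert   */

static uint8_t *cache[N_EXPERTS];   /* lazily filled; real code would bound it */

/* Pick the TOP_K highest-scoring experts (naive selection, fine for a sketch). */
static void top_k(const float *scores, int *out) {
    float s[N_EXPERTS];
    memcpy(s, scores, sizeof s);
    for (int k = 0; k < TOP_K; k++) {
        int best = 0;
        for (int e = 1; e < N_EXPERTS; e++)
            if (s[e] > s[best]) best = e;
        out[k] = best;
        s[best] = -1e30f;           /* exclude from the next round */
    }
}

/* Return an expert's weights, streaming them from SSD only on a cache miss. */
static uint8_t *get_expert(int fd, int id) {
    if (!cache[id]) {
        cache[id] = malloc(EXPERT_BYTES);
        if (!cache[id]) { perror("malloc"); exit(1); }
        off_t off = (off_t)id * EXPERT_BYTES;      /* expert's file offset */
        if (pread(fd, cache[id], EXPERT_BYTES, off) != (ssize_t)EXPERT_BYTES) {
            perror("pread"); exit(1);
        }
    }
    return cache[id];
}

int main(void) {
    int fd = open("experts.bin", O_RDONLY);        /* hypothetical weight file */
    if (fd < 0) { perror("open"); return 1; }

    float router_scores[N_EXPERTS];                /* would come from the router */
    for (int e = 0; e < N_EXPERTS; e++)
        router_scores[e] = (float)rand() / RAND_MAX;

    int picked[TOP_K];
    top_k(router_scores, picked);
    for (int k = 0; k < TOP_K; k++) {
        uint8_t *w = get_expert(fd, picked[k]);    /* only these touch the SSD */
        (void)w;                                   /* the expert FFN would run here */
    }
    close(fd);
    return 0;
}
```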
How does Flash-MoE fit a 397B-class MoE on Apple hardware?
Flash-MoE: Running a 397B Parameter Model on a Laptop
390 upvotes · 121 comments
The core trick is explicit storage-to-memory tiering. The project repo says the 397B Qwen3.5 MoE is 209GB at 4-bit, but Flash-MoE avoids keeping all of that resident: the 5.5GB of non-expert weights are mmap’d and stay resident, while expert weights are fetched from SSD with parallel pread calls as tokens are generated. The same writeup says the laptop version was built in “24 hours” with hand-tuned Metal shaders and no Python frameworks.
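A minimal sketch of that two-tier split, assuming a hypothetical file layout and sizes (none of this is Flash-MoE’s actual format): the non-expert weights are mmap’d once and left to demand paging, while the experts a token needs are pulled in with pread calls issued from parallel threads.

```c
/* Minimal sketch of the two-tier layout (illustrative, not Flash-MoE's format):
 * non-expert weights are mmap'd once and stay resident; the experts a token
 * needs are fetched with parallel pread calls. */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_BYTES (5500ull << 20)  /* ~5.5 GB resident non-expert weights */
#define EXPERT_BYTES (40ull << 20)    /* illustrative per-expert block size  */
#define N_FETCH 8                     /* experts fetched for this token      */

struct fetch { int fd; int expert_id; uint8_t *dst; };

/* One worker thread: read a single expert block at its file offset. */
static void *fetch_expert(void *arg) {
    struct fetch *f = arg;
    off_t off = (off_t)SHARED_BYTES + (off_t)f->expert_id * (off_t)EXPERT_BYTES;
    if (pread(f->fd, f->dst, EXPERT_BYTES, off) != (ssize_t)EXPERT_BYTES) {
        perror("pread"); exit(1);
    }
    return NULL;
}

int main(void) {
    int fd = open("model.bin", O_RDONLY);          /* hypothetical weight file */
    if (fd < 0) { perror("open"); return 1; }

    /* Tier 1: map the non-expert weights once; demand paging keeps them warm. */
    void *shared = mmap(NULL, SHARED_BYTES, PROT_READ, MAP_PRIVATE, fd, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    /* Tier 2: pull this token's experts off the SSD with parallel preads. */
    pthread_t tid[N_FETCH];
    struct fetch jobs[N_FETCH];
    for (int i = 0; i < N_FETCH; i++) {
        jobs[i] = (struct fetch){ fd, i, malloc(EXPERT_BYTES) };
        pthread_create(&tid[i], NULL, fetch_expert, &jobs[i]);
    }
    for (int i = 0; i < N_FETCH; i++) pthread_join(tid[i], NULL);

    /* Attention runs against `shared`; expert FFNs run against jobs[i].dst. */
    munmap(shared, SHARED_BYTES);
    close(fd);
    return 0;
}
```

The point of the split is that the full 209GB never has to be resident; only the shared weights plus the current token’s working set of experts do.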
That framing also explains why direct throughput comparisons are messy. The HN discussion notes this is “not an ordinary LLM benchmark” because the system is streaming weights from storage, so comparing it to fully resident local models can be misleading. Simon Willison’s earlier Mac hardware thread adds a useful MoE lens here: even trillion-parameter models can become plausible on Apple hardware when the active parameter count is much smaller than the total model size.
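Rough numbers make that lens concrete: 397B parameters at about half a byte each (4-bit) is roughly 200GB on disk, in line with the 209GB figure above, but an A17B configuration touches only about 17B of those parameters per token, roughly 8.5GB at the same precision, and only the expert portion of that which is not already sitting in RAM has to cross the SSD at all.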
Why engineers should read this as a prototype, not a product benchmark
Fresh discussion on Flash-MoE: Running a 397B Parameter Model on a Laptop
390 upvotes · 121 comments
The most concrete new performance datapoint is in the fresh HN delta: one commenter highlighted “removing the 9.8 GB Metal LRU cache” for a 38% speedup, then asked why the system still lands around 5.7 tok/s versus an 18.6 tok/s theoretical ceiling. That shifts the conversation from novelty to bottleneck analysis: cache overhead, compute saturation, and I/O scheduling still appear unresolved.
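The 5.7-versus-18.6 question is really a bandwidth-budget question. The sketch below shows the shape of that budget with made-up inputs (the SSD bandwidth, cache-miss bytes per token, and GPU time per token are illustrative assumptions, not measurements from the project); the quoted 18.6 tok/s ceiling presumably comes from a calculation of this form with the real numbers plugged in.

```c
/* Back-of-envelope token-rate ceiling when expert weights stream from SSD.
 * Every input below is an illustrative assumption, not a Flash-MoE number. */
#include <stdio.h>

int main(void) {
    double ssd_gb_per_s      = 6.0;   /* assumed sequential SSD read bandwidth    */
    double miss_gb_per_tok   = 0.32;  /* assumed cache-missing expert bytes/token */
    double compute_s_per_tok = 0.02;  /* assumed GPU time per token, seconds      */

    double io_s_per_tok = miss_gb_per_tok / ssd_gb_per_s;   /* ~0.053 s */

    /* If I/O fully overlaps compute, the slower of the two sets the ceiling;
       if they serialize, their sum does. Real systems land in between. */
    double overlap_ceiling = 1.0 / (io_s_per_tok > compute_s_per_tok
                                        ? io_s_per_tok : compute_s_per_tok);
    double serial_estimate = 1.0 / (io_s_per_tok + compute_s_per_tok);

    printf("overlapped ceiling:  %.1f tok/s\n", overlap_ceiling);  /* ~18.8 */
    printf("serialized estimate: %.1f tok/s\n", serial_estimate);  /* ~13.6 */
    return 0;
}
```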
The usability debate is even harsher on the phone. The HN discussion quotes one commenter arguing that “under 20t/s” is “unusable in any real workflow,” and another saying that at 6 tok/s a mistake can cost “20-30 minutes.” The iPhone thread, for its part, keeps returning to “the heat problem,” limited RAM for “any reasonable amount of context,” and the fact that Apple’s unified memory helps but does not remove those constraints. The engineering result is real; the deployment envelope still looks narrow.