A pure C and Metal engine streams 209GB of MoE weights from SSD on a laptop-class Mac and reports tool-calling support in 4-bit mode. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.

Posted by mft_
The project page describes Flash-MoE as a pure C and Metal inference engine for Qwen3.5-397B-A17B, with the headline result of "4.4+ tokens/second" on a MacBook Pro M3 Max with 48GB RAM while streaming 209GB of weights from SSD. The repo also claims tool-calling support in 4-bit mode and includes build instructions, a paper, and performance tables via the GitHub project.
The engineering interest is the memory strategy. As the Hacker News summary frames it, this is a concrete test of SSD-backed expert streaming, OS page-cache behavior, and how far MoE offload can be pushed before bandwidth becomes the bottleneck.
Thread discussion highlights:
- tarruda, on alternative Qwen3.5-397B quants: "excellent ~2.5 BPW quants available that make it viable for 128G devices... great success (~20 t/s) running it on a M1 Ultra... included lm-evaluation-harness results"
- mkw, on a follow-on implementation: "I took a stab at leveraging Dan's work and making it more practical: https://github.com/matt-k-wong/mlx-flash ... supports 4bit quantization, hybrid streaming (Disk + ram), arbitrary model compatibility"
- daemonologist, on offload controls in existing engines: "llama.cpp ... vllm ... sglang ... have extensive support for doing this and controlling exactly which weights end up where ... Even with a MoE model ... you do end up quite bandwidth constrained"
The discussion thread adds more useful signal than cheerleading. One commenter reports "excellent ~2.5 BPW quants" that make the model viable on 128GB machines and claims "~20 t/s" on an M1 Ultra with lm-eval results, while another follow-on implementation adds "4bit quantization," "hybrid streaming (Disk + ram)," and broader model compatibility.
The same thread also pushes back on the benchmark's limits. According to the Hacker News summary, one critic says the setup used "2-bit quantization" and reduced experts per token from 10 to 4, calling that "particularly misleading" and arguing that 5-6 tok/s is "very slow." Another commenter notes that llama.cpp, vLLM, and sglang already expose detailed offload controls, and that even with MoE routing you still become "quite bandwidth constrained." The result is a useful benchmark for local-serving experiments, but not evidence that consumer laptops have escaped the usual quality-throughput tradeoffs.
Relevant as a case study in pushing large MoE inference onto limited-memory hardware. The useful takeaways are around quantization quality, expert streaming, bandwidth constraints, mmap/page behavior, and how this compares with existing offload support in llama.cpp, vLLM, and sglang.
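For the comparison with existing engines, llama.cpp's tensor-placement controls are one concrete reference point. The invocation below is a sketch: the model filename is hypothetical, and the flag spelling should be verified against your build's `--help`, since these options have evolved across versions.

```shell
# Comparison point: llama.cpp tensor-placement controls (verify flag
# names against your build). Keep attention/shared weights on the GPU
# and pin the MoE expert tensors to CPU-side memory, which the OS can
# page from disk via mmap. Model filename is a placeholder.
llama-server -m qwen-moe-q4.gguf \
  --n-gpu-layers 99 \
  --override-tensor "\.ffn_.*_exps\.=CPU"
```

This is the "controlling exactly which weights end up where" capability the thread contrasts with Flash-MoE's all-in page-cache approach.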