AI Primer

Flash-MoE claims 4.4 tokens/sec on Qwen3.5-397B on 48GB M3 Max

A pure C and Metal engine streams 209GB of MoE weights from SSD and reports tool-calling support in 4-bit mode on a laptop-class Mac. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.


TL;DR

  • Flash-MoE's project page claims a pure C/Metal engine can run Qwen3.5-397B-A17B on a 48GB M3 Max MacBook Pro at "4.4+ tokens/second" while streaming a 209GB MoE model from SSD.
  • The implementation matters less as a product launch than as an inference case study: the Hacker News summary centers on expert streaming, quantization quality, bandwidth limits, mmap behavior, and OS page caching on laptop-class hardware.
  • The project page also says 4-bit mode supports tool calling, which makes this more than a one-token demo even if the reported speed is still modest.
  • The discussion thread quickly split between follow-on experiments like hybrid disk+RAM streaming and skepticism that lower-bit quants and reduced expert counts are good enough for "real work."

What shipped

Hacker News

Flash-MoE: Running a 397B Parameter Model on a Laptop

361 upvotes · 116 comments

The project page describes Flash-MoE as a pure C and Metal inference engine for Qwen3.5-397B-A17B, with the headline result of "4.4+ tokens/second" on a MacBook Pro M3 Max with 48GB RAM while streaming 209GB of weights from SSD. The GitHub repository also claims tool-calling support in 4-bit mode and includes build instructions, a paper, and performance tables.

The engineering interest is the memory strategy. As the Hacker News summary frames it, this is a concrete test of SSD-backed expert streaming, OS page-cache behavior, and how far MoE offload can be pushed before bandwidth becomes the bottleneck.
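
To make the streaming idea concrete, here is a minimal sketch of the general technique, not the Flash-MoE code: mmap the expert weight file read-only and let the OS page cache decide which expert blocks stay resident and which are re-read from SSD, with madvise hints to prefetch the experts the router selects. The file layout, expert size, and helper names below are illustrative assumptions.

    /* Sketch of SSD-backed expert streaming via mmap and the OS page cache.
       Constants and layout are assumptions for illustration, not Flash-MoE's. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define EXPERT_BYTES ((size_t)64 << 20)   /* assumed 64 MB per quantized expert */

    typedef struct {
        int      fd;
        uint8_t *base;     /* mmap'd view of the whole expert file */
        size_t   length;
    } expert_store;

    static int expert_store_open(expert_store *s, const char *path) {
        s->fd = open(path, O_RDONLY);
        if (s->fd < 0) return -1;

        struct stat st;
        if (fstat(s->fd, &st) < 0) return -1;
        s->length = (size_t)st.st_size;

        /* Map read-only; pages fault in on first touch and can be evicted by the
           kernel under memory pressure, so the 209GB file never needs to fit in RAM. */
        s->base = mmap(NULL, s->length, PROT_READ, MAP_PRIVATE, s->fd, 0);
        if (s->base == MAP_FAILED) return -1;

        /* Expert access is sparse, so discourage aggressive sequential readahead. */
        madvise(s->base, s->length, MADV_RANDOM);
        return 0;
    }

    /* Return a pointer to one expert's weights; hint the kernel to prefetch it so
       the SSD read can overlap with compute on the previous expert. */
    static const uint8_t *expert_fetch(expert_store *s, int expert_id) {
        size_t off = (size_t)expert_id * EXPERT_BYTES;
        if (off + EXPERT_BYTES > s->length) return NULL;
        madvise(s->base + off, EXPERT_BYTES, MADV_WILLNEED);
        return s->base + off;
    }

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s <experts.bin>\n", argv[0]); return 1; }
        expert_store store;
        if (expert_store_open(&store, argv[1]) != 0) { perror("expert_store_open"); return 1; }

        /* Touch a few experts as a stand-in for router-selected experts per token. */
        int picks[] = {3, 17, 42, 99};
        for (size_t i = 0; i < sizeof picks / sizeof picks[0]; i++) {
            const uint8_t *w = expert_fetch(&store, picks[i]);
            if (w) printf("expert %d first byte: %u\n", picks[i], w[0]);
        }
        munmap(store.base, store.length);
        close(store.fd);
        return 0;
    }

The design tradeoff this exposes is exactly the one the thread debates: whatever the page cache cannot hold has to come off the SSD on every token, which is where the bandwidth ceiling appears.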

What the thread adds on viability

Hacker News

Discussion around Flash-MoE: Running a 397B Parameter Model on a Laptop

361 upvotes · 116 comments

The discussion thread adds more useful signal than cheerleading. One commenter reports "excellent ~2.5 BPW quants" that make the model viable on 128GB machines and claims "~20 t/s" on an M1 Ultra, backed by lm-eval results, while a follow-on implementation adds "4bit quantization," "hybrid streaming (Disk + ram)," and broader model compatibility.

The same thread also pushes back on the benchmark's limits. According to the Hacker News summary, one critic says the setup used "2-bit quantization" and reduced experts per token from 10 to 4, calling that "particularly misleading" and arguing that 5-6 tok/s is "very slow." Another commenter notes that llama.cpp, vLLM, and sglang already expose detailed offload controls, and that even with MoE routing you still become "quite bandwidth constrained." The result is a useful benchmark for local-serving experiments, but not evidence that consumer laptops have escaped the usual quality-throughput tradeoffs.
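
A rough back-of-envelope check shows why the bandwidth objection bites. The numbers below are assumptions for illustration, not measurements from the project, and they simplify by treating all active weights as streamed rather than keeping shared layers resident:

    /* Illustrative bandwidth ceiling for streamed MoE decoding.
       All constants are assumptions, not Flash-MoE measurements. */
    #include <stdio.h>

    int main(void) {
        double active_params   = 17e9;  /* ~17B active params per token (A17B)       */
        double bits_per_weight = 4.0;   /* assumed 4-bit quantization                */
        double cache_hit_rate  = 0.5;   /* assumed fraction served from page cache   */
        double ssd_gbps        = 6.0;   /* assumed effective SSD read rate, GB/s     */

        double bytes_per_token = active_params * bits_per_weight / 8.0;
        double ssd_bytes       = bytes_per_token * (1.0 - cache_hit_rate);
        double tokens_per_sec  = ssd_gbps * 1e9 / ssd_bytes;

        printf("weights touched per token: %.1f GB\n", bytes_per_token / 1e9);
        printf("read from SSD per token:   %.1f GB\n", ssd_bytes / 1e9);
        printf("bandwidth-limited decode:  %.1f tokens/sec\n", tokens_per_sec);
        return 0;
    }

Under these assumptions a token touches roughly 8.5GB of weights and the SSD caps decoding at a low single-digit rate, which is why the levers the critics flag, dropping to 2-bit weights and routing to fewer experts per token, move the headline number so much: they shrink the per-token read several-fold before any compute optimization matters.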

