A pure C and Metal engine streams 209GB of MoE weights from SSD on a laptop-class Mac and reports tool-calling support in 4-bit mode. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.

Posted by mft_
The project page describes Flash-MoE as a pure C and Metal inference engine for Qwen3.5-397B-A17B, with the headline result of "4.4+ tokens/second" on a MacBook Pro M3 Max with 48GB RAM while streaming 209GB of weights from SSD. The repo also claims tool-calling support in 4-bit mode and includes build instructions, a paper, and performance tables via the GitHub project.
The engineering interest is the memory strategy. As the Hacker News summary frames it, this is a concrete test of SSD-backed expert streaming, OS page-cache behavior, and how far MoE offload can be pushed before bandwidth becomes the bottleneck.
Thread discussion highlights:
- tarruda, on alternative Qwen3.5-397B quants: "excellent ~2.5 BPW quants available that make it viable for 128G devices... great success (~20 t/s) running it on a M1 Ultra... included lm-evaluation-harness results"
- mkw, on a follow-on implementation: "I took a stab at leveraging Dan's work and making it more practical: https://github.com/matt-k-wong/mlx-flash ... supports 4bit quantization, hybrid streaming (Disk + ram), arbitrary model compatibility"
- daemonologist, on offload controls in existing engines: "llama.cpp ... vllm ... sglang ... have extensive support for doing this and controlling exactly which weights end up where ... Even with a MoE model ... you do end up quite bandwidth constrained"
The discussion thread adds more useful signal than cheerleading. One commenter reports "excellent ~2.5 BPW quants" that make the model viable on 128GB machines and claims "~20 t/s" on an M1 Ultra with lm-eval results, while another follow-on implementation adds "4bit quantization," "hybrid streaming (Disk + ram)," and broader model compatibility.
The same thread also pushes back on the benchmark's limits. According to the Hacker News summary, one critic says the setup used "2-bit quantization" and reduced experts per token from 10 to 4, calling that "particularly misleading" and arguing that 5-6 tok/s is "very slow." Another commenter notes that llama.cpp, vLLM, and sglang already expose detailed offload controls, and that even with MoE routing you still become "quite bandwidth constrained." The result is a useful benchmark for local-serving experiments, but not evidence that consumer laptops have escaped the usual quality-throughput tradeoffs.
Relevant as a case study in pushing large MoE inference onto limited-memory hardware. The useful takeaways are around quantization quality, expert streaming, bandwidth constraints, mmap/page behavior, and how this compares with existing offload support in llama.cpp, vLLM, and sglang.
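For the comparison with existing engines, llama.cpp's tensor-placement controls are one concrete reference point. The invocation below is a sketch: the model filename is hypothetical, and the flag spelling should be verified against your build's `--help`, since these options have evolved across versions.

```shell
# Comparison point: llama.cpp tensor-placement controls (verify flag
# names against your build). Keep attention/shared weights on the GPU
# and pin the MoE expert tensors to CPU-side memory, which the OS can
# page from disk via mmap. Model filename is a placeholder.
llama-server -m qwen-moe-q4.gguf \
  --n-gpu-layers 99 \
  --override-tensor "\.ffn_.*_exps\.=CPU"
```

This is the "controlling exactly which weights end up where" capability the thread contrasts with Flash-MoE's all-in page-cache approach.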