Moonshot claims 1.54x throughput and 64% lower P90 TTFT with cross-datacenter prefill
Moonshot says its Prefill-as-a-Service setup makes prefill/decode disaggregation practical across datacenters and mixed hardware by shrinking the KV cache with Kimi Linear. The paper reports 1.54× throughput and a 64% drop in P90 time-to-first-token; benchmark the approach independently before planning production adoption.

TL;DR
- Moonshot says its new Prefill-as-a-Service design pushes prefill/decode disaggregation across datacenters and mixed hardware, with Kimi_Moonshot's launch post crediting Kimi Linear's smaller KV cache as the enabler.
- The paper that crystalsssup surfaced reports 1.54× throughput and a 64% drop in P90 time-to-first-token on a 20× scaled-up Kimi Linear case study, per the arXiv HTML version.
- The system is not just "send KV cache over Ethernet": the paper link in crystalsssup's post describes selective offloading for long prompts, bandwidth-aware scheduling, and cache-aware placement as the mechanisms that keep cross-datacenter traffic manageable.
- teortaxesTex's commentary frames the paper as part of a broader China-side push to treat infrastructure and cluster design, not just chip upgrades, as the main lever for inference efficiency.
You can jump straight to the paper. teortaxesTex's screenshot is worth a look because it shows the actual deployment split: long requests route to a dedicated prefill cluster, short ones stay local, and a global KV cache manager sits between the two. The other useful detail is buried in the linked summary: Moonshot attributes the gains to scheduling and placement logic, not to KV compression alone.
Cross-datacenter prefill
Moonshot's core claim is simple: prefill and decode no longer need to stay inside one tightly coupled cluster if the model shrinks KV cache enough to make transfer practical. In Kimi_Moonshot's post, the company says Kimi Linear is the key enabler because it cuts KV cache size, which had been the main blocker for cross-datacenter PD disaggregation.
The quantitative headline is also narrow and concrete. According to that same launch post, a 20× scaled-up Kimi Linear deployment hit 1.54× throughput and cut P90 TTFT by 64%, while the arXiv HTML paper describes the setup as a cross-datacenter serving architecture rather than a single-cluster optimization.
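To see why KV cache size was the blocker, a back-of-envelope transfer calculation helps. All numbers below are illustrative assumptions for the sake of the arithmetic, not figures from the paper or from Moonshot:

```python
# Back-of-envelope: when does shipping KV cache across datacenters pay off?
# bytes_per_token and link_gbps below are illustrative assumptions only.

def kv_transfer_seconds(num_tokens: int, bytes_per_token: float, link_gbps: float) -> float:
    """Time to move a prompt's KV cache over a cross-datacenter link."""
    total_bytes = num_tokens * bytes_per_token
    link_bytes_per_s = link_gbps * 1e9 / 8  # gigabits -> bytes per second
    return total_bytes / link_bytes_per_s

# Hypothetical comparison: a full-attention model vs. a model with a much
# smaller per-token KV footprint (the role Kimi Linear plays in the claim),
# both on a shared 100 Gbps cross-datacenter Ethernet link.
full_attn = kv_transfer_seconds(128_000, bytes_per_token=300_000, link_gbps=100)
small_kv = kv_transfer_seconds(128_000, bytes_per_token=20_000, link_gbps=100)
print(f"full-attention KV: {full_attn:.1f}s, reduced KV: {small_kv:.1f}s")
# → full-attention KV: 3.1s, reduced KV: 0.2s
```

Under these made-up parameters, a multi-second transfer per long prompt would eat any prefill speedup, while a sub-second one leaves room for the scheduler to win. That is the shape of the argument, even if the real constants differ.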
Scheduling and placement
The paper summary linked in crystalsssup's post makes clear that the system has three moving parts:
- Selective offloading of long-context prefill to a separate cluster.
- Bandwidth-aware scheduling for cross-datacenter transfers.
- Cache-aware request placement to improve locality and utilization.
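The three mechanisms above compose into a single placement decision. The sketch below is an illustration of that composition, not Moonshot's implementation; the threshold, the `Cluster` fields, and the scoring weights are all assumptions:

```python
# Illustrative sketch of the three mechanisms the paper summary lists.
# Names, the threshold, and the scoring function are assumptions.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    free_link_gbps: float       # spare cross-DC bandwidth (bandwidth-aware)
    cached_prefix_tokens: dict  # prefix hash -> cached token count (cache-aware)

LONG_PROMPT_THRESHOLD = 8_192   # assumed cutoff for "long-context" prefill

def place_request(prompt_tokens: int, prefix_hash: str,
                  local: Cluster, remote_prefill: list[Cluster]) -> Cluster:
    # 1. Selective offloading: short prompts never leave the local PD cluster.
    if prompt_tokens < LONG_PROMPT_THRESHOLD:
        return local
    # 2 + 3. Among remote prefill clusters, prefer one that already holds a
    # cached prefix (cache-aware) and has spare link bandwidth (bandwidth-aware).
    def score(c: Cluster) -> float:
        cached = c.cached_prefix_tokens.get(prefix_hash, 0)
        return cached + 1_000 * c.free_link_gbps
    return max(remote_prefill, key=score, default=local)
```

For example, a 1,000-token request stays on the local cluster regardless of cache state, while a 20,000-token request goes to whichever remote cluster scores best on the combined cache-plus-bandwidth heuristic.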
That matters because the reported win is not generic "heterogeneous hardware": according to the arXiv HTML paper as summarized in the evidence link, naive heterogeneous baselines were still worse, while the full PrfaaS design delivered 54% higher throughput than homogeneous PD and 32% higher than naive heterogeneous setups.
Deployment topology
teortaxesTex's screenshot is the clearest visual for how Moonshot wants this deployed. Long requests above a threshold go to a compute-dense PrfaaS cluster for prefill, short requests stay in a local PD cluster, and the two sides are linked by cross-cluster Ethernet plus a global KV cache manager.
The same screenshot also shows both clusters keeping their own hybrid prefix cache pools and RDMA fabrics internally. That is a useful constraint: the paper is stretching PD disaggregation across datacenters, not replacing the fast local fabric inside each cluster.
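The global KV cache manager in that topology is essentially a directory: it knows which cluster holds KV for which prefix, and the answer determines whether a request hits its local pool, fetches over cross-cluster Ethernet, or recomputes. The interface below is an assumption for illustration; the paper does not publish this API:

```python
# Minimal sketch of a global KV cache manager sitting between the two
# clusters in the screenshot's topology. Class and method names are
# hypothetical; only the three outcomes are implied by the described design.

class GlobalKVCacheManager:
    def __init__(self):
        self._locations: dict[str, set[str]] = {}  # prefix hash -> cluster names

    def register(self, prefix_hash: str, cluster: str) -> None:
        """Record that a cluster now holds KV cache for this prefix."""
        self._locations.setdefault(prefix_hash, set()).add(cluster)

    def lookup(self, prefix_hash: str, requesting_cluster: str) -> str:
        """Decide how the requester should obtain the prefix's KV cache."""
        holders = self._locations.get(prefix_hash, set())
        if requesting_cluster in holders:
            return "local-hit"     # served from the cluster's own prefix pool
        if holders:
            return "remote-fetch"  # pulled over cross-cluster Ethernet
        return "recompute"         # no cached copy anywhere; run prefill

mgr = GlobalKVCacheManager()
mgr.register("sys-prompt-v1", "prfaas-dc")
print(mgr.lookup("sys-prompt-v1", "local-pd"))  # → remote-fetch
```

The split matches the screenshot's constraint: within a cluster, hits stay on the fast RDMA fabric; only misses that another cluster can serve cross the slower Ethernet link.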
Infrastructure-centric angle
The most interesting extra read is not in Moonshot's benchmark claim but in teortaxesTex's commentary, which connects the paper to a broader Chinese "infrastructure-empowered" computing strategy. The claim there is that if cutting-edge Nvidia refresh cycles are harder to rely on, more of the performance work shifts into making compute fungible across systems, fabrics, and datacenters.
That framing is commentary, not Moonshot's own wording. It does fit the paper's shape: a serving paper built around topology, schedulers, cache movement, and hardware mix, with the model architecture serving mainly to make those systems-level tricks cheap enough to try.