Breaking · March 15, 2026

Moonshot introduces Attention Residuals with 1.25x compute gains on Kimi Linear

Moonshot introduced Attention Residuals, replacing fixed depth-wise residual accumulation with learned lookbacks over earlier layers, and reports a 1.25x compute advantage on Kimi Linear. Try it as a drop-in lever for deeper stacks, but verify memory tradeoffs and downstream gains on your own architecture.

Inference Optimization Benchmarks

3 min read

TL;DR

In its launch thread, Moonshot says Attention Residuals swap fixed residual accumulation for learned, input-dependent lookbacks over earlier layers, turning residual paths into a selective retrieval mechanism instead of a uniform sum.
In Moonshot's scaling-law post, Attention Residuals show a consistent 1.25x compute advantage across model sizes, while the main announcement says inference overhead stays below 2%.
The company says the method was validated on Kimi Linear, a 48B-parameter sparse architecture with 3B activated parameters, where the announcement reports consistent downstream gains.
Moonshot's training-dynamics post claims the approach also curbs hidden-state magnitude growth and produces a more uniform gradient distribution across depth, pointing to training-stability benefits beyond raw efficiency.

What changed in the residual path?

Moonshot is pitching Attention Residuals as a drop-in replacement for standard residual stacks. Instead of every layer receiving the same accumulated history, the launch thread says each layer can "selectively retrieve past representations" through learned attention over prior layers.
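To make the contrast concrete, here is a minimal NumPy sketch of the two ways a layer's input could be assembled from earlier outputs. It is an illustration under assumptions, not Moonshot's formulation: the per-layer `query` vector, the dot-product scoring rule, and the function names are all hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def standard_residual_input(layer_outputs):
    """Fixed accumulation: the next layer sees the plain sum of earlier outputs."""
    return np.sum(layer_outputs, axis=0)

def attention_residual_input(layer_outputs, query):
    """Learned lookback: a softmax-weighted mix of earlier outputs.

    `query` stands in for a learned per-layer vector (an assumption).
    Scores are computed from the earlier states themselves, so the
    weights change with the input.
    """
    states = np.stack(layer_outputs)                     # (num_prior, d_model)
    scores = states @ query / np.sqrt(states.shape[1])   # one score per earlier layer
    weights = softmax(scores)
    return weights @ states, weights

rng = np.random.default_rng(0)
d_model = 8
prior = [rng.normal(size=d_model) for _ in range(4)]     # stand-in earlier outputs
query = rng.normal(size=d_model)

uniform = standard_residual_input(prior)
mixed, weights = attention_residual_input(prior, query)
```

In the fixed case every earlier output contributes with weight 1. In the lookback case the weights sum to 1 and can concentrate on whichever earlier state scores highest for the current input.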

That is the main architectural shift: residual connections stop being a fixed depth-wise recurrence and become an attention problem over model depth. Moonshot says Block AttnRes makes that usable at scale by grouping layers into compressed blocks, and a supporting explainer shows the intended tradeoff clearly: full lookback behavior with a blockwise memory reduction. The company links the full method description in its paper.
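The blockwise idea can be sketched the same way. The source only says layers are grouped into compressed blocks. Summing the outputs inside each block, the `block_size` parameter, and the `query` vector below are assumptions chosen for illustration.

```python
import numpy as np

def block_summaries(layer_outputs, block_size):
    """Compress consecutive layers into one vector per block (here: their sum)."""
    states = np.stack(layer_outputs)
    return np.stack([states[i:i + block_size].sum(axis=0)
                     for i in range(0, len(states), block_size)])

def block_lookback(layer_outputs, query, block_size):
    """Attend over block summaries instead of every individual layer output."""
    summaries = block_summaries(layer_outputs, block_size)
    scores = summaries @ query / np.sqrt(summaries.shape[1])
    scores -= scores.max()                               # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ summaries, summaries.shape[0]

rng = np.random.default_rng(0)
layers = [rng.normal(size=16) for _ in range(12)]        # stand-in layer outputs
query = rng.normal(size=16)                              # hypothetical learned query

mixed, stored = block_lookback(layers, query, block_size=4)
```

With 12 layers and a block size of 4, the lookback attends over 3 stored vectors instead of 12, which is the memory reduction the blockwise variant is meant to buy.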

A practitioner summary from Cedric Chee's note captures the practical pitch in plain language: old stacks "keep piling layers on top," while Attention Residuals let a layer "look back" and pull the most useful earlier state.

What are the reported gains, and what is still unproven?

Moonshot's benchmark claim is narrow but concrete. Its scaling-law post says experiments show a "consistent 1.25× compute advantage" across model sizes, and the main announcement adds that the latency cost is negligible at "<2%" during inference.
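One way to read those two numbers together, assuming "1.25x compute advantage" means a baseline needs 1.25 times the training compute to reach the same loss (the thread excerpt does not define the term). The budget and latency figures below are hypothetical placeholders.

```python
def equivalent_compute(baseline_flops, advantage=1.25):
    """Compute needed to match the baseline under the assumed reading of 'advantage'."""
    return baseline_flops / advantage

baseline = 1.0e21                      # hypothetical training budget in FLOPs
matched = equivalent_compute(baseline)
savings = 1 - matched / baseline       # fraction of training compute saved

base_latency_ms = 100.0                # hypothetical per-request latency
latency_upper_bound_ms = base_latency_ms * 1.02   # "<2%" overhead as an upper bound
```

Under that reading the training-side saving is about 20%, paid for with at most a 2% increase in inference latency.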

The training-side argument is that learned lookbacks improve optimization, not just efficiency. According to the training-dynamics post, AttnRes mitigates hidden-state growth and yields a "more uniform gradient distribution across depth." Moonshot says the method was validated on Kimi Linear, with 48B total parameters and 3B activated, and its launch thread reports "consistent downstream performance gains" there.
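A toy experiment shows why a normalized lookback can limit magnitude growth. It uses independent random vectors as stand-in layer outputs and a hypothetical scoring rule, so it illustrates the arithmetic only and says nothing about Moonshot's measured activations or gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, depth = 64, 32
outputs = [rng.normal(size=d_model) for _ in range(depth)]   # stand-in layer outputs

sum_norms, mix_norms = [], []
for l in range(1, depth + 1):
    states = np.stack(outputs[:l])
    # Fixed accumulation: plain sum of everything produced so far.
    sum_norms.append(np.linalg.norm(states.sum(axis=0)))
    # Softmax-weighted mix: a convex combination of the earlier states.
    scores = states @ outputs[l - 1] / np.sqrt(d_model)      # hypothetical scoring rule
    w = np.exp(scores - scores.max())
    w /= w.sum()
    mix_norms.append(np.linalg.norm(w @ states))

max_state_norm = max(np.linalg.norm(o) for o in outputs)
```

The norm of the plain sum keeps growing as depth increases, while the weighted mix is a convex combination and so can never exceed the largest individual state norm.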

What Moonshot has not shown in the thread is a broad reproduction set outside Kimi Linear or a detailed public accounting of memory costs under different stack depths and block settings. The strongest current evidence remains Moonshot's own paper and thread-level summaries, including a recap from AlphaSignalAI that repeats the headline numbers but does not add independent results.
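Until such an accounting is published, a back-of-envelope estimate is one way to size the question for your own stack. The helper below counts only the vectors kept for depth lookback, one per layer or per block for every token. The function name and the configuration are hypothetical and are not Kimi Linear's actual dimensions.

```python
def lookback_cache_bytes(num_layers, d_model, tokens, bytes_per_value=2, block_size=1):
    """Bytes needed to keep one vector per layer (or per block) for every token."""
    kept = -(-num_layers // block_size)          # ceiling division
    return kept * d_model * tokens * bytes_per_value

# Hypothetical configuration: 32 layers, width 4096, 8192 tokens, 2-byte values.
for block_size in (1, 4, 8):
    gib = lookback_cache_bytes(32, 4096, 8192, block_size=block_size) / 2**30
    print(f"block_size={block_size}: {gib:.2f} GiB")
```

The estimate scales linearly with depth and inversely with block size, which is the dependency to check against real memory profiles before relying on it.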

🧾 More sources

What changed in the residual path? (2 tweets)

Covers the core mechanism: replacing fixed residual accumulation with attention over prior layers, plus the blockwise implementation detail.

What are the reported gains, and what is still unproven? (1 tweet)

Groups the benchmark, latency, and training-dynamics claims, while separating Moonshot's reported results from open questions about external validation.

Kimi.ai (@Kimi_Moonshot), 3:03 AM · Mar 16, 2026: "Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with …"

Kimi.ai (@Kimi_Moonshot), replying to @Kimi_Moonshot, 3:03 AM · Mar 16, 2026: "Scaling law experiments reveal a consistent 1.25× compute advantage across varying model sizes."

Kimi.ai (@Kimi_Moonshot), replying to @Kimi_Moonshot, 3:03 AM · Mar 16, 2026: "Analysis of training dynamics demonstrates how AttnRes naturally mitigates hidden-state magnitude growth and yields a more uniform gradient distribution across depth."
