Nous Research releases Lighthouse Attention: 1.4-1.7x faster pretraining at 98K context
Nous Research published Lighthouse Attention, a hierarchical selection layer that keeps the standard attention kernel while cutting end-to-end pretraining wall clock by 1.4-1.7x at 98K context. It also scales to 1M-token training across 32 Blackwell GPUs without a custom sparse kernel.

TL;DR
- Nous Research says Lighthouse Attention cuts end-to-end pretraining wall clock by 1.4 to 1.7 times at 98K context, and the same setup runs forward plus backward about 17 times faster than standard attention at 512K on a single B200, according to NousResearch's launch thread and NousResearch's scaling details.
- The core design keeps the inner attention kernel unchanged: pooled Q, K, and V build a pyramid, a top-k cascade picks a dense causal subsequence, and standard attention runs on that subsequence, as NousResearch's mechanism breakdown describes.
- Nous argues the interesting bit is architectural decoupling, because the selection logic sits outside the kernel, so upstream hardware and software attention improvements carry over automatically, per NousResearch's follow-up post.
- The team says the method scales past 100K context under context parallelism to 1M-token training across 32 Blackwell GPUs, with about 10 percent per-rank throughput overhead versus single-device extrapolation, according to NousResearch's context-parallel note.
- Nous also posted an erratum saying the paper's comparison text mentions FA3 and FA4, but the released experiments actually compare cuDNN plus Lighthouse against cuDNN plus SDPA, as NousResearch's erratum post notes.
Nous linked the paper, code, and blog post right in the launch thread via NousResearch's resource links. The numbers are unusually specific for a sparse-attention launch: 75 to 106 B200-hours saved per 50B-token run, a lower recovered loss than the dense-from-scratch baseline, and a claim that the checkpoint still behaves like a dense-attention model after a brief full-attention resume, all surfaced across NousResearch's scaling details and NousResearch's mechanism breakdown.
Selection layer
Lighthouse is a selection layer wrapped around standard scaled dot-product attention. Q, K, and V are average-pooled into a multi-level pyramid, each level is scored by head, and a coarse-to-fine top-k cascade chooses survivors.
Those survivors are gathered into a contiguous subsequence, sorted to preserve causality, then fed through ordinary dense attention. That design choice is the whole pitch: sparse selection without rewriting the kernel.
The thread breaks the layer into four stages, sketched in rough form after the list:
- Pool Q, K, and V into an L-level pyramid.
- Score entries with per-head norms.
- Run a coarse-to-fine top-k cascade.
- Gather, causally sort, attend densely, then scatter outputs back.
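A minimal, hypothetical PyTorch sketch of those stages, simplified to a single pyramid level and one shared selection per head (the real method cascades through multiple levels and handles per-query causality), shows where the stock kernel sits:

```python
# Toy selection wrapper in the spirit of the description above; not Nous' code.
import torch
import torch.nn.functional as F

def lighthouse_like_attention(q, k, v, block=128, keep_blocks=16):
    # q, k, v: [batch, heads, seq, dim]; seq assumed divisible by `block`.
    B, H, S, D = k.shape
    n_blocks = S // block

    # Stage 1: pool K into block-level summaries (a one-level "pyramid").
    k_pooled = k.reshape(B, H, n_blocks, block, D).mean(dim=3)   # [B, H, n_blocks, D]

    # Stage 2: score each pooled entry per head (norm as a proxy score).
    scores = k_pooled.norm(dim=-1)                               # [B, H, n_blocks]

    # Stage 3: top-k over blocks (the coarse step of the cascade).
    keep = min(keep_blocks, n_blocks)
    top = scores.topk(keep, dim=-1).indices.sort(dim=-1).values  # keep causal order

    # Stage 4: gather survivors into a contiguous subsequence, attend densely.
    token_idx = (top.unsqueeze(-1) * block
                 + torch.arange(block, device=k.device)).flatten(-2)  # [B, H, keep*block]
    gather_idx = token_idx.unsqueeze(-1).expand(-1, -1, -1, D)
    k_sel = k.gather(2, gather_idx)
    v_sel = v.gather(2, gather_idx)

    # Inner kernel untouched: any SDPA backend can serve this call. A real
    # implementation would also mask selected positions that lie in a query's
    # future and scatter outputs back; that bookkeeping is omitted here.
    return F.scaled_dot_product_attention(q, k_sel, v_sel)
```

The point of the sketch is the last line: everything before it is indexing, and the attention call itself is the stock kernel.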
Training recipe
Nous treats the recovery step as the load-bearing claim. Most of training runs with Lighthouse selection enabled, then a short tail disables selection and resumes under full standard attention, with the same optimizer state and the dataloader continuing from where it left off.
The reason is straightforward: the checkpoint still has to behave like a dense-attention model at inference time. According to NousResearch's training-recipe post, sparse pretraining is only useful here if the model survives that handoff cleanly.
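As a hedged sketch of that schedule (the toggle name and step counts are assumptions, not Nous' API; the model is assumed to return a HuggingFace-style loss):

```python
# Two-stage recipe as described in the thread, in schematic form.
# `set_sparse_selection` is an assumed toggle that flips every attention layer
# between Lighthouse selection and plain dense attention; weights, optimizer
# state, and the dataloader position carry over across the switch.
def train_with_recovery(model, optimizer, dataloader,
                        total_steps, recovery_steps):
    model.set_sparse_selection(True)              # Stage 1: selection enabled
    for step, batch in enumerate(dataloader):
        if step == total_steps - recovery_steps:
            # Stage 2: short tail under full standard attention, same state.
            model.set_sparse_selection(False)
        loss = model(**batch).loss                # assumes an output with .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step + 1 == total_steps:
            break
```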
Context parallelism
Beyond roughly 100K context, Nous says Lighthouse runs under context parallelism with shard-local pyramid pooling, scoring, and top-k. Because the gathered subsequence stays dense, it can slot into ring attention without sparse-aware collectives.
The reported setup reaches 1M-token training across 32 Blackwell GPUs, four nodes at CP degree 8. NousResearch's scaling details post also puts the per-rank throughput overhead at about 10 percent versus a single-device extrapolation.
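Schematically, and with the helper names assumed rather than taken from the release, the shard-local structure looks like this:

```python
# Schematic only, not the released code: under context parallelism each rank
# runs the selection steps on its own shard, and the result stays dense, so
# the existing ring-attention path can consume it unchanged.
# `select_dense_subsequence` stands in for the pool/score/top-k/gather steps
# sketched earlier; `ring_attention` is an assumed context-parallel primitive.
def cp_lighthouse_attention(q_shard, k_shard, v_shard,
                            select_dense_subsequence, ring_attention):
    # Selection is shard-local: no cross-rank collectives, no sparse-aware
    # communication patterns.
    k_sel, v_sel = select_dense_subsequence(k_shard, v_shard)
    # The selected K/V are ordinary dense tensors, so the usual ring exchange
    # of K/V between CP ranks applies as-is.
    return ring_attention(q_shard, k_sel, v_sel)
```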
The same post adds the concrete throughput and cost numbers, with a rough arithmetic cross-check after the list:
- At 512K context on one B200, Lighthouse runs the same forward plus backward pass about 17 times faster than standard attention.
- At 98K context, end-to-end pretraining speedup lands at 1.4 to 1.7 times wall clock.
- Over a 50B-token run, that translates to 75 to 106 B200-hours saved.
- Stage-1 throughput reaches 84k to 126k tokens per second per GPU, versus about 46k for dense SDPA.
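The cross-check pairs the range endpoints (1.4x with 75 hours saved, 1.7x with 106) and backs out the dense baseline each pairing implies; that baseline is a derived figure, not one Nous published.

```python
# Back-of-envelope consistency check on the reported figures, assuming the
# range endpoints pair up. Only the speedups and saved-hours come from the
# posts; the implied dense baseline is derived here.
for speedup, saved_hours in [(1.4, 75), (1.7, 106)]:
    # saved = baseline - baseline / speedup  =>  baseline = saved / (1 - 1/speedup)
    baseline = saved_hours / (1 - 1 / speedup)
    print(f"{speedup}x and {saved_hours} h saved imply a "
          f"~{baseline:.0f} B200-hour dense 50B-token run")
# Both endpoints land near ~260 B200-hours, so the speedup range and the
# savings range are mutually consistent.
```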
Kernel decoupling
Nous' sharpest opinion is that more sparse-attention systems should be built without custom kernels. In NousResearch's decoupling post, the team says separating selection logic from the attention kernel lets Lighthouse inherit future kernel improvements automatically, while also keeping training and inference on the same kernel.
That is a cleaner story than many sparse-attention papers manage to tell. The launch keeps coming back to the same engineering constraint: if the sparse trick forces a bespoke kernel path, every upstream improvement becomes somebody else's integration problem.
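One way to make the claim concrete, as a sketch rather than Nous' actual wrapper: if selection only hands over plain dense tensors, backend choice is the stock PyTorch mechanism, and nothing in the sparse layer has to change when a faster kernel ships.

```python
# Illustration of the decoupling argument with a hypothetical wrapper (this is
# not Nous' API). Selection happens upstream and delivers dense q/k/v, so the
# inner call is stock PyTorch SDPA and picks up whichever backend the stack
# dispatches to. Needs a recent PyTorch for the cuDNN backend enum.
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def attend_selected(q, k_sel, v_sel):
    # Preference order only; a faster future backend slots in with zero
    # changes to the selection layer.
    with sdpa_kernel([SDPBackend.CUDNN_ATTENTION, SDPBackend.FLASH_ATTENTION,
                      SDPBackend.MATH]):
        return F.scaled_dot_product_attention(q, k_sel, v_sel)
```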
Benchmark erratum
The launch thread also carries a paper correction. Nous says the paper text refers to FA3 and FA4, but the actual comparison experiments were run as cuDNN plus Lighthouse versus cuDNN plus SDPA, and that the released code reflects that setup.
That matters because Lighthouse's sales pitch is partly about not needing a custom sparse kernel. The correction narrows the benchmark framing to the comparison Nous says it actually ran, while the public release still includes the paper, code, and blog post linked by NousResearch's resource links.