
FlashAttention-4 benchmarks 1613 TFLOPs/s on B200, 1.3x over cuDNN 9.13

FlashAttention-4 targets Blackwell bottlenecks with redesigned pipelines, software-emulated exponential work, and lower shared-memory traffic, reaching up to 1613 TFLOPs/s on B200. If you serve long-context models on B200 or GB200, benchmark it against your current cuDNN and Triton kernels before optimizing elsewhere.


TL;DR

  • FlashAttention-4 is aimed squarely at Blackwell’s new imbalance: tensor-core throughput doubled, while shared-memory bandwidth and exponential-unit throughput did not, so attention kernels now bottleneck on data movement and softmax work, according to the paper thread.
  • The reported fix is a co-design of algorithm and kernel scheduling: redesigned asynchronous pipelines, software-emulated exponential work, and lower shared-memory traffic in the backward pass, as described in the abstract screenshot.
  • On B200 with BF16, the authors report up to 1.3x over cuDNN 9.13 and 2.7x over Triton, reaching 1613 TFLOPs/s and about 71% utilization, per the benchmark thread.
  • The implementation detail that may matter to teams extending kernels is that FlashAttention-4 was built in the Python-embedded CuTe-DSL, which the abstract screenshot credits with 20-30x faster compile times than traditional C++ template approaches.

What changed on Blackwell

FlashAttention-4 is not pitched as a generic attention refresh. It is a Blackwell-specific response to “asymmetric hardware scaling,” where matrix math got much faster but memory movement and non-matmul units did not keep up, as the abstract screenshot spells out. That shifts the bottleneck away from pure compute and toward shared-memory traffic, softmax, and other non-matmul operations.

The paper summary in the thread says the kernel attacks that bottleneck three ways: overlapping math and memory loading with a new asynchronous schedule, moving some exponential work into software-emulated paths, and using tensor memory plus 2-CTA MMA mode to cut shared-memory traffic and atomic adds in the backward pass. The same paper thread says those changes push B200 to 1600+ TFLOPs/s, ahead of both cuDNN and Triton on the reported setup.
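
To make the "software-emulated exponential" idea concrete, here is a rough illustration, not the FA4 kernel: in online softmax the exponent argument is shifted by the running row maximum, so it stays bounded, and a low-degree polynomial can stand in for the hardware special-function unit. The range reduction, cubic coefficients, and plain-PyTorch framing below are assumptions for this sketch only.

```python
import torch

def exp2_poly(x: torch.Tensor) -> torch.Tensor:
    """Approximate 2**x via range reduction plus a cubic polynomial.

    Illustrative only: the integer part goes through the exponent bits
    (ldexp) and the fractional part in [0, 1) uses a Taylor-style cubic.
    This is not FlashAttention-4's actual emulation scheme.
    """
    n = torch.floor(x)              # integer part, handled exactly by ldexp
    f = x - n                       # fractional part in [0, 1)
    p = 1.0 + f * (0.6931472 + f * (0.2402265 + f * 0.0555041))
    return torch.ldexp(p, n.to(torch.int32))  # p * 2**n

# Online-softmax style usage: scores are shifted by the running max,
# so the exponent argument is bounded above by zero.
LOG2E = 1.4426950408889634          # e**x == 2**(x * log2(e))
scores = torch.randn(4, 128)
m = scores.max(dim=-1, keepdim=True).values
approx = exp2_poly((scores - m) * LOG2E)
exact = torch.exp(scores - m)
print((approx - exact).abs().max()) # worst case ~1e-2 for this cubic on the bounded range
```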

One implementation detail stands out beyond the speedup number. The abstract screenshot says the whole kernel was written in CuTe-DSL embedded in Python, with 20-30x faster compile times than C++ template-based implementations while keeping full expressivity. For engineers tuning long-context inference or training on B200 and GB200, that makes this story about iteration speed as much as raw throughput.
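
If you want to act on the benchmarking advice, a quick apples-to-apples timing run on your own shapes is the natural first step. The sketch below only pins PyTorch's existing cuDNN and FlashAttention SDPA backends as stand-ins (it assumes a recent PyTorch that exposes SDPBackend.CUDNN_ATTENTION); the batch, head, and sequence sizes are hypothetical, and a FlashAttention-4 build would be timed the same way once it is available in your stack.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Hypothetical long-context shape; replace with your serving workload.
B, H, S, D = 1, 16, 8192, 128
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

def time_backend(backend: SDPBackend, iters: int = 20) -> float:
    """Return milliseconds per forward attention call with one backend pinned."""
    with sdpa_kernel(backend):
        for _ in range(3):  # warmup
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

for name, backend in [("cudnn", SDPBackend.CUDNN_ATTENTION),
                      ("flash", SDPBackend.FLASH_ATTENTION)]:
    print(f"{name}: {time_backend(backend):.2f} ms")
```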
