
FlashAttention-4 benchmarks 1613 TFLOPs/s on B200, 1.3x over cuDNN 9.13

FlashAttention-4 targets Blackwell bottlenecks with redesigned pipelines, software-emulated exponential work, and lower shared-memory traffic, reaching up to 1613 TFLOPs/s on B200. If you serve long-context models on B200 or GB200, benchmark it against your current cuDNN and Triton kernels before optimizing elsewhere.


TL;DR

  • FlashAttention-4 is aimed squarely at Blackwell’s new imbalance: tensor-core throughput doubled, while shared-memory bandwidth and exponential-unit throughput did not, so attention kernels now bottleneck on data movement and softmax work, according to the paper thread.
  • The reported fix is a co-design of algorithm and kernel scheduling: redesigned asynchronous pipelines, software-emulated exponential work, and lower shared-memory traffic in the backward pass, as described in the abstract screenshot.
  • On B200 with BF16, the authors report up to 1.3x over cuDNN 9.13 and 2.7x over Triton, reaching 1613 TFLOPs/s and about 71% utilization, per the benchmark thread.
  • The implementation detail that may matter to teams extending kernels is that FlashAttention-4 was built in the Python-embedded CuTe-DSL, which the abstract screenshot credits with 20-30x faster compile times than traditional C++ template approaches.

What changed on Blackwell

FlashAttention-4 is not pitched as a generic attention refresh. It is a Blackwell-specific response to “asymmetric hardware scaling,” where matrix math got much faster but memory movement and non-matmul units did not keep up, as the abstract screenshot spells out. That shifts the bottleneck away from pure compute and toward shared-memory traffic, softmax, and other non-matmul operations.

The paper summary in the thread says the kernel attacks that bottleneck three ways: overlapping math and memory loading with a new asynchronous schedule, moving some exponential work into software-emulated paths, and using tensor memory plus 2-CTA MMA mode to cut shared-memory traffic and atomic adds in the backward pass. The same paper thread says those changes push B200 to 1600+ TFLOPs/s, ahead of both cuDNN and Triton on the reported setup.
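
To make the "software-emulated exponential" idea concrete, here is a rough illustration, not the FA4 kernel: in online softmax the exponent argument is shifted by the running row maximum, so it stays bounded, and a low-degree polynomial can stand in for the hardware special-function unit. The range reduction, cubic coefficients, and plain-PyTorch framing below are assumptions for this sketch only.

```python
import torch

def exp2_poly(x: torch.Tensor) -> torch.Tensor:
    """Approximate 2**x via range reduction plus a cubic polynomial.

    Illustrative only: the integer part goes through the exponent bits
    (ldexp) and the fractional part in [0, 1) uses a Taylor-style cubic.
    This is not FlashAttention-4's actual emulation scheme.
    """
    n = torch.floor(x)              # integer part, handled exactly by ldexp
    f = x - n                       # fractional part in [0, 1)
    p = 1.0 + f * (0.6931472 + f * (0.2402265 + f * 0.0555041))
    return torch.ldexp(p, n.to(torch.int32))  # p * 2**n

# Online-softmax style usage: scores are shifted by the running max,
# so the exponent argument is bounded above by zero.
LOG2E = 1.4426950408889634          # e**x == 2**(x * log2(e))
scores = torch.randn(4, 128)
m = scores.max(dim=-1, keepdim=True).values
approx = exp2_poly((scores - m) * LOG2E)
exact = torch.exp(scores - m)
print((approx - exact).abs().max()) # worst case ~1e-2 for this cubic on the bounded range
```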

One implementation detail stands out beyond the speedup number. The abstract screenshot says the whole kernel was written in CuTe-DSL embedded in Python, with 20-30x faster compile times than C++ template-based implementations while keeping full expressivity. For engineers tuning long-context inference or training on B200 and GB200, that makes this story about iteration speed as much as raw throughput.
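
If you want to act on the benchmarking advice, a quick apples-to-apples timing run on your own shapes is the natural first step. The sketch below only pins PyTorch's existing cuDNN and FlashAttention SDPA backends as stand-ins (it assumes a recent PyTorch that exposes SDPBackend.CUDNN_ATTENTION); the batch, head, and sequence sizes are hypothetical, and a FlashAttention-4 build would be timed the same way once it is available in your stack.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Hypothetical long-context shape; replace with your serving workload.
B, H, S, D = 1, 16, 8192, 128
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

def time_backend(backend: SDPBackend, iters: int = 20) -> float:
    """Return milliseconds per forward attention call with one backend pinned."""
    with sdpa_kernel(backend):
        for _ in range(3):  # warmup
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

for name, backend in [("cudnn", SDPBackend.CUDNN_ATTENTION),
                      ("flash", SDPBackend.FLASH_ATTENTION)]:
    print(f"{name}: {time_backend(backend):.2f} ms")
```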
