FlashAttention-4 targets Blackwell bottlenecks with redesigned pipelines, software-emulated exponential work, and lower shared-memory traffic, reaching up to 1613 TFLOP/s on B200. If you serve long-context models on B200 or GB200, benchmark it against your current cuDNN and Triton kernels before optimizing elsewhere.

FlashAttention-4 is not pitched as a generic attention refresh. It is a Blackwell-specific response to “asymmetric hardware scaling,” where matrix math got much faster but memory movement and non-matmul units did not keep up, as the abstract screenshot spells out. That shifts the bottleneck away from pure compute and toward shared-memory traffic, softmax, and other non-matmul operations.
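The softmax-side bottleneck is easier to see with the tiling trick the FlashAttention family is built on: softmax can be computed in streaming chunks by carrying only a running max and a running sum, so the full attention matrix never has to be materialized. A minimal NumPy sketch of that online rescaling (illustrative only, not the FA4 kernel):

```python
import numpy as np

def online_softmax(scores, tile=4):
    """Numerically stable softmax computed tile by tile.

    Streams over `scores` in chunks, keeping only a running max `m` and
    a running sum `s` of exp(x - m) -- the rescaling trick that lets
    FlashAttention-style kernels process attention in tiles.
    """
    m = -np.inf          # running max
    s = 0.0              # running sum of exp(x - m)
    for start in range(0, len(scores), tile):
        chunk = scores[start:start + tile]
        m_new = max(m, chunk.max())
        # rescale the old sum to the new max before adding the chunk
        s = s * np.exp(m - m_new) + np.exp(chunk - m_new).sum()
        m = m_new
    return np.exp(scores - m) / s

x = np.array([1.0, 3.0, -2.0, 0.5, 4.0, 1.5])
y = online_softmax(x)
```

Every tile only needs one extra multiply to rescale the accumulated sum, which is why the per-tile exponential work (not the matmuls) becomes the thing worth optimizing.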
The paper summary in the thread says the kernel attacks that bottleneck in three ways: overlapping math with memory loading via a new asynchronous schedule, moving some exponential work onto software-emulated paths, and using tensor memory plus the 2-CTA MMA mode to cut shared-memory traffic and atomic adds in the backward pass. The same thread reports that these changes push B200 to 1600+ TFLOP/s, ahead of both cuDNN and Triton on the reported setup.
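To make "software-emulated exponential work" concrete: instead of routing every exp through the GPU's special-function unit, the fraction part of the exponent can be approximated with a short polynomial (cheap multiply-adds), while the integer part is applied exactly as a power-of-two scale. The sketch below illustrates the general technique only; the coefficients and structure are a simple fit for illustration, not the ones FlashAttention-4 actually uses.

```python
import math

def exp2_poly(x):
    """Approximate 2**x via range reduction plus a degree-3 polynomial.

    Split x = n + f with integer n and f in [0, 1), evaluate 2**f with a
    small polynomial (three fused multiply-adds of the kind any ALU can
    do), then apply 2**n exactly with ldexp. Coefficients are a simple
    illustrative fit, not a real kernel's constants.
    """
    n = math.floor(x)
    f = x - n
    # polynomial fit of 2**f on [0, 1); exact at f = 0 and f = 1
    p = 1.0 + f * (0.6958 + f * (0.2252 + f * 0.0790))
    return math.ldexp(p, n)
```

The trade: a few extra multiply-adds in exchange for not serializing on a scarce special-function unit, which matters when matmul throughput has grown faster than everything else on the chip.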
One implementation detail stands out beyond the speedup number. The abstract screenshot says the whole kernel was written in CuTe-DSL embedded in Python, with 20-30x faster compile times than C++ template-based implementations while keeping full expressivity. For engineers tuning long-context inference or training on B200 and GB200, that makes this story about iteration speed as much as raw throughput.
Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.
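The "memory tiering" framing is the useful part: MoE routing touches only a few experts per token, so most expert weights can live on flash storage with a small resident cache in RAM. A toy sketch of that idea (class and parameter names here are hypothetical, not Flash-MoE's API):

```python
from collections import OrderedDict

class ExpertCache:
    """Toy model of SSD-tiered MoE weights: keep only `capacity` experts
    resident, load the rest on demand. `loader` stands in for reading an
    expert's weights from flash storage."""

    def __init__(self, loader, capacity=2):
        self.loader = loader
        self.capacity = capacity
        self.resident = OrderedDict()   # expert_id -> weights, LRU order
        self.loads = 0                  # count of simulated "SSD" reads

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark recently used
        else:
            self.loads += 1
            self.resident[expert_id] = self.loader(expert_id)
            if len(self.resident) > self.capacity:
                self.resident.popitem(last=False)  # evict least recent
        return self.resident[expert_id]

# Simulate routing: repeated hits on a hot expert avoid repeated loads.
cache = ExpertCache(loader=lambda i: [float(i)] * 4, capacity=2)
for eid in [0, 1, 0, 0, 2, 0]:
    cache.get(eid)
```

The 0.6 tokens/s figure is what this trade-off looks like when the loader is a phone's flash path: correctness scales past RAM, but every cache miss costs a storage read on the critical path.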
OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
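The core idea behind this class of index is candidate filtering: build an inverted index from trigrams to files, intersect the postings lists for a query's trigrams, and run the expensive regex only over files that could possibly match. A simplified sketch of that mechanism (not Cursor's implementation; real engines also extract literal trigrams from the regex itself and layer Bloom filters on top):

```python
from collections import defaultdict

def trigrams(text):
    """All 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Inverted index from trigram -> file ids, used to prune the set of
    files a regex scan must touch."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.files = {}

    def add(self, file_id, text):
        self.files[file_id] = text
        for g in trigrams(text):
            self.postings[g].add(file_id)

    def candidates(self, literal):
        """Files that contain every trigram of `literal` -- a superset
        of the files that actually contain `literal`."""
        grams = trigrams(literal)
        if not grams:                   # query too short to filter
            return set(self.files)
        return set.intersection(*(self.postings[g] for g in grams))

idx = TrigramIndex()
idx.add("a.py", "def flash_attention(q, k, v): ...")
idx.add("b.py", "print('hello world')")
```

The index can only say "definitely absent" or "maybe present", so a verification scan still runs at the end; the speedup comes from how few files survive the intersection on a large repo.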
ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
The paper itself frames FlashAttention-4 as making AI run faster on the newest generation of chips: researchers from Princeton University, Meta, NVIDIA, and elsewhere developed new pipelines, re-engineered core computations, and optimized memory usage to get there.