FlashAttention-4 targets Blackwell bottlenecks with redesigned pipelines, software-emulated exponential work, and lower shared-memory traffic, reaching up to 1613 TFLOP/s on B200. If you serve long-context models on B200 or GB200, benchmark it against your current cuDNN and Triton kernels before optimizing elsewhere.

FlashAttention-4 is not pitched as a generic attention refresh. It is a Blackwell-specific response to “asymmetric hardware scaling,” where matrix math got much faster but memory movement and non-matmul units did not keep up, as the abstract screenshot spells out. That shifts the bottleneck away from pure compute and toward shared-memory traffic, softmax, and other non-matmul operations.
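The softmax-side bottleneck is easier to see with the tiling trick the FlashAttention family is built on: softmax can be computed in streaming chunks by carrying only a running max and a running sum, so the full attention matrix never has to be materialized. A minimal NumPy sketch of that online rescaling (illustrative only, not the FA4 kernel):

```python
import numpy as np

def online_softmax(scores, tile=4):
    """Numerically stable softmax computed tile by tile.

    Streams over `scores` in chunks, keeping only a running max `m` and
    a running sum `s` of exp(x - m) -- the rescaling trick that lets
    FlashAttention-style kernels process attention in tiles.
    """
    m = -np.inf          # running max
    s = 0.0              # running sum of exp(x - m)
    for start in range(0, len(scores), tile):
        chunk = scores[start:start + tile]
        m_new = max(m, chunk.max())
        # rescale the old sum to the new max before adding the chunk
        s = s * np.exp(m - m_new) + np.exp(chunk - m_new).sum()
        m = m_new
    return np.exp(scores - m) / s

x = np.array([1.0, 3.0, -2.0, 0.5, 4.0, 1.5])
y = online_softmax(x)
```

Every tile only needs one extra multiply to rescale the accumulated sum, which is why the per-tile exponential work (not the matmuls) becomes the thing worth optimizing.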
The paper summary in the thread says the kernel attacks that bottleneck in three ways: overlapping math with memory loading via a new asynchronous schedule, moving some exponential work onto software-emulated paths, and using tensor memory plus the 2-CTA MMA mode to cut shared-memory traffic and atomic adds in the backward pass. The same thread reports that these changes push B200 to 1600+ TFLOP/s, ahead of both cuDNN and Triton on the reported setup.
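To make "software-emulated exponential work" concrete: instead of routing every exp through the GPU's special-function unit, the fraction part of the exponent can be approximated with a short polynomial (cheap multiply-adds), while the integer part is applied exactly as a power-of-two scale. The sketch below illustrates the general technique only; the coefficients and structure are a simple fit for illustration, not the ones FlashAttention-4 actually uses.

```python
import math

def exp2_poly(x):
    """Approximate 2**x via range reduction plus a degree-3 polynomial.

    Split x = n + f with integer n and f in [0, 1), evaluate 2**f with a
    small polynomial (three fused multiply-adds of the kind any ALU can
    do), then apply 2**n exactly with ldexp. Coefficients are a simple
    illustrative fit, not a real kernel's constants.
    """
    n = math.floor(x)
    f = x - n
    # polynomial fit of 2**f on [0, 1); exact at f = 0 and f = 1
    p = 1.0 + f * (0.6958 + f * (0.2252 + f * 0.0790))
    return math.ldexp(p, n)
```

The trade: a few extra multiply-adds in exchange for not serializing on a scarce special-function unit, which matters when matmul throughput has grown faster than everything else on the chip.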
One implementation detail stands out beyond the speedup number. The abstract screenshot says the whole kernel was written in CuTe-DSL embedded in Python, with 20-30x faster compile times than C++ template-based implementations while keeping full expressivity. For engineers tuning long-context inference or training on B200 and GB200, that makes this story about iteration speed as much as raw throughput.
Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.
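The "memory tiering" framing is the useful part: MoE routing touches only a few experts per token, so most expert weights can live on flash storage with a small resident cache in RAM. A toy sketch of that idea (class and parameter names here are hypothetical, not Flash-MoE's API):

```python
from collections import OrderedDict

class ExpertCache:
    """Toy model of SSD-tiered MoE weights: keep only `capacity` experts
    resident, load the rest on demand. `loader` stands in for reading an
    expert's weights from flash storage."""

    def __init__(self, loader, capacity=2):
        self.loader = loader
        self.capacity = capacity
        self.resident = OrderedDict()   # expert_id -> weights, LRU order
        self.loads = 0                  # count of simulated "SSD" reads

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark recently used
        else:
            self.loads += 1
            self.resident[expert_id] = self.loader(expert_id)
            if len(self.resident) > self.capacity:
                self.resident.popitem(last=False)  # evict least recent
        return self.resident[expert_id]

# Simulate routing: repeated hits on a hot expert avoid repeated loads.
cache = ExpertCache(loader=lambda i: [float(i)] * 4, capacity=2)
for eid in [0, 1, 0, 0, 2, 0]:
    cache.get(eid)
```

The 0.6 tokens/s figure is what this trade-off looks like when the loader is a phone's flash path: correctness scales past RAM, but every cache miss costs a storage read on the critical path.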
OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
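The core idea behind this class of index is candidate filtering: build an inverted index from trigrams to files, intersect the postings lists for a query's trigrams, and run the expensive regex only over files that could possibly match. A simplified sketch of that mechanism (not Cursor's implementation; real engines also extract literal trigrams from the regex itself and layer Bloom filters on top):

```python
from collections import defaultdict

def trigrams(text):
    """All 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Inverted index from trigram -> file ids, used to prune the set of
    files a regex scan must touch."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.files = {}

    def add(self, file_id, text):
        self.files[file_id] = text
        for g in trigrams(text):
            self.postings[g].add(file_id)

    def candidates(self, literal):
        """Files that contain every trigram of `literal` -- a superset
        of the files that actually contain `literal`."""
        grams = trigrams(literal)
        if not grams:                   # query too short to filter
            return set(self.files)
        return set.intersection(*(self.postings[g] for g in grams))

idx = TrigramIndex()
idx.add("a.py", "def flash_attention(q, k, v): ...")
idx.add("b.py", "print('hello world')")
```

The index can only say "definitely absent" or "maybe present", so a verification scan still runs at the end; the speedup comes from how few files survive the intersection on a large repo.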
ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
The paper itself frames FlashAttention-4 as making AI run faster on the newest generation of chips: researchers from Princeton University, Meta, NVIDIA, and elsewhere developed new pipelines, re-engineered core computations, and optimized memory usage to get there.