Researchers released DistCA, a training system that offloads stateless core attention to dedicated servers and reports up to 1.35x throughput gains on long-context workloads. Evaluate it for very long-sequence training where attention imbalance strands GPUs and creates pipeline stalls.

DistCA’s core claim is that long-context training breaks because the expensive part of attention does not scale like the rest of the layer. HAO AI Lab’s imbalance thread says core attention is O(n²) while “everything else is roughly O(n),” so “same tokens per GPU” no longer means same work per GPU when document lengths vary.
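A small back-of-the-envelope sketch makes the imbalance concrete. It assumes documents are packed so that attention only runs within each document, and it drops all constant factors; the function names and the example document lengths are illustrative and do not come from the DistCA code.

```python
def linear_cost(doc_lengths):
    """Relative cost of the roughly O(n) parts of a layer: total token count."""
    return sum(doc_lengths)


def attention_cost(doc_lengths):
    """Relative cost of core attention: sum of n^2 over documents."""
    return sum(n * n for n in doc_lengths)


# Two hypothetical GPUs holding the same number of tokens.
gpu_a = [65536]       # one 64K-token document
gpu_b = [4096] * 16   # sixteen 4K-token documents

same_tokens = linear_cost(gpu_a) == linear_cost(gpu_b)
ratio = attention_cost(gpu_a) / attention_cost(gpu_b)

print(same_tokens)  # True
print(ratio)        # 16.0
```

Both GPUs see 65,536 tokens, yet under these assumptions the GPU with the single long document does 16 times the core-attention work. A causal mask roughly halves both sides and leaves the ratio unchanged.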
The system response is Core Attention Disaggregation. In the team’s DistCA thread, the stateless core attention step is pulled out of the layer, split into token-level tasks, and sent to dedicated attention servers. The linked DistCA repo describes this as dynamic rebalancing for the softmax(QKᵀ)V path, with examples and benchmarking scripts for multi-node long-context runs.
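The rebalancing idea can be sketched with a toy scheduler. This is not DistCA's actual algorithm: it assumes causal attention within each document, splits each document into fixed-size query chunks, and places chunks on the least-loaded server with a greedy longest-task-first heuristic. All names (`causal_chunk_cost`, `split_into_tasks`, `assign_greedy`) and the chunk size are hypothetical.

```python
import heapq


def causal_chunk_cost(start, end):
    """Number of (query, key) pairs for queries in [start, end) under a causal mask.

    Query q attends to keys 0..q, so the cost is the sum of (q + 1) over the chunk.
    """
    return (end * (end + 1) - start * (start + 1)) // 2


def split_into_tasks(doc_lengths, chunk):
    """Cut every document into query chunks; each chunk is one independent task."""
    tasks = []
    for doc_id, n in enumerate(doc_lengths):
        for start in range(0, n, chunk):
            end = min(start + chunk, n)
            tasks.append((causal_chunk_cost(start, end), doc_id, start, end))
    return tasks


def assign_greedy(tasks, num_servers):
    """Place the most expensive tasks first, always on the least-loaded server."""
    heap = [(0, s) for s in range(num_servers)]
    heapq.heapify(heap)
    loads = [0] * num_servers
    placement = {s: [] for s in range(num_servers)}
    for cost, doc_id, start, end in sorted(tasks, reverse=True):
        load, server = heapq.heappop(heap)
        placement[server].append((doc_id, start, end))
        loads[server] = load + cost
        heapq.heappush(heap, (loads[server], server))
    return loads, placement


docs = [8192, 1024, 1024, 1024]   # one long document, three short ones
tasks = split_into_tasks(docs, chunk=512)
loads, placement = assign_greedy(tasks, num_servers=4)

# Baseline: one whole document per server, so the long document is the straggler.
naive_max = max(causal_chunk_cost(0, n) for n in docs)
print(max(loads) < naive_max)  # True
```

Because core attention carries no trainable state, a chunk can in principle run on any server that receives its queries plus the keys and values it attends to. The sketch ignores that transfer cost, which a real scheduler would have to weigh against the balance gain.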
The reported upside is mostly about reducing stragglers and pipeline bubbles rather than changing the model itself. The paper summary says DistCA uses ping-pong execution to overlap communication with compute and in-place execution on attention servers to cut memory use, aiming to keep utilization balanced as context length rises.
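To see why overlapping communication with compute helps, consider a generic double-buffering timing model. This is an assumption-laden illustration of the overlap idea, not DistCA's scheduler: it assumes a single serial link, two buffers, and fixed per-chunk transfer and compute times, and the function names are hypothetical.

```python
def serial_time(n, comm, compute):
    """Each chunk is transferred, then computed, with no overlap."""
    return n * (comm + compute)


def ping_pong_time(n, comm, compute):
    """Two buffers: chunk i+1 is transferred while chunk i is being computed."""
    if n == 0:
        return 0.0
    comm_done = [0.0] * n
    comp_done = [0.0] * n
    for i in range(n):
        link_free = comm_done[i - 1] if i >= 1 else 0.0
        # A buffer is reusable only after the chunk two steps back has finished.
        buffer_free = comp_done[i - 2] if i >= 2 else 0.0
        comm_done[i] = max(link_free, buffer_free) + comm
        prev_comp = comp_done[i - 1] if i >= 1 else 0.0
        comp_done[i] = max(comm_done[i], prev_comp) + compute
    return comp_done[-1]


print(serial_time(4, comm=1.0, compute=3.0))     # 16.0
print(ping_pong_time(4, comm=1.0, compute=3.0))  # 13.0
```

In this model, when compute dominates, only the first transfer stays exposed and the rest hides behind computation. When transfer dominates, the link becomes the bottleneck and overlap helps less.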
The headline number is up to 1.35x higher training throughput over prior methods, with the thread separately claiming “almost 2x speedup compared to Megatron.” The paper summary ties those results to a 512-H200 setup at context lengths up to 512K tokens, which makes this most relevant for teams already hitting long-sequence imbalance rather than general-purpose short-context training. As a systems pattern, it rhymes with FlashSampling as described in Cedric Chee’s thread: fuse or disaggregate the expensive step so the bandwidth-bound bottleneck stops dominating runtime.
At #NVIDIAGTC, Jensen showed the industry where AI infra is heading: disaggregate the stack. NVIDIA’s Groq LPX push applies this to inference with Attention–FFN Disaggregation. Our view: this idea matters even more for long-context LLM training. 🧵 github.com/hao-ai-lab/Dis…
📄Paper: "Efficient Long-context LLM Training via Core Attention Disaggregation" arxiv.org/abs/2510.18121 🌐Repo: github.com/hao-ai-lab/Dis…