Breaking · March 16, 2026

DistCA claims 1.35x long-context training gains with disaggregated core attention

Researchers released DistCA, a training system that offloads stateless core attention to dedicated servers and reports up to 1.35x throughput gains on long-context workloads. Evaluate it for very long-sequence training where attention imbalance strands GPUs and creates pipeline stalls.

LLM Serving, Inference Optimization, GPU Infrastructure

TL;DR

HAO AI Lab introduced DistCA, a long-context training system that treats core attention as a separate service and says the stateless softmax(QKᵀ)V step can be offloaded to dedicated attention servers (DistCA thread).
The pitch is a systems one: attention compute grows quadratically with sequence length while most other work grows closer to linearly, so splitting tokens equally can still leave some GPUs stuck on much heavier batches and cause cluster-wide idle time (imbalance thread).
The team reports an “almost 2x” speedup over Megatron in its thread and up to 1.35x over state-of-the-art methods in the linked paper summary, with experiments described on 512 H200 GPUs and contexts of up to 512K tokens (reported gains).
The broader theme is disaggregation: alongside a separate