Breaking · March 25, 2026

Google Research launches TurboQuant: 6x KV-cache compression, 8x faster H100 attention

TurboQuant claims 6x KV-cache memory reduction and up to 8x faster attention on H100s without retraining or quality loss on long-context tasks. If those results hold in serving stacks, teams should revisit long-context cost, capacity, and vector-search design.

LLM Serving · Inference Optimization · KV Cache · GPU Infrastructure

4 min read

TL;DR

Google Research says TurboQuant can compress LLM KV caches to 3 bits with no retraining, cutting memory by about 6x while matching full-precision results on long-context and retrieval benchmarks, according to Google's launch post and a benchmark recap.
The headline serving claim is speed: Google's research post reports up to 8x faster attention computation on H100s, while the accompanying chart thread shows speedups rising with longer sequence lengths.
This is not just a KV-cache story. Google's launch post