TurboQuant: Online Vector Quantization for KV Cache Compression
Date: 2026-04-22
Source: HuggingFace Daily Papers / ICLR 2026
Paper: arXiv 2504.19874
Raw: (parallel daily digest 2026-04-22)
TL;DR
TurboQuant (Google, ICLR 2026) compresses the KV cache to 2.5 bits per channel, a 6x memory reduction, with only marginal downstream quality loss; at 3.5 bits per channel the reported quality is neutral. The method works online (no offline calibration): it applies a random rotation so that coordinates follow a known, quantization-friendly distribution, quantizes each coordinate with an optimal scalar quantizer, and adds a 1-bit QJL sketch of the residual so that inner-product estimates are unbiased. Community integrations with vLLM and llama.cpp are already appearing even though there is no official implementation.
Method
Input KV vector
│
Random rotation (input-agnostic, computed once)
│
Concentrated Beta distribution on coordinates
│
Optimal scalar quantizer per coordinate (MSE-optimal for Beta)
│
1-bit Quantized Johnson-Lindenstrauss transform on residual
│
Unbiased inner product quantizer
→ 2.5 bits/channel, 6x memory reduction
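The first two stages of the pipeline above can be checked numerically. The sketch below is an illustration, not the paper's code: it assumes the rotation is a dense orthogonal matrix drawn via QR of a Gaussian matrix, and `random_rotation`, the dimension 128, and the one-hot test vector are all hypothetical choices. For a unit vector under such a rotation, each squared coordinate follows a Beta(1/2, (d-1)/2) distribution, so the coordinates concentrate around zero at scale 1/sqrt(d).

```python
import numpy as np

def random_rotation(d, rng):
    """Draw a d x d orthogonal matrix via QR of a Gaussian matrix."""
    g = rng.standard_normal((d, d))
    q, r = np.linalg.qr(g)
    # Fix column signs so the matrix is uniformly (Haar) distributed.
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
d = 128
rot = random_rotation(d, rng)    # computed once, independent of the data

# A deliberately awkward unit vector: all energy in one outlier channel.
x = np.zeros(d)
x[0] = 1.0

y = rot @ x                      # rotated vector, same norm as x

print("norm before/after:", np.linalg.norm(x), np.linalg.norm(y))
print("max |coord| before:", np.abs(x).max())
print("max |coord| after: ", np.abs(y).max())
print("coord std after vs 1/sqrt(d):", y.std(), 1 / np.sqrt(d))
```

The rotation preserves the norm while spreading the single outlier across all channels, which is what makes one fixed per-coordinate quantizer workable.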
The core insight: raw KV vectors have awkward distributions that don't quantize cleanly. A random rotation transforms them so that each coordinate follows a predictable, concentrated Beta distribution. Because that distribution is known in advance, an MSE-optimal scalar quantizer for each coordinate can be precomputed from the distribution itself, with no calibration data. An MSE-optimal quantizer on its own gives biased inner-product estimates, so a 1-bit QJL transform is applied to the residual to make the inner-product estimate unbiased. Inner products are what attention actually computes, so quality degradation stays minimal.
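A minimal end-to-end sketch of the two quantization stages follows, under stated assumptions. The codebook is fitted numerically with Lloyd-Max (1-D k-means) on coordinates of synthetic random unit vectors, the QJL sketch is a dense Gaussian matrix with as many rows as channels, and the input is assumed to be unit-norm. The helper names (`fit_codebook`, `quantize`, `qjl_encode`, `qjl_inner`), the dimension 64, and the 2-bit width are hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
import numpy as np

def random_rotation(d, rng):
    g = rng.standard_normal((d, d))
    q, r = np.linalg.qr(g)
    return q * np.sign(np.diag(r))          # sign fix gives a Haar-distributed matrix

def fit_codebook(d, bits, rng, n_vectors=2048, iters=30):
    """Lloyd-Max on coordinates of random unit vectors (synthetic samples only)."""
    g = rng.standard_normal((n_vectors, d))
    samples = (g / np.linalg.norm(g, axis=1, keepdims=True)).ravel()
    levels = 2 ** bits
    centers = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (centers[:-1] + centers[1:]) / 2
        idx = np.searchsorted(edges, samples)
        centers = np.array([samples[idx == k].mean() for k in range(levels)])
    return centers

def quantize(x, rot, centers):
    """Rotate a unit vector and snap each coordinate to its nearest center."""
    edges = (centers[:-1] + centers[1:]) / 2
    return np.searchsorted(edges, rot @ x)  # integer codes, one per channel

def dequantize(codes, rot, centers):
    return rot.T @ centers[codes]           # undo the rotation

def qjl_encode(residual, sketch):
    """Keep one sign bit per sketch row, plus the residual norm."""
    return np.sign(sketch @ residual), np.linalg.norm(residual)

def qjl_inner(query, signs, norm, sketch):
    """Estimate <query, residual> from sign bits; unbiased over the sketch draw."""
    m = sketch.shape[0]
    return np.sqrt(np.pi / 2) * norm / m * np.dot(sketch @ query, signs)

rng = np.random.default_rng(0)
d, bits = 64, 2
rot = random_rotation(d, rng)
centers = fit_codebook(d, bits, rng)        # depends only on d and bits

key = rng.standard_normal(d)
key /= np.linalg.norm(key)
query = rng.standard_normal(d)
query /= np.linalg.norm(query)

codes = quantize(key, rot, centers)
key_hat = dequantize(codes, rot, centers)
residual = key - key_hat

sketch = rng.standard_normal((d, d))        # m = d rows: 1 extra bit per channel
signs, norm = qjl_encode(residual, sketch)
estimate = query @ key_hat + qjl_inner(query, signs, norm, sketch)

print("true inner product:", query @ key)
print("scalar stage only: ", query @ key_hat)
print("with QJL residual: ", estimate)
```

A single estimate is still noisy. The property being illustrated is that, averaged over draws of the sketch matrix, the QJL term equals the true residual inner product, so the combined estimate has no systematic bias.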
Key results:
- 3.5 bits/channel: absolute quality neutrality on downstream benchmarks
- 2.5 bits/channel: marginal degradation, 6x memory reduction
- Online: no offline calibration pass needed. The rotation is input-agnostic and the post-rotation distribution is known in advance, so the quantizer does not depend on the data.
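The memory figures above follow from simple arithmetic once a baseline precision is fixed. The sketch below assumes a 16-bit baseline cache and ignores any per-vector metadata such as stored norms; the function `kv_cache_bytes` and the model shape are hypothetical, picked only to make the numbers concrete.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_channel):
    """Bytes for the keys and values of one sequence (no metadata overhead)."""
    channels = 2 * layers * kv_heads * head_dim * seq_len   # 2 = keys + values
    return channels * bits_per_channel / 8

# Hypothetical model shape, not taken from the paper.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32_768)

baseline = kv_cache_bytes(**shape, bits_per_channel=16)     # assumed 16-bit cache
for bits in (3.5, 2.5):
    compressed = kv_cache_bytes(**shape, bits_per_channel=bits)
    print(f"{bits} bits/channel: {compressed / 2**30:.3f} GiB, "
          f"{baseline / compressed:.2f}x smaller than {baseline / 2**30:.1f} GiB")
```

Under these assumptions 16 / 2.5 = 6.4, consistent with the 6x figure, and 3.5 bits per channel gives roughly 4.6x. The `kv_heads` parameter also shows why GQA and MQA models start from a smaller cache before any quantization is applied.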
Relation to Prior Wiki Pages
Extends KV cache concept page: Previous entries (KV Packet 04-17, LongAct 04-18) focused on reuse and gradient targeting of cached entries. TurboQuant attacks the cache from a different angle: it shrinks the entries themselves. It should compose with KV Packet, since compressed cache packets could in principle be reused.
Pairs with PrfaaS (04-22): PrfaaS achieves KV volume reduction by using hybrid-attention models. TurboQuant achieves it by compression. Different levels of the stack; both address the same throughput bottleneck.
Extends knowledge distillation / compression lineage: The quantization approach is post-training and training-free, so it sits closer to inference-time compression than to distillation.
Open Questions
- How well do community vLLM integrations hold up in production use? The gap between ICLR-quality benchmarks and production edge cases (very long sequences, domain-specific vocabularies) is unknown.
- Do the 2.5-bit results hold across model architectures with different attention patterns (GQA, MQA vs MHA)?
- How does TurboQuant compose with speculative decoding? If the draft model's KV cache is compressed but the target's is not, does the quality of acceptance decisions degrade?