KV Cache
The KV cache (Key-Value cache) stores the key and value tensors that the attention mechanism computed for tokens already processed. Their keys and values therefore do not need to be recomputed at every new generation step, which is critical for making autoregressive decoding fast.
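The mechanism can be illustrated with a minimal single-head sketch. It assumes NumPy, random stand-in embeddings, and a hypothetical `KVCache` helper; none of these names come from a specific serving stack. The point it shows is that decoding with a cache gives the same attention output as recomputing every key and value for the whole prefix at each step.

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query over all keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])       # one score per cached token
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ V

class KVCache:
    """Append-only store of per-token key and value vectors (hypothetical helper)."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((5, d))            # stand-in token embeddings

# Cached decoding: each step projects only the newest token.
cache = KVCache(d)
cached_out = []
for x in tokens:
    cache.append(x @ Wk, x @ Wv)
    cached_out.append(attend(x @ Wq, cache.K, cache.V))

# Reference: recompute K and V for the whole prefix at every step.
recomputed_out = []
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    recomputed_out.append(attend(prefix[-1] @ Wq, prefix @ Wk, prefix @ Wv))

same = np.allclose(cached_out, recomputed_out)
print("cached decoding matches full recomputation:", same)
```

The cached loop does one key projection and one value projection per step, while the reference loop redoes them for every prefix token, which is the cost the cache removes.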
Current State (as of 2026-05-19)
Latest additions (2026-05-19): CompactAttention, EndPrompt, LongLive-2.0 NVFP4 KV cache. Three KV-relevant entries today. CompactAttention (summary) attacks the chunked-prefill regime where prior block-sparse machinery loses efficiency because the chunk size caps the Q-length. The structural move: treat 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, then convert them via Q-block union and intra-group GQA union into minimal block tables under paged execution. 2.72x attention speedup at 128K context on LLaMA-3.1-8B-Instruct with near-dense RULER accuracy. EndPrompt (summary) extends an LLM's context window using only short training sequences via a two-segment construction (original short context as segment 1, brief terminal prompt as segment 2 with positional indices placed near the target length). Beats LongLoRA and full-length fine-tuning on RULER and LongBench at substantially lower compute, with a RoPE-and-Bernstein-inequality smoothness argument for why sparse positional supervision suffices. LongLive-2.0 (summary) is the wiki's first end-to-end NVFP4 video training and inference stack. Quantizes the KV cache to NVFP4 on Blackwell for memory savings; on non-Blackwell, SP inference with quantized KV cache lowers SP inter-GPU communication. 2.15x training speedup, 1.84x inference speedup, 45.7 FPS at 5B.
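The mask-to-block-table conversion described for CompactAttention can be sketched as two unions. This is an illustration under stated assumptions, not the paper's kernel: the mask layout, the `block_tables` function name, and the convention that query heads of one GQA group are contiguous are all hypothetical.

```python
import numpy as np

def block_tables(mask, num_kv_heads):
    """Collapse a block-sparse mask into one sorted KV block list per KV head.

    mask: bool array of shape (num_q_heads, num_q_blocks, num_kv_blocks), where
    mask[h, i, j] means query head h, query block i selected KV block j.
    Assumes query heads of the same GQA group are stored contiguously.
    """
    num_q_heads, _, num_kv_blocks = mask.shape
    group = num_q_heads // num_kv_heads
    # Q-block union: a KV block is needed if any query block in the chunk selected it.
    per_head = mask.any(axis=1)
    # Intra-group GQA union: query heads sharing a KV head share one block list.
    per_group = per_head.reshape(num_kv_heads, group, num_kv_blocks).any(axis=1)
    return [np.flatnonzero(row).tolist() for row in per_group]

# Toy chunk: 4 query heads, 2 KV heads, 2 query blocks, 6 KV blocks.
mask = np.zeros((4, 2, 6), dtype=bool)
mask[0, 0, [0, 1]] = True
mask[0, 1, [0, 3]] = True
mask[1, 0, [0]] = True
mask[1, 1, [5]] = True
mask[2, 0, [0, 2]] = True
mask[3, 1, [2, 4]] = True

tables = block_tables(mask, num_kv_heads=2)
print(tables)  # [[0, 1, 3, 5], [0, 2, 4]]
```

Each resulting list plays the role of a block table in a paged layout: the attention kernel only visits the listed KV blocks for that KV head.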
Prior State (as of 2026-05-16)
Addition (2026-05-16): Lighthouse Attention pre-training wrapper. Nous Research ships a training-only, kernel-decoupled wrapper around ordinary FlashAttention for causal-transformer pre-training at extreme context length. Queries, keys, and values are pooled symmetrically into a multi-resolution pyramid; a gradient-free top-k cascade selects a hierarchical dense sub-sequence; a sorting pass keeps left-to-right causality. The wrapper is removed in a short recovery phase, leaving a standard dense-attention model. Claimed 1.4-1.7x wall-clock speedup at 98K context and ~17x forward+backward at 512K on a single B200. The structural novelty: pre-training attention selection is now a programmable substrate, the same framing the wiki has applied to the inference-time cache. → summary
Prior State (as of 2026-05-15)
Latest addition (2026-05-15): Forcing-KV for autoregressive video diffusion + async continuous batching. Two pieces of the inference stack land the same day. Forcing-KV finds that attention heads in AR video diffusion models (Self Forcing family) cluster into two stable functional roles across samples and denoising steps: static heads (chunk transitions, intra-frame fidelity) tolerate structured pruning; dynamic heads (inter-frame motion, temporal consistency) require segment-similarity-based pruning. Role-conditioned hybrid compression delivers 29+ fps on a single H200 with a 30% memory reduction, and a 1.35-1.50x speedup at 480P that scales to 2.82x at 1080P. The cache thread is now policy-aware in three forms: learned eviction (Make Each Token Count), shared coordination (Orthrus), and head-role compression (Forcing-KV). → summary. The HuggingFace asynchronous continuous batching post is the scheduling-layer complement: three CUDA streams (H2D, compute, D2H), CUDA events for handoff, and two parallel buffer slots A/B so the CPU prepares batch N+1 while the GPU computes batch N. GPU utilization rises from 76.0% to 99.4%, a 22% generation speedup with no kernel or model changes. → summary
Prior State (as of 2026-05-14)
2026-05-14: Orthrus dual-view diffusion on shared cache. Orthrus runs an autoregressive head and a diffusion head on the same frozen LLM, both attending to a single shared KV cache. The AR head executes pre-fill and populates the cache at full fidelity; the diffusion head reads from that same cache to draft tokens in parallel; an exact-consensus mechanism between the two views makes the output bit-identical to the AR baseline. Up to 7.8x speedup with O(1) memory cache overhead. The structural novelty: the cache is the shared coordination object, not a verification ledger. Composes naturally with Make-Each-Token-Count: the same cache is both selectively retained and read in parallel. → summary
Companion (2026-05-14): MMProLong long-context VLM recipe. First long-context VLM training recipe in the wiki. Three findings: long-document VQA beats OCR transcription; balanced sequence-length distribution beats target-length-focused; retrieval is the long-context bottleneck. 5B-token long-context continued pretrain extends Qwen2.5-VL-7B from 32K to 128K with generalization to 256K and 512K. The training-side complement to Make-Each-Token-Count's inference-side claim. Both say: long context rewards balance and structure, not volume. → summary
2026-05-13 additions: δ-mem and FocuSFT. Two papers attack long-context inefficiency from orthogonal angles. δ-mem augments a frozen full-attention backbone with a compact 8x8 associative-memory state updated by the delta rule; its readout produces low-rank corrections to the backbone's attention computation. 1.10x average gain over the frozen backbone, 1.31x on MemoryAgentBench, 1.20x on LoCoMo. Composes with Make Each Token Count: aggressive eviction at the cache, associative signal retained in the small online state. FocuSFT identifies attention sinks as a training-side phenomenon (not just inference), shows that standard long-context SFT lets positional biases starve content tokens of attention budget, and fixes it with bilevel optimization (inner loop sharpens attention via fast-weights, outer loop runs SFT conditioned on sharpened representation; bidirectional context with causal response masking removes the sink-creating asymmetry). Up to +14 points on BABILong, 529x sink-mass reduction. Make Each Token Count and FocuSFT together bracket attention dilution: training-side cause + inference-side fix. → δ-mem summary · FocuSFT summary
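The delta rule that δ-mem builds on can be sketched with a small associative memory. This is the textbook delta-rule update on an 8x8 state, assuming NumPy and unit-normalized keys. The function name `delta_update` is hypothetical, and the sketch does not cover how δ-mem turns the readout into low-rank attention corrections.

```python
import numpy as np

def delta_update(S, k, v, beta=1.0):
    """One delta-rule write: move the memory's prediction for key k toward value v."""
    k = k / np.linalg.norm(k)           # assumption: keys are unit-normalized
    prediction = S @ k                  # what the memory currently returns for k
    return S + beta * np.outer(v - prediction, k)

rng = np.random.default_rng(0)
S = np.zeros((8, 8))                    # compact associative-memory state
k1 = rng.standard_normal(8)
v1 = rng.standard_normal(8)
v2 = rng.standard_normal(8)
k1_unit = k1 / np.linalg.norm(k1)

S = delta_update(S, k1, v1)
first_read = S @ k1_unit                # recalls v1

S = delta_update(S, k1, v2)             # same key, new value
second_read = S @ k1_unit               # the old association is overwritten, not summed

print(np.allclose(first_read, v1), np.allclose(second_read, v2))
```

With `beta=1.0` and a unit key, a write replaces the stored value for that key exactly. This overwrite behavior is what separates the delta rule from a purely additive outer-product memory.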
Prior addition (2026-05-12): Make Each Token Count. A learned, globally calibrated KV-eviction policy that can surpass the full cache, not just approximate it. The framing flip: in long contexts, the full cache is not the ceiling because irrelevant tokens dilute attention. Lightweight retention gates score each cached entry, a shared final scoring projection calibrates scores across every layer and head, and a single global memory budget lets tokens from different layers, heads, and modalities compete for cache capacity. Theoretical analysis shows that preferentially retaining useful tokens reduces attention dilution. This is the language-model analogue of Stream-T1's content-aware video KV eviction (2026-05-07): both treat eviction as a quality intervention, not a compression tradeoff. Composes with MISA (head-axis routing) and TurboQuant (low-bit quantization). → summary
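The difference between a single global budget and fixed per-head budgets can be sketched as follows. The retention scores are taken as given and assumed to be already comparable across heads. In the paper that calibration comes from learned gates and a shared scoring projection, which this sketch does not model. The function names and toy scores are hypothetical.

```python
import heapq

def keep_global(scores, budget):
    """Keep the `budget` highest-scoring entries across all layers and heads."""
    top = heapq.nlargest(budget, scores.items(), key=lambda item: item[1])
    return {key for key, _ in top}

def keep_per_head(scores, budget_per_head):
    """Baseline: every (layer, head) keeps its own top entries, regardless of value."""
    by_head = {}
    for (layer, head, token), score in scores.items():
        by_head.setdefault((layer, head), []).append((score, token))
    kept = set()
    for (layer, head), entries in by_head.items():
        for _, token in heapq.nlargest(budget_per_head, entries):
            kept.add((layer, head, token))
    return kept

# Toy scores keyed by (layer, head, token): head 0 holds useful tokens, head 1 does not.
scores = {
    (0, 0, 0): 0.9, (0, 0, 1): 0.8, (0, 0, 2): 0.7, (0, 0, 3): 0.6,
    (0, 1, 0): 0.3, (0, 1, 1): 0.2, (0, 1, 2): 0.1, (0, 1, 3): 0.05,
}

global_kept = keep_global(scores, budget=4)
local_kept = keep_per_head(scores, budget_per_head=2)
print(sorted(global_kept))
print(sorted(local_kept))
```

With the same total budget of four entries, the global policy spends all of it on head 0, while the per-head baseline is forced to keep two low-scoring entries from head 1.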
Prior additions (2026-05-11): Two papers attack long-context inference from different angles on the same day. MISA introduces a Mixture of Indexer Sparse Attention: it treats the 64 query heads inside DeepSeek Sparse Attention's indexer as an MoE pool and routes a small active subset (h=8) per query via cheap block-level statistics. It is drop-in, needs no extra training, recovers 92 percent of the tokens DSA would have selected, and delivers a 3.82x kernel speedup over the original DSA indexer kernel on H200. The new axis here is the head axis: prior sparse-attention work routed on tokens, MISA routes on indexer heads. UniPrefill ships as a vLLM operator with extended continuous-batching scheduling: block-wise dynamic sparsification at the token level that is architecture-agnostic (works on full attention, linear-and-full hybrids, sliding-window hybrids). Up to 2.1x TTFT speedup, with the speedup growing as concurrent request count grows, which is the signature of a serving-system optimization. A third paper, MDN: Momentum DeltaNet, is a substrate-level update inside linear attention: it parallelizes stepwise momentum updates via geometric reordering, then uses spectral analysis of the resulting second-order recurrence to constrain gating for stability. Comparable training throughput to Mamba2 and KDA at 400M and 1.3B, beats Transformer / Mamba2 / GDN across downstream tasks. The recurrent rule is now a research surface, not a fixed substrate.
Prior State (as of 2026-05-07)
KV caching is standard in all production LLM serving. Active research is focused on four problems: (1) making caches reusable across contexts without recomputation, (2) compressing the cache to reduce memory footprint, (3) smarter eviction policies when the cache is full, and (4) extending cache-based acceleration patterns (like speculative decoding) to non-text modalities. The parallel daily digest (04-22) introduced two major KV-focused papers, TurboQuant (ultra-low-bit compression) and PrfaaS (cross-datacenter disaggregation via hybrid attention), signaling that the KV cache is now the primary optimization target in production serving. MotionCache (2026-05-05) extends the same iteration-as-optimization-unit principle to autoregressive video generation, using inter-frame motion deltas to decide which pixels need full denoising. Stream-T1 (2026-05-07) introduces the first content-aware KV eviction policy in the wiki: for streaming video diffusion, KV slots are routed through reward-feedback pathways instead of recency-based eviction. LIVEditor / ISA (2026-05-07) routes attention by Query sharpness, sending high-error queries to full attention and low-error queries to a 0-th order Taylor sparse path, achieving ~60% attention-module latency reduction on video editing. The pattern is now visible across six substrates: text KV reuse (KV Packet), KV quantization (TurboQuant), KV transport (PrfaaS), video denoising reuse (MotionCache), content-aware video KV eviction (Stream-T1), and Query-sharpness sparse attention (ISA). The shared principle is that the iteration unit has heterogeneous information density and should be allocated proportionally.
Economic context (SemiAnalysis 05-01): the unit economics of frontier model labs now depend on >90% prompt-cache hit rates. Anthropic's blended price for Opus 4.7 on agentic workloads is ~$0.99/MTok (vs $5/$25 sticker) because cached input tokens dominate. Cache compression / reuse research is now financial-impact-driven, not just academic.
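How strongly the blended price depends on the cache hit rate can be sketched with simple arithmetic. Only the $5/$25 sticker prices come from the text above. The cached-read price, the output share of tokens, and the omission of any cache-write premium are placeholder assumptions, so the printed numbers are illustrative and are not a reconstruction of the SemiAnalysis figure.

```python
def blended_price(hit_rate, input_price, cached_price, output_price, output_share):
    """Average price per million tokens for a mix of cached input, fresh input, and output.

    hit_rate:     fraction of input tokens served from the prompt cache
    output_share: fraction of all tokens that are output tokens
    """
    input_cost = hit_rate * cached_price + (1.0 - hit_rate) * input_price
    return (1.0 - output_share) * input_cost + output_share * output_price

# Sticker prices from the text; cached_price and output_share are hypothetical placeholders.
for hit_rate in (0.0, 0.5, 0.9, 0.95):
    price = blended_price(
        hit_rate,
        input_price=5.0,
        cached_price=0.5,      # placeholder, not a quoted price
        output_price=25.0,
        output_share=0.01,     # placeholder for an input-heavy agentic workload
    )
    print(f"hit rate {hit_rate:.2f}: ${price:.2f} per MTok")
```

The useful observation is the slope: when input tokens dominate, each additional point of hit rate moves the blended price far more than any change to the output price would.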
Key Papers
Make Each Token Count (2026-05-12) — Learned, globally calibrated KV-cache eviction with retention gates per cached entry, a shared final scoring projection that calibrates scores across all layers/heads, and a single unified memory budget across layers, heads, and modalities. The theoretical claim is that the full cache is not optimal in long contexts because irrelevant tokens dilute attention away from useful evidence; selective eviction reduces dilution. Matches or surpasses full-cache inference across long-context language, vision-language reasoning, and multi-turn dialogue benchmarks. First paper in the wiki to formally claim eviction improves quality, not just preserves it. → summary
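The dilution argument can be made concrete with a toy softmax calculation. The setup is an assumption for illustration and is not taken from the paper's analysis: one relevant token with a fixed logit competes against many irrelevant tokens that all share a lower logit.

```python
import math

def relevant_mass(relevant_logit, num_irrelevant, irrelevant_logit=0.0):
    """Softmax weight on one relevant token among `num_irrelevant` distractors."""
    relevant = math.exp(relevant_logit)
    distractors = num_irrelevant * math.exp(irrelevant_logit)
    return relevant / (relevant + distractors)

full_cache = relevant_mass(5.0, num_irrelevant=100_000)   # nothing evicted
pruned_cache = relevant_mass(5.0, num_irrelevant=1_000)   # 99% of distractors evicted

print(f"full cache:   {full_cache:.4f}")
print(f"pruned cache: {pruned_cache:.4f}")
```

The relevant token's logit never changes, yet its share of attention rises once distractors are removed. In this toy setting that is the sense in which eviction can improve quality rather than only preserve it.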
MISA (2026-05-11) — Mixture of Indexer Sparse Attention. Treats the 64 indexer query heads of DeepSeek Sparse Attention as an MoE pool; a cheap block-level router picks h=8 active heads per query. Reduces per-query indexer cost from O(H^I * L) to O(h * L + H^I * M). Recovers 92 percent of DSA's selected tokens at 8x fewer active indexer heads, 3.82x kernel speedup on H200. Drop-in, no training. The first paper in the wiki to route sparse-attention on the head axis rather than the token axis. → summary
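The two cost expressions can be compared with a quick operation count. The sketch assumes that L is the context length in tokens and M is the number of KV blocks seen by the router. The block size of 128 is a hypothetical value chosen for the example. These are asymptotic counts, not measured kernel times.

```python
def dense_indexer_cost(num_heads, seq_len):
    """O(H^I * L): every indexer head scores every token."""
    return num_heads * seq_len

def routed_indexer_cost(num_heads, active_heads, seq_len, num_blocks):
    """O(h * L + H^I * M): all heads score blocks, only active heads score tokens."""
    return active_heads * seq_len + num_heads * num_blocks

seq_len = 131_072                  # 128K-token context
block_size = 128                   # hypothetical router block size
num_blocks = seq_len // block_size

dense = dense_indexer_cost(num_heads=64, seq_len=seq_len)
routed = routed_indexer_cost(num_heads=64, active_heads=8,
                             seq_len=seq_len, num_blocks=num_blocks)

print(dense, routed, round(dense / routed, 2))  # 8388608 1114112 7.53
```

Under these assumptions the count drops by a little less than the 8x head reduction, because the block-level routing term is paid by all 64 heads.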
UniPrefill (2026-05-11) — Architecture-agnostic prefill accelerator via block-wise dynamic sparsification, implemented as a continuous-batching operator inside vLLM with native prefill-decode co-processing and tensor parallel. Up to 2.1x TTFT speedup, speedup grows with concurrent request count (a serving-system signature, not a single-request one). Works on hybrid architectures where prior sparse-attention prefill methods degrade. → summary
MDN: Momentum DeltaNet (2026-05-11) — Parallelizes stepwise momentum for delta linear attention via geometric reordering of update coefficients. Spectral analysis of the second-order recurrence constrains gating for stability. Triton kernel matches Mamba2 / KDA training throughput. At 400M and 1.3B, beats Transformer / Mamba2 / GDN on broad downstream evals. First substrate-level update to the linear-attention recurrent rule that the wiki has tracked. → summary
KV Packet (2026-04-17) — Eliminates recomputation-on-reuse entirely. Wraps cached documents as immutable packets with lightweight soft-token adapters (trained via self-supervised distillation) that bridge context shifts. Near-zero FLOPs, lower TTFT than all recomputation-based baselines (CacheBlend, EPIC, SAM-KV). → summary
LongAct (2026-04-18) — Identifies high-magnitude activations in Q/K vectors during long-context processing. These "saliency peaks" (same ones that trouble quantization) are the positions where attention is doing real work. LongAct restricts RL gradient updates to only those weights, yielding ~8% gain on LongBench v2 with universal compatibility across GRPO and DAPO. Bridges the KV saliency insight from quantization research into RL training. → summary
TurboQuant (2026-04-22, via parallel digest) — Google (ICLR 2026). Online vector quantization: randomly rotates input vectors to induce a concentrated Beta distribution, applies optimal scalar quantizers per coordinate, followed by a 1-bit QJL transform on the residual for an unbiased inner product quantizer. Absolute quality neutrality at 3.5 bits/channel; marginal degradation at 2.5 bits/channel; 6x+ KV cache memory reduction. Community integrations with vLLM and llama.cpp appearing despite no official implementation.
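The rotate-then-quantize idea can be sketched in a few lines. This is a simplified illustration, not TurboQuant itself: it assumes NumPy, swaps in a plain uniform min/max quantizer for the optimal per-coordinate scalar quantizers, and omits the 1-bit QJL residual step entirely. The function names and the injected outlier are hypothetical.

```python
import numpy as np

def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))      # sign fix so the rotation is uniformly distributed

def quantize(x, bits):
    """Uniform scalar quantizer over [min, max] (stand-in for the optimal quantizer)."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2**bits - 1)
    codes = np.round((x - lo) / step).astype(np.int64)
    return codes, lo, step

def dequantize(codes, lo, step):
    return lo + codes * step

rng = np.random.default_rng(0)
d, bits = 128, 4
key = rng.standard_normal(d)
key[3] = 20.0                           # one outlier channel, as seen in real K vectors
query = rng.standard_normal(d)

rot = random_rotation(d, rng)

# Without rotation the outlier stretches the quantization range for every coordinate.
plain_hat = dequantize(*quantize(key, bits))
# With rotation the outlier's energy is spread across all coordinates before quantizing.
rot_hat = rot.T @ dequantize(*quantize(rot @ key, bits))

err_plain = np.linalg.norm(plain_hat - key)
err_rot = np.linalg.norm(rot_hat - key)
print(f"reconstruction error without rotation: {err_plain:.3f}")
print(f"reconstruction error with rotation:    {err_rot:.3f}")
print(f"inner product true vs quantized: {query @ key:.3f} vs {query @ rot_hat:.3f}")
```

Because the rotation is orthogonal, inner products between rotated queries and rotated keys equal the original ones, so the only error introduced is the quantization error in the rotated space.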
PrfaaS / Prefill-as-a-Service (2026-04-22, via parallel digest) — Moonshot AI + Tsinghua. Offloads long-context prefill to standalone compute-dense clusters in separate datacenters and transfers the resulting KV cache over Ethernet. Enabled by hybrid-attention models (Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B) that mix full-attention + linear-complexity layers. MiMo-V2-Flash produces KV cache at 4.66 Gbps vs 59.93 Gbps for the dense-attention baseline (about 13x reduction). 54% higher throughput, 50% lower mean TTFT vs homogeneous baselines.
SDVG (2026-04-22) — Extends speculative decoding to continuous video generation. A 1.3B drafter proposes video blocks in 4 denoising steps; ImageReward scores per block using worst-frame aggregation; accepted blocks enter the 14B target's KV cache directly. 1.59x speedup at 98.1% quality; 2.09x at 95.7%. Training-free. → summary
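Worst-frame aggregation can be sketched as an acceptance test over per-frame reward scores. The scores, the threshold, and the function names are hypothetical; a real pipeline would obtain the scores from the reward model, and rejected blocks would be regenerated by the target model.

```python
def block_score(frame_scores):
    """Worst-frame aggregation: a block is only as good as its weakest frame."""
    return min(frame_scores)

def route_blocks(draft_blocks, threshold):
    """Split drafted blocks into accepted (go to the target's KV cache) and rejected."""
    accepted, rejected = [], []
    for index, frame_scores in enumerate(draft_blocks):
        if block_score(frame_scores) >= threshold:
            accepted.append(index)
        else:
            rejected.append(index)
    return accepted, rejected

# Hypothetical per-frame reward scores for three drafted blocks.
draft_blocks = [
    [0.90, 0.80, 0.85],
    [0.90, 0.20, 0.95],   # one bad frame; the block mean (about 0.68) would still pass
    [0.70, 0.75, 0.72],
]

accepted, rejected = route_blocks(draft_blocks, threshold=0.6)
print(accepted, rejected)  # [0, 2] [1]
```

The second block shows why the aggregation choice matters: averaging would hide a single broken frame, while the minimum rejects the block.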
MotionCache (2026-05-05) — Motion-aware caching for autoregressive video generation. Inter-frame differences identify which pixels require full denoising vs which can skip steps. Two-phase schedule: warm-up for semantic consistency, then motion-weighted cache reuse with dynamic update frequencies. 6.28x speedup on SkyReels-V2 (1% VBench drop), 1.64x on MAGI-1 (0.01% drop). Training-free, code public. The video-AR analogue of selective KV-cache reuse: iteration count is the optimization unit. → summary
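The core selection step, using inter-frame differences to decide where cached results can be reused, can be sketched as below. It assumes NumPy, a single-channel toy frame, and a hypothetical fixed threshold. The warm-up phase and the dynamic update frequencies are not modeled.

```python
import numpy as np

def motion_mask(prev_frame, cur_frame, threshold):
    """True where the inter-frame change is large enough to need full denoising."""
    return np.abs(cur_frame - prev_frame) > threshold

def merge(mask, recomputed, cached):
    """Take freshly denoised values where there is motion, cached values elsewhere."""
    return np.where(mask, recomputed, cached)

prev_frame = np.zeros((4, 4))
cur_frame = prev_frame.copy()
cur_frame[1:3, 1:3] = 1.0               # a small moving region in the center

mask = motion_mask(prev_frame, cur_frame, threshold=0.5)
reuse_fraction = 1.0 - mask.mean()      # share of pixels that can skip full denoising

cached = np.full((4, 4), -1.0)          # stand-in for cached denoising output
recomputed = np.full((4, 4), 2.0)       # stand-in for freshly denoised output
merged = merge(mask, recomputed, cached)

print(int(mask.sum()), reuse_fraction)  # 4 0.75
```

In this toy frame three quarters of the pixels are static, so only the moving center region pays for full denoising.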
Stream-T1 (2026-05-07) — Test-time scaling framework for streaming video generation. Three components: Stream-Scaled Noise Propagation (reuse high-quality previous-chunk noise as the prior for the next chunk), Stream-Scaled Reward Pruning (combine short-term spatial assessment with sliding-window long-term coherence), and Stream-Scaled Memory Sinking (route KV-cache evictions through reward-feedback-guided update pathways). The first content-aware KV eviction policy tracked by the wiki: not "which token is oldest" but "which token still anchors downstream quality." → summary
LIVEditor / ISA (2026-05-07) — In-context Sparse Attention for ICL video editing. Two stages: context pre-selection (prune low-saliency context tokens) plus dynamic Query routing (route high-error queries to full attention, low-error to a 0-th order Taylor sparse path). Empirically validates the claim that Query sharpness correlates with attention approximation error. ~60% attention-module latency reduction, near-lossless on EditVerseBench / IVE-Bench / VIE-Bench. The first sharpness-routed sparse attention in the wiki, structurally adjacent to language-side speculative decoding (route by difficulty). → summary
Key Concepts
- Context dependency: KV states computed for a document depend on the tokens that preceded it when it was encoded. Reusing them in a new context produces an attention distribution mismatch, hence the need to recompute.
- TTFT (Time-to-First-Token): the latency before the model outputs the first token. KV cache reuse directly impacts this.
- Soft-token adapters: trainable lightweight token representations that can modify how a cached KV state interacts with a new context, without recomputing the underlying states.
- Cache eviction: when the KV cache fills up, old entries must be evicted. Policy choices (LRU, saliency-based, etc.) affect quality and memory efficiency.
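The effect of the eviction policy can be sketched by comparing two simple rules on the same cache. The entries, saliency values, and function names are hypothetical. The recency rule here is a sliding window used as a stand-in for recency-style policies such as LRU, and the saliency values stand in for something like accumulated attention weight.

```python
def evict_by_recency(entries, capacity):
    """Keep the most recent `capacity` entries (sliding-window style)."""
    return sorted(entries, key=lambda entry: entry[0])[-capacity:]

def evict_by_saliency(entries, capacity):
    """Keep the `capacity` entries with the highest saliency, in position order."""
    kept = sorted(entries, key=lambda entry: entry[1], reverse=True)[:capacity]
    return sorted(kept, key=lambda entry: entry[0])

# (position, saliency) pairs; position 0 is an early token that still draws attention.
entries = [(0, 0.90), (1, 0.10), (2, 0.05), (3, 0.40), (4, 0.30)]

recent = evict_by_recency(entries, capacity=3)
salient = evict_by_saliency(entries, capacity=3)

print([position for position, _ in recent])   # [2, 3, 4]
print([position for position, _ in salient])  # [0, 3, 4]
```

Both policies use the same memory, but the recency rule drops the early high-saliency token while keeping a low-saliency one, which is the quality gap that saliency-based policies target.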