inference-efficiency · 2026-05-05 · Tier 1

MotionCache: Motion-Aware Caching for Efficient Autoregressive Video Generation

Source: HuggingFace Daily Papers, 2026-05-05
Paper: arXiv:2605.01725 · HF page
Raw: raw/huggingface/2026-05-05-motion-aware-caching-efficient-autoregressive-video-generation.md
Tier: 1, inference efficiency, KV-cache analogue for video

TL;DR

MotionCache is a training-free framework that accelerates autoregressive video diffusion by reusing denoising computation across temporally similar regions. It uses inter-frame differences to decide which pixels need full denoising and which can skip steps. The schedule has two phases: a warm-up that establishes semantic consistency, then motion-weighted cache reuse with dynamic update frequencies. Reported results: a 6.28x speedup on SkyReels-V2 with a 1% VBench drop, and a 1.64x speedup on MAGI-1 with a 0.01% drop. Code is public.
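The two-phase schedule can be sketched in a few lines of NumPy. This is a minimal illustration, not the released MotionCache code: the function names `motion_map` and `recompute_mask`, the patch size, `warmup_steps`, `threshold`, `max_interval`, and the linear mapping from motion to refresh interval are all assumptions chosen for readability. The paper may measure motion on latents rather than pixels.

```python
import numpy as np

def motion_map(prev_frame, curr_frame, patch=8):
    """Mean absolute inter-frame difference per patch.

    Accepts (H, W) or (H, W, C) arrays; returns shape (H // patch, W // patch).
    """
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    if diff.ndim == 3:
        diff = diff.mean(axis=-1)          # collapse channels
    h, w = diff.shape
    h, w = h - h % patch, w - w % patch    # drop ragged border rows/cols
    blocks = diff[:h, :w].reshape(h // patch, patch, w // patch, patch)
    return blocks.mean(axis=(1, 3))

def recompute_mask(motion, step, warmup_steps=5, threshold=0.05, max_interval=4):
    """Boolean mask: True where a patch gets full denoising at this step.

    Phase 1 (warm-up): every patch is recomputed.
    Phase 2: each patch is refreshed every `interval` steps, where the
    interval shrinks from `max_interval` (static) to 1 (motion >= threshold).
    This linear mapping is an assumed stand-in for the paper's weighting.
    """
    if step < warmup_steps:
        return np.ones_like(motion, dtype=bool)
    scaled = np.clip(motion / threshold, 0.0, 1.0)
    interval = np.rint(max_interval - scaled * (max_interval - 1)).astype(int)
    return (step - warmup_steps) % interval == 0

# Usage: only the top-left patch moves between the two frames.
prev = np.zeros((16, 16))
curr = prev.copy()
curr[:8, :8] = 1.0
m = motion_map(prev, curr, patch=8)
mask = recompute_mask(m, step=6)   # one step past the warm-up boundary
```

Patches where the mask is False would reuse their cached denoiser output for that step. Under this sketch, a hard scene cut pushes every patch above the threshold, so nothing is skipped and the speedup disappears for that frame.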

Why it matters

Autoregressive video generation shares an iteration-cost bottleneck with long-context LLM inference: total cost scales with the number of sequential steps, here denoising steps rather than decoded tokens. MotionCache treats the per-pixel iteration count as a variable, just as KV-cache compression treats per-token attention recomputation as a variable. The mechanism is a direct video-domain analogue of selective KV-cache reuse: identify the parts of the workload where prior computation is reusable, and skip iterations there. The 6.28x number on SkyReels-V2 is striking because iteration cost is the main reason video AR models have been considered impractical for production.
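To put the reported speedups in terms of work avoided: if wall-clock time is assumed proportional to denoising compute and caching overhead is ignored (both simplifications, not claims from the paper), a speedup of s implies that a fraction 1 - 1/s of the baseline compute was skipped. The helper name below is hypothetical.

```python
def implied_skipped_fraction(speedup: float) -> float:
    """Fraction of baseline compute avoided, assuming time is proportional
    to compute and cache bookkeeping is free (an idealization)."""
    if speedup < 1.0:
        raise ValueError("speedup must be >= 1")
    return 1.0 - 1.0 / speedup

for name, s in [("SkyReels-V2", 6.28), ("MAGI-1", 1.64)]:
    frac = implied_skipped_fraction(s)
    print(f"{name}: {s:.2f}x -> {frac:.1%} of baseline compute avoided")
# SkyReels-V2: 6.28x -> 84.1% of baseline compute avoided
# MAGI-1: 1.64x -> 39.0% of baseline compute avoided
```

Read this way, the SkyReels-V2 result corresponds to skipping roughly five of every six units of denoising work, which is why the small VBench drop is notable.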

Connections

  • KV cache (concept page) — MotionCache is the video-AR analogue of KV-cache reuse. Inter-frame differences play the role of attention saliency: the regions where motion is small are the regions where prior computation is reusable. This is a fourth pattern in the wiki's caching thread (KV Packet 04-17, TurboQuant 04-22, PrfaaS 04-22, MotionCache 05-05). Four papers, four modalities of cached computation: text KV reuse, KV quantization, KV transport across datacenters, video denoising reuse. The shared principle is that the iteration unit (token, KV row, KV transfer, denoising step) has heterogeneous information density and should be allocated proportionally.
  • Speculative decoding (concept page) — MotionCache is structurally similar to SDVG (04-22), which extended speculative decoding to video by accepting whole blocks based on quality scores. Both reduce iteration count while preserving sampled quality. SDVG works at the block level via a small drafter; MotionCache works at the pixel level via motion delta thresholds. They could compose (drafter proposes blocks, motion-cache reuses denoising within each block).
  • Worth Watching resolution — the 04-22 KV-cache cluster (PrfaaS, TurboQuant, SDVG) implicitly predicted that "the iteration unit is the optimization unit" generalizes beyond text. MotionCache is the explicit confirmation.

Research angle

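  1. Joint MotionCache + SDVG composition. SDVG handles inter-block speedup, MotionCache handles intra-block. If the two gains multiply, SDVG would need to add only about 1.6x on top of MotionCache's 6.28x for the joint speedup to pass 10x on long videos. Whether quality holds under the combination is untested.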
  2. Salience-driven cache eviction transfer. The same logic that picks "which pixels can skip" could pick "which KV rows can be evicted." Common KV eviction policies rank rows by recency or by accumulated attention score; motion-style temporal-similarity scoring appears largely unexplored at the KV level.
  3. Long-video tail behavior. The benchmarks here are short. Whether the cache strategy degrades gracefully at minute-scale generation, where motion accumulates and warm-up no longer dominates, is the load-bearing test for production deployment.
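One way to read "temporal-similarity scoring" at the KV level, offered purely as a hypothetical sketch: treat a cached key that is nearly identical to its predecessor the way MotionCache treats a static patch, and evict it first. The functions `redundancy_scores` and `evict_most_redundant` are illustrative names, cosine similarity is an assumed choice of metric, and nothing here comes from the paper.

```python
import numpy as np

def redundancy_scores(keys):
    """Cosine similarity of each key row to the row before it.

    Row 0 has no predecessor and is assigned -1.0 so it is never
    considered redundant.
    """
    norms = np.linalg.norm(keys, axis=1, keepdims=True)
    unit = keys / np.maximum(norms, 1e-12)
    sims = np.sum(unit[1:] * unit[:-1], axis=1)
    return np.concatenate(([-1.0], sims))

def evict_most_redundant(keys, values, budget):
    """Keep `budget` rows, dropping those most similar to their predecessor.

    Returns the kept keys, kept values, and the kept row indices in
    original order. Scoring is one-shot: similarities are not recomputed
    after each eviction, which a real policy would likely need.
    """
    n = keys.shape[0]
    if budget >= n:
        return keys, values, np.arange(n)
    scores = redundancy_scores(keys)
    keep = np.sort(np.argsort(scores, kind="stable")[:budget])
    return keys[keep], values[keep], keep

# Usage: row 1 is almost a copy of row 0, so it is evicted first.
keys = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]])
values = np.arange(8, dtype=np.float32).reshape(4, 2)
k, v, kept = evict_most_redundant(keys, values, budget=3)
```

The design choice to compare only adjacent rows mirrors inter-frame differencing; a practical policy would also have to account for attention sinks and for how eviction changes which rows are adjacent.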

Open questions

  • Does MotionCache generalize to other AR video paradigms (Genie 3, NVA, MAGI families) without re-tuning the warm-up phase?
  • What is the interaction with classifier-free guidance? CFG doubles per-step cost; MotionCache reduces step count. The composition has not been measured.
  • Is the motion delta itself a good proxy for "denoising signal," or does it correlate poorly in scenes with rapid cuts and high local motion (action sequences)? Quality drops on out-of-distribution video clips are not reported.