LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

arXiv: 2605.18739 · HF: paper page · Tier: 1 (FP4 quantization, video generation, Blackwell-native serving)

TL;DR

LongLive-2.0 is the first end-to-end NVFP4 training and inference system for long video generation.

Training: sequence-parallel (SP) autoregressive (AR) training is instantiated as Balanced SP, which pairs clean-history and noisy-target temporal chunks on each rank and uses SP-aware chunked VAE encoding. The teacher-forcing layout is co-designed with SP execution to avoid imbalance. Combined with NVFP4 precision, this reduces GPU memory and accelerates GEMM as video length grows.

Inference: on Blackwell, W4A4 NVFP4 is combined with an NVFP4-quantized KV cache and asynchronous streaming VAE decoding. Non-Blackwell GPUs run SP inference to match Blackwell speed, and the quantized KV cache lowers inter-GPU communication under SP.

Results: up to 2.15x training speedup and 1.84x inference speedup. LongLive-2.0-5B reaches 45.7 FPS while holding benchmark performance. Unlike prior Self-Forcing series methods, which need ODE initialization plus distribution matching distillation, LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive AR diffusion model, with an optional standalone LoRA that converts it to real-time 4-to-2-step generation.

Key findings

The bottleneck for long video generation is the combination of memory cost (long sequences) and GEMM speed (the proportion of compute spent in GEMM rises with video length).
Balanced SP is the key training-side move: pair clean-history and noisy-target temporal chunks on each rank so that the natural teacher-forcing mask becomes SP-aware. Combined with SP-aware chunked VAE encoding, this gives a clean training pipeline that scales with video length.
NVFP4 in training accelerates GEMM and reduces memory; in inference, W4A4 NVFP4 plus an NVFP4-quantized KV cache and asynchronous streaming VAE decoding give end-to-end speedup on Blackwell.
Non-Blackwell GPUs use sequence-parallel (SP) inference to recover similar throughput, and the NVFP4 KV cache also reduces inter-GPU communication during SP inference.
Reported results: up to 2.15x training speedup and 1.84x inference speedup, with 45.7 FPS at the 5B configuration.
LongLive-2.0 bypasses ODE initialization and distribution matching distillation: it directly tunes a diffusion model into a long, multi-shot, interactive AR diffusion model. Real-time generation (4 to 2 denoising steps) is available as a standalone LoRA.
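
The Balanced SP pairing described in the findings above can be illustrated with a toy layout. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`naive_layout`, `paired_layout`, `noisy_chunks_per_rank`) are hypothetical, the contiguous assignment of temporal chunks to ranks is an assumption, and "imbalance" is measured here only as the number of noisy-target chunks each rank holds.

```python
def naive_layout(num_chunks, world_size):
    """Concatenate [clean_0..clean_{T-1}, noisy_0..noisy_{T-1}] and split contiguously.

    Assumed baseline for comparison; not taken from the paper.
    """
    seq = [("clean", t) for t in range(num_chunks)]
    seq += [("noisy", t) for t in range(num_chunks)]
    assert len(seq) % world_size == 0, "toy sketch needs an even split"
    per_rank = len(seq) // world_size
    return [seq[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def paired_layout(num_chunks, world_size):
    """Give each rank the same temporal chunks in both clean and noisy form.

    Hypothetical rendering of 'pair clean-history and noisy-target chunks
    on each rank'; the real rank assignment may differ.
    """
    assert num_chunks % world_size == 0, "toy sketch needs an even split"
    per_rank = num_chunks // world_size
    layout = []
    for r in range(world_size):
        steps = range(r * per_rank, (r + 1) * per_rank)
        shard = [("clean", t) for t in steps] + [("noisy", t) for t in steps]
        layout.append(shard)
    return layout


def noisy_chunks_per_rank(layout):
    """Count noisy-target chunks per rank (a toy proxy for per-rank workload)."""
    return [sum(1 for kind, _ in shard if kind == "noisy") for shard in layout]


print(noisy_chunks_per_rank(naive_layout(8, 4)))   # [0, 0, 4, 4]
print(noisy_chunks_per_rank(paired_layout(8, 4)))  # [2, 2, 2, 2]
```

In the naive split, half the ranks hold only clean history and the other half hold every noisy target; the paired split gives every rank the same mix. A fuller cost model would also have to account for the attention mask, which this sketch deliberately leaves out.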

Relationship to prior wiki entries

LongLive-2.0 is the first wiki entry where NVFP4 (NVIDIA's FP4 format used end-to-end on Blackwell) is the dominant numerical format across training and inference of a generative model. The wiki has tracked Blackwell adoption through NVIDIA's Vera CPU thread (2026-05-19 morning social-stream, the Vera CPU hand-delivery to Anthropic, OpenAI, SpaceXAI, Oracle Cloud) and SemiAnalysis hardware coverage, but until today no paper had operationalised NVFP4 end-to-end with reproducible numbers.

The KV-cache angle composes with the wiki's running KV thread. TurboQuant (2026-04-22, the Google ICLR 2026 KV cache quantizer using random rotation plus per-coordinate optimal scalar quantizers plus 1-bit QJL residuals, achieving 6x+ memory reduction at 3.5 bits/channel) is the text-side cousin; LongLive-2.0's NVFP4 KV cache is the video-side application. Forcing-KV (2026-05-15, the video diffusion KV compression that exploits the static-vs-dynamic head functional split) attacks the same memory bottleneck via a different mechanism (head-role-aware pruning). LongLive-2.0 and Forcing-KV could compose: head-role-aware NVFP4 quantization would allow more aggressive precision on static heads.
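
To make the NVFP4 KV-cache idea concrete, here is a minimal NumPy sketch of NVFP4-style block quantization applied to a KV tensor. It assumes the publicly described NVFP4 layout (4-bit E2M1 values with one shared scale per 16-element block). As simplifications, the per-block scale is kept in float32 rather than FP8 and the per-tensor scale is omitted. The function name `fake_quant_nvfp4` is hypothetical, and blocking along the last (head) dimension is an assumption: this entry does not say how LongLive-2.0 lays out its cache.

```python
import numpy as np

# Non-negative magnitudes representable by a 4-bit E2M1 float.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # elements sharing one scale


def fake_quant_nvfp4(x):
    """Quantize to an E2M1 grid with a per-block scale, then dequantize.

    Simulation only: returns float32 values that an NVFP4-style format
    could represent, so the rounding error can be inspected.
    """
    x = np.asarray(x, dtype=np.float32)
    assert x.shape[-1] % BLOCK == 0, "last dim must be a multiple of the block size"
    blocks = x.reshape(-1, BLOCK)

    # Map each block's largest magnitude onto the largest grid value (6.0).
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0).astype(np.float32)

    # Round every scaled magnitude to the nearest representable grid value.
    scaled = np.abs(blocks) / scale
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)

    dequant = np.sign(blocks) * E2M1_GRID[idx] * scale
    return dequant.reshape(x.shape)


rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 64)).astype(np.float32)  # toy (tokens, head_dim) slice
err = np.abs(fake_quant_nvfp4(kv) - kv)
print("max abs error:", float(err.max()))
```

Because the widest gap on the grid is between 4 and 6, the worst-case rounding error within a block is one sixth of that block's largest magnitude. That bound is one reason per-head precision choices, such as the static-versus-dynamic split mentioned above, could matter.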

The Self-Forcing bypass is the second structural finding. Prior Self-Forcing series methods needed ODE initialization plus distribution matching distillation (a two-stage process where you first match the ODE trajectory then distill). LongLive-2.0 shows you can tune a diffusion model directly into a long, multi-shot AR diffusion model. If this generalises, the entire two-stage Self-Forcing pipeline simplifies for downstream researchers.

Why it matters

This is the wiki's first concrete demonstration that a frontier-grade generative-model training and inference stack can run entirely in FP4 with measurable wall-clock speedups (2.15x training, 1.84x inference) while preserving benchmark performance. The implications go beyond video: text LLMs targeting Blackwell B200 / B300 will need similar end-to-end FP4 stacks to extract the new hardware's throughput, and LongLive-2.0 is the first system-paper template for what that stack looks like in practice. The non-Blackwell SP-inference fallback adds deployment realism: not every operator has Blackwell yet, but the same model can serve on Hopper-class hardware, with SP recovering the throughput.

Research angle

Does the Balanced SP layout generalise to text LRMs? The teacher-forcing-clean-history-plus-noisy-target pattern is video-specific in framing, but the underlying SP-aware mask design might map to text long-context training. Diagnostic: apply Balanced SP to long-context text pre-training and measure SP imbalance versus naive layouts.
NVFP4 KV cache versus TurboQuant for text. TurboQuant reports quality-neutral results at 3.5 bits/channel and marginal degradation at 2.5 bits/channel. NVFP4 stores 4-bit elements (roughly 4.5 bits per element once the per-block FP8 scale is counted) but is native to Blackwell hardware. The expectation is that NVFP4 wins on wall-clock on Blackwell, while TurboQuant wins on memory on Hopper. The Pareto frontier still needs to be drawn.
Real-time generation via standalone LoRA. The 4-to-2 denoising-step LoRA is a discrete operating point. The deployment-relevant questions are whether intermediate LoRAs (for example, 3-step) exist along a continuous frontier and what the FPS-to-quality tradeoff looks like.
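
As a starting point for the NVFP4-versus-TurboQuant comparison, the memory axis of the Pareto frontier can be estimated with simple arithmetic. This is a rough sketch under stated assumptions. The NVFP4 figure counts one 8-bit scale per 16-element block and ignores the per-tensor scale. The TurboQuant figures are the bits/channel values quoted in this entry, with any metadata overhead ignored. The model shape is hypothetical and is not LongLive-2.0's configuration. Both function names are illustrative.

```python
def nvfp4_effective_bits(element_bits=4, scale_bits=8, block_size=16):
    """Bits per element once the shared per-block scale is amortized."""
    return element_bits + scale_bits / block_size


def kv_cache_gib(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_element):
    """KV cache size in GiB; the factor of 2 covers keys and values."""
    elements = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return elements * bits_per_element / 8 / 2**30


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=30, num_kv_heads=12, head_dim=128, num_tokens=100_000)

formats = {
    "NVFP4 (4-bit + block scale)": nvfp4_effective_bits(),
    "TurboQuant 3.5 bits/channel": 3.5,
    "TurboQuant 2.5 bits/channel": 2.5,
}
for name, bits in formats.items():
    print(f"{name}: {bits} bits/element, {kv_cache_gib(**shape, bits_per_element=bits):.2f} GiB")
```

This covers only the memory axis; the wall-clock axis depends on whether the hardware executes the format natively, which is exactly the Blackwell-versus-Hopper split raised above.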

Source

raw/huggingface/2026-05-19-longlive-20-an-nvfp4-parallel-infrastructure-for-long-video-.md