PrfaaS: Prefill-as-a-Service via Cross-Datacenter KV Cache Transfer
Date: 2026-04-22
Source: HuggingFace Daily Papers
Paper: arxiv 2604.15039
Institution: Moonshot AI and Tsinghua University
Raw: (parallel daily digest 2026-04-22)
TL;DR
PrfaaS (Moonshot AI + Tsinghua) offloads long-context prefill to separate compute-dense clusters in different datacenters, then transfers the resulting KV cache over commodity Ethernet. The enabler is hybrid-attention models (Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, Ring-2.5-1T), whose KV caches are 13-36x smaller than those of dense-attention models; that reduction is what makes cross-WAN transfer economically viable. Reported results: 54% higher throughput and 50% lower mean TTFT than homogeneous baselines.
Architecture
User request (long prompt)
│
[Prefill cluster] ← compute-dense, separate datacenter
Hybrid-attention model ← most layers are linear (O(L) not O(L²))
Produces compact KV ← 4.66 Gbps (MiMo-V2-Flash) vs 59.93 Gbps for dense attention
│
Transfer over commodity Ethernet
│
[Decode cluster] ← memory-bandwidth-optimized, production DC
Receives KV cache ← 13x smaller than dense-attention equivalent
Runs autoregressive decode
│
Output tokens streamed to user
Why hybrid attention makes this possible: Dense transformers keep a full-attention KV cache at every layer. Hybrid-attention models interleave a small number of full-attention layers with a larger number of linear-complexity layers (state-space or linear recurrence), and the linear layers carry little or no per-token KV state. MiMo-V2-Flash produces cache at 4.66 Gbps versus 59.93 Gbps for the dense-attention MiniMax-M2.5, a reduction of roughly 13x (59.93 / 4.66 ≈ 12.9). At that rate, commodity cross-datacenter Ethernet is fast enough to carry the cache.
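The 13x figure is simply the ratio of the two rates quoted above. A minimal sketch of that arithmetic follows, together with a rough link-budget check. The 100 Gbps link capacity and the function names are illustrative assumptions, not values from the paper.

```python
# KV cache production rates quoted in this note, in Gbps.
MIMO_V2_FLASH_GBPS = 4.66   # hybrid attention
MINIMAX_M25_GBPS = 59.93    # dense attention

# Hypothetical cross-datacenter link capacity (an assumption for illustration).
ASSUMED_LINK_GBPS = 100.0


def reduction_factor(dense_gbps: float, hybrid_gbps: float) -> float:
    """Return how many times smaller the hybrid KV stream is than the dense one."""
    return dense_gbps / hybrid_gbps


def link_utilization(kv_gbps: float, link_gbps: float) -> float:
    """Return the fraction of a link consumed by one KV stream."""
    return kv_gbps / link_gbps


ratio = reduction_factor(MINIMAX_M25_GBPS, MIMO_V2_FLASH_GBPS)
hybrid_share = link_utilization(MIMO_V2_FLASH_GBPS, ASSUMED_LINK_GBPS)
dense_share = link_utilization(MINIMAX_M25_GBPS, ASSUMED_LINK_GBPS)

print(f"reduction factor: {ratio:.2f}x")
print(f"hybrid stream uses {hybrid_share:.1%} of the assumed link")
print(f"dense stream uses {dense_share:.1%} of the assumed link")
```

Under the assumed link, the dense stream would occupy more than half the capacity for a single stream, while the hybrid stream leaves most of it free for other traffic.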
Why prefill and decode are split: Prefill is compute-intensive: it processes the entire prompt in parallel. Decode is memory-bandwidth-intensive: it emits one token per step and is bottlenecked by KV reads. Running each phase on hardware optimized for it is the natural split; PrfaaS moves that split across a WAN link.
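The compute-bound versus bandwidth-bound contrast can be illustrated with a back-of-the-envelope arithmetic-intensity estimate. This is a simplified sketch, not the paper's analysis: it assumes roughly 2 FLOPs per parameter per token, 16-bit weights, batch size 1, and ignores attention FLOPs and activation traffic. The parameter count and KV size are made-up example values.

```python
BYTES_PER_PARAM = 2.0  # assumption: 16-bit weights


def prefill_intensity(prompt_tokens: int) -> float:
    """FLOPs per byte of weight traffic for one prefill pass.

    Each weight is read once but used for every prompt token,
    so intensity grows linearly with prompt length.
    """
    return 2.0 * prompt_tokens / BYTES_PER_PARAM


def decode_intensity(params: float, kv_bytes_read: float) -> float:
    """FLOPs per byte moved for one decode step at batch size 1.

    Every step re-reads the weights and the KV cache to produce one token.
    """
    flops = 2.0 * params
    bytes_moved = params * BYTES_PER_PARAM + kv_bytes_read
    return flops / bytes_moved


# Example values (hypothetical): 32k-token prompt, 10B parameters, 4 GB of KV reads.
prefill = prefill_intensity(32_768)
decode = decode_intensity(10e9, 4e9)

print(f"prefill: {prefill:.0f} FLOPs/byte")
print(f"decode:  {decode:.2f} FLOPs/byte")
```

Under these assumptions prefill performs thousands of FLOPs per byte moved while decode performs fewer than one, which is why the two phases favor different hardware.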
Key Numbers
| Metric | Value | Comparison |
|---|---|---|
| KV transfer rate, MiMo-V2-Flash (hybrid) | 4.66 Gbps | 13x reduction vs dense |
| KV transfer rate, MiniMax-M2.5 (dense) | 59.93 Gbps | baseline |
| System throughput | +54% | vs homogeneous |
| Mean TTFT | -50% | vs homogeneous |
Strategic Implication
This paper changes what "serving topology" means for hybrid-attention models. It creates a feedback loop: models with smaller KV caches enable new serving architectures (cross-DC disaggregation), which reduce infrastructure costs, which create economic incentive to prefer hybrid-attention models. Architecture choice is now a serving economics decision, not just a quality decision.
Relation to Prior Wiki Pages
Directly extends KV cache concept page: Previous work (KV Packet 04-17) addressed reuse; LongAct (04-18) addressed training efficiency at cache positions; TurboQuant (04-22) addresses compression. PrfaaS addresses serving topology — a level up from per-token cache management.
Connects to Nemotron 3 Super (04-21): Nemotron's Mamba blocks produce much smaller state than full attention — same category as hybrid-attention models enabling PrfaaS. The serving economics argument for hybrid attention now has concrete backing.
Connects to AI routing (Tier 1 intersection): PrfaaS is essentially routing prefill to specialist compute. If goodput (SemiAnalysis 04-21) varies by provider, a routing framework that selects prefill providers based on cache transfer cost and compute density per dollar would be a natural extension.
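The routing extension suggested above could be sketched as a per-request cost comparison. Everything here is hypothetical: the `PrefillProvider` fields, the prices, and the two-term cost model (compute plus cache egress) are assumptions made for illustration, not part of PrfaaS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefillProvider:
    name: str
    usd_per_pflop: float       # hypothetical compute price
    usd_per_gb_egress: float   # hypothetical price to ship KV cache out


def request_cost(p: PrefillProvider, prefill_pflop: float, kv_cache_gb: float) -> float:
    """Total cost of prefilling one request and shipping its KV cache."""
    return p.usd_per_pflop * prefill_pflop + p.usd_per_gb_egress * kv_cache_gb


def pick_provider(providers, prefill_pflop: float, kv_cache_gb: float) -> PrefillProvider:
    """Choose the provider with the lowest combined compute and transfer cost."""
    return min(providers, key=lambda p: request_cost(p, prefill_pflop, kv_cache_gb))


providers = [
    PrefillProvider("A", usd_per_pflop=0.10, usd_per_gb_egress=0.08),  # cheap compute
    PrefillProvider("B", usd_per_pflop=0.12, usd_per_gb_egress=0.01),  # cheap egress
]

# Same prefill work, two cache sizes: a large cache and one 10x smaller.
large_cache_choice = pick_provider(providers, prefill_pflop=2.0, kv_cache_gb=1.0)
small_cache_choice = pick_provider(providers, prefill_pflop=2.0, kv_cache_gb=0.1)

print(large_cache_choice.name, small_cache_choice.name)
```

With these made-up prices, shrinking the cache shifts the choice from the egress-cheap provider to the compute-cheap one, which is the effect the serving-economics argument relies on.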
Open Questions
- What is the WAN latency impact on TTFT at tail percentiles? The 54% throughput gain and 50% mean TTFT reduction are strong, but P99 TTFT under network jitter is the production-critical metric.
- Does the cache-transfer approach work with TurboQuant compression applied at the prefill cluster before transfer? That would compound the volume reduction.
- How does disaggregation interact with KV Packet-style cross-request cache reuse? If cached document packets live in the decode cluster, prefill that hits a cached segment could skip transfer entirely.
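The first question, mean versus tail TTFT, can be explored with a toy simulation. This is a sketch under loud assumptions: TTFT is modeled as a fixed prefill time plus a KV transfer time whose slowdown is log-normally distributed, and the prefill time, cache size, link speed, and jitter parameter are invented values, not measurements from the paper.

```python
import random
import statistics


def simulate_ttft(n: int, prefill_s: float, kv_gbit: float,
                  link_gbps: float, jitter_sigma: float, seed: int = 0) -> list:
    """Sample TTFT values as prefill time plus a jittered KV transfer time."""
    rng = random.Random(seed)  # seeded for repeatable runs
    base_transfer_s = kv_gbit / link_gbps
    samples = []
    for _ in range(n):
        # Log-normal slowdown with median 1.0 (assumed jitter model).
        slowdown = rng.lognormvariate(0.0, jitter_sigma)
        samples.append(prefill_s + base_transfer_s * slowdown)
    return samples


def percentile(samples: list, q: float) -> float:
    """Return the sample at or just above the q-th quantile (0 <= q <= 1)."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


# Hypothetical scenario: 1 s prefill, 8 Gbit cache, 10 Gbps link.
samples = simulate_ttft(n=10_000, prefill_s=1.0, kv_gbit=8.0,
                        link_gbps=10.0, jitter_sigma=0.5)
mean_ttft = statistics.mean(samples)
p99_ttft = percentile(samples, 0.99)

print(f"mean TTFT: {mean_ttft:.2f} s")
print(f"P99 TTFT:  {p99_ttft:.2f} s")
```

Even in this toy model the P99 sits well above the mean, so a reported mean improvement does not by itself bound the tail.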