inference-efficiency · 2026-05-14 · Tier 1

Orthrus: dual-view diffusion + autoregressive on a shared KV cache

Source: HuggingFace Daily Papers · 2026-05-14 · Paper: arXiv 2605.12825 · Tier: 1 · Topics: KV cache, parallel decoding, lossless inference acceleration

TL;DR

Orthrus attaches an autoregressive (AR) head and a diffusion head to the same frozen LLM, with both sharing one KV cache. The AR head runs the context pre-fill to build a high-fidelity cache; the diffusion head then drafts in parallel from that same cache; an exact consensus mechanism between the two views guarantees the generation is identical to pure AR. The reported numbers: up to 7.8x speedup with O(1) KV-cache memory overhead and minimal added parameters. The lossless framing is what makes this interesting. Most parallel-decoding work either accepts some quality loss or requires a separate draft model. Orthrus does neither.

Why it matters

Three speedup axes are converging on the autoregressive decoding bottleneck: speculative decoding, MTP (multi-token prediction), and diffusion drafting. Each has a quality-vs-speed knob. Orthrus is the first paper in the wiki where the diffusion head and the AR head share the cache directly, with an exact-consensus mechanism that makes the output bit-identical to AR. That is a structurally different position from Speculative Decoding for RL Rollouts, which uses a separate draft model and accepts the standard draft-verify tradeoff. Orthrus does verification implicitly through the shared cache and the consensus check, so the committed output never diverges from what the AR head would have generated.

Mechanism

                                          frozen LLM (base weights)
                                                  │
                                ┌─────────────────┴─────────────────┐
                                │             shared KV cache       │
                                └──────────────┬────────────────────┘
                                               │
                       ┌───────────────────────┼────────────────────────┐
                       │                       │                        │
                AR head (pre-fill)      diffusion head           consensus check
                builds KV               (parallel draft)         (exact match)
                                               │                        │
                                               └────► generation = AR ◄──┘
  • AR view: standard transformer head. Runs the prompt pre-fill, populates the KV cache with the same representations a baseline AR model would.
  • Diffusion view: a lightweight trainable module attending to the same cache. Generates tokens in parallel under the diffusion-language-model objective. Critically, it does not write to the cache; it only reads.
  • Consensus: a mechanism the paper calls exact, where the diffusion draft is committed only when it agrees with what the AR head would produce. The paper claims this is provably lossless. The cost is O(1) cache overhead because no second cache is needed.
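The commit rule in the three bullets above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: it assumes greedy decoding and uses prefix-match verification (the usual draft-verify rule) as a stand-in for the exact consensus mechanism, which this note does not spell out. The names `commit_with_consensus`, `generate`, `toy_ar`, and `noisy_draft` are hypothetical, and tokens are plain integers so the sketch runs without a model.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
#   ar_next(prefix)    -> the AR head's greedy next token for that prefix
#   propose(prefix, k) -> k draft tokens from the diffusion head
ARNext = Callable[[Sequence[int]], int]
Propose = Callable[[Sequence[int], int], List[int]]


def commit_with_consensus(prefix: Sequence[int], draft: Sequence[int],
                          ar_next: ARNext) -> List[int]:
    """Commit draft tokens only while they agree with the AR head.

    At the first disagreement the AR token is committed instead and the
    rest of the draft is discarded, so the result always equals what
    greedy AR decoding would have produced. A real system would score
    all draft positions in one batched pass; this loop is sequential
    for clarity only.
    """
    out = list(prefix)
    for tok in draft:
        expected = ar_next(out)
        out.append(expected)      # equals tok whenever the views agree
        if tok != expected:
            break                 # drop the remaining draft tokens
    return out


def generate(prompt: Sequence[int], propose: Propose, ar_next: ARNext,
             max_new: int, block: int = 4) -> List[int]:
    """Draft a block, commit the agreeing prefix, repeat."""
    out = list(prompt)
    target = len(prompt) + max_new
    while len(out) < target:
        draft = propose(out, block)
        if not draft:             # empty draft: fall back to one AR step
            out.append(ar_next(out))
            continue
        out = commit_with_consensus(out, draft, ar_next)
    return out[:target]


# Toy stand-ins so the sketch is runnable end to end.
def toy_ar(seq: Sequence[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def noisy_draft(seq: Sequence[int], k: int) -> List[int]:
    """Drafts the first two tokens correctly, then goes wrong."""
    cur, draft = list(seq), []
    for i in range(k):
        tok = toy_ar(cur)
        if i >= 2:
            tok = (tok + 1) % 7   # deliberate error
        draft.append(tok)
        cur.append(tok)
    return draft


def pure_ar(prompt: Sequence[int], max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(toy_ar(out))
    return out
```

Under these assumptions a poor drafter only lowers the number of tokens committed per step; it cannot change the output, which is the property the lossless claim rests on.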

The diffusion head is trainable but the LLM is frozen, so Orthrus is an inference-acceleration retrofit that does not require re-pretraining the base. It is similar in spirit to MISA and MDN but operates on the decoder-step axis instead of the attention-pattern axis.

Connections

  • Lighthouse Attention (NousResearch tweet, 2026-05-12, retweeted by @omarsar0, paper) is a parallel idea: train with a removable subquadratic wrapper, recover full attention at the end. Orthrus is the inverse direction at inference: keep the base, add a removable parallel head. Both papers share the design pattern of "add a structure during one phase, deploy without it" but apply it to different stages.
  • TST (Token Superposition Training) (NousResearch, 2026-05-13) speeds up pretraining 2-3x via bag-of-tokens prediction during the first third of training, while keeping the deployed model identical to standard NTP. Orthrus is the inference-time analogue: deploy with parallel acceleration that is identical in output to AR. Two papers in one week converging on the "asymmetric training/inference, identical model" frame.
  • MTP on Unsloth (r/LocalLLaMA practitioner reports) suggests multi-token prediction is now in mainstream practitioner use. Orthrus is what MTP looks like when the verification step is exact rather than statistical.
  • Make Each Token Count (2026-05-12) made eviction policy-aware; Orthrus makes drafting policy-identical. Both keep the cache as the load-bearing object.

Research angle

  1. Composes with eviction? If Make-Each-Token-Count's gates can run while Orthrus drafts in parallel, the same cache becomes both the speed and quality lever. The paper doesn't measure this composition. Untested.
  2. Long-context regime. The 7.8x speedup is reported at standard context. Diffusion drafting historically struggles at long context; whether the exact-consensus mechanism degrades is an open question.
  3. Reasoning-model behavior. Reasoning models with long CoT are precisely the workload where the AR bottleneck bites hardest. The paper does not break down speedup by workload type. If reasoning models cleanly inherit the 7.8x, this is a serious production result; if they show a degraded consensus rate, the headline figure overstates the gain for that workload.
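To make the dependence on consensus rate in items 2 and 3 concrete, here is a back-of-envelope model. It is not from the paper. It assumes each draft token matches the AR head independently with probability `p`, that a block of `k` tokens is drafted per step, and that the AR token is committed at the first mismatch with no extra token after a fully accepted block. Under those assumptions the expected number of tokens committed per step is the geometric sum 1 + p + ... + p^(k-1). The function name and parameters are hypothetical.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens committed per draft-verify step.

    Assumes i.i.d. per-token agreement probability p and draft length k.
    The i-th token of a step (i = 1..k) is committed only if the first
    i - 1 draft tokens all agreed, which happens with probability
    p ** (i - 1).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return sum(p ** i for i in range(k))


if __name__ == "__main__":
    # Illustrative agreement rates only; none of these are measured values.
    for p in (0.5, 0.8, 0.95):
        print(f"p={p:.2f}  k=8  tokens/step={expected_tokens_per_step(p, 8):.2f}")
```

Tokens per step is an upper bound on the gain, not wall-clock speedup, since the drafting pass and the consensus check add their own cost. The model does show why a modest drop in agreement rate on long-context or long-CoT workloads would cut heavily into the headline figure.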

Where it lives

  • Update kv-cache.md: Orthrus is the first dual-view-on-one-cache architecture in the wiki.
  • Update speculative-decoding.md: Orthrus extends the speculative thread into the diffusion-head variant with exact verification.