D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Source: HuggingFace Daily Papers (2026-05-07) · Paper: arXiv 2605.05204

TL;DR

Step-distilled few-step diffusion models (Z-Image-Turbo, FLUX.2-klein) are now the dominant image generation regime, but standard supervised fine-tuning destroys their few-step capability. D-OPSD proposes a self-distillation paradigm where the same model serves as teacher and student under different conditioning: the teacher sees text plus the target image (multimodal), the student sees only text. Training minimises the divergence between the two predictions over the student's own roll-outs, so adaptation happens on-policy without external supervision.

Mechanism

Modern step-distilled diffusion models use an LLM or VLM as the conditioning encoder. D-OPSD exploits the encoder's in-context capability: feed the model the target image plus the text prompt and it produces a sharper, target-conditioned distribution that can serve as a teacher signal for the same model conditioned on text alone. The student rolls out from text, the teacher provides the multimodal-conditioned target, and the loss is the divergence between the two over the student's own trajectory.
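The loop described above can be sketched in a few lines. This is a toy illustration rather than the paper's implementation: `toy_model`, the Euler roll-out, and the squared-error divergence are all assumptions chosen to keep the example self-contained, and the summary does not say which divergence D-OPSD actually uses. The point is only to show the structure: one network, two conditioning modes, loss evaluated on states the student itself visits.

```python
# Toy sketch of on-policy self-distillation under conditioning asymmetry.
# Everything here (toy_model, squared-error divergence, Euler steps) is a
# hypothetical stand-in, not the D-OPSD implementation.

def toy_model(x, t, text_emb, image_emb=None):
    """Stand-in for the shared network: predicts a velocity at state x.

    With image_emb supplied (teacher mode) the prediction is pulled toward
    the target image; without it (student mode) only text is used.
    """
    if image_emb is None:
        target = text_emb
    else:
        target = [0.5 * (a + b) for a, b in zip(text_emb, image_emb)]
    return [tg - xi for tg, xi in zip(target, x)]


def student_rollout(x0, text_emb, num_steps=4):
    """Few-step Euler roll-out under text-only conditioning.

    Returns the (state, time) pairs the student visits, i.e. its own trajectory.
    """
    states, x = [], list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        states.append((list(x), t))
        v = toy_model(x, t, text_emb)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return states


def self_distillation_loss(states, text_emb, image_emb):
    """Mean squared gap between student and teacher predictions on student states.

    In a real training loop the teacher prediction would be treated as a
    constant (no gradient flows through it).
    """
    total, count = 0.0, 0
    for x, t in states:
        v_student = toy_model(x, t, text_emb)             # text only
        v_teacher = toy_model(x, t, text_emb, image_emb)  # text + target image
        total += sum((s - q) ** 2 for s, q in zip(v_student, v_teacher))
        count += len(x)
    return total / count


if __name__ == "__main__":
    text_emb, image_emb = [0.0, 0.0], [2.0, 2.0]
    states = student_rollout([1.0, -1.0], text_emb, num_steps=4)
    print(self_distillation_loss(states, text_emb, image_emb))
```

In this toy the loss vanishes when the image adds no information beyond the text, which mirrors the intuition that the teacher signal comes entirely from the extra conditioning.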

Why it matters

Continuous fine-tuning of step-distilled models has been an open practical problem. Standard SFT collapses the few-step capability because the supervision distribution does not match the student's compressed trajectory. D-OPSD sidesteps the teacher-student mismatch entirely by making the model its own teacher with stronger conditioning, which is closer to TIP's on-policy framing for text than to any prior diffusion fine-tuning recipe.

Connections

This is the seventh paper in the neutral exchange / on-policy distillation pattern the wiki has been tracking (knowledge-distillation.md). After BLD (bytes), TESSY (cooperative interleaving), Switch-KD (shared text probability space), Tide (cross-architecture diffusion), and CoPD (parallel co-evolution), D-OPSD adds self-supervised self-distillation under conditioning asymmetry. The neutral channel here is not a separate representation but a different conditioning context for the same network.

The contrast with Stream-R1 (also 05-07) is sharp. Stream-R1 reweights distillation by external reward; D-OPSD eliminates the external teacher entirely. Both papers start from the same core insight, that uniform supervision over distilled diffusion rollouts wastes signal, but they take opposite remediation paths.

Research angle

Whether the conditioning-asymmetry trick generalises to text reasoning models is the obvious follow-up. The setup would pair a reasoning student conditioned on the problem statement with a teacher conditioned on the problem plus the gold solution, and run on-policy distillation between them. This is structurally the same as standard self-distillation, except that the teacher's advantage comes from asymmetric access to evidence rather than from greater capacity.
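A rough sketch of what that text analogue could look like at the token level follows. It is speculative: the per-token reverse KL, the function names, and the idea of passing in precomputed logits are all assumptions made for illustration, not something proposed in the paper. The student logits come from the model conditioned on the problem alone, the teacher logits from the same model conditioned on the problem plus the gold solution, both evaluated along a sequence the student sampled.

```python
import math

# Hypothetical text analogue: same model, two contexts, per-token divergence
# measured along the student's own sampled sequence.

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position (an assumed choice)."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def on_policy_token_loss(student_steps, teacher_steps):
    """Average per-token reverse KL over a student-sampled sequence.

    student_steps[i]: logits given (problem, sampled prefix)
    teacher_steps[i]: logits given (problem + gold solution, same prefix)
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must cover the same positions")
    kls = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)
```

Whether reverse KL, forward KL, or some other divergence is the right choice here is an open design question; reverse KL is used only because it is a common default in on-policy distillation for language models.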
