vision-audio-video · 2026-05-02 · Tier 3

ViPO: Visual Preference Optimization at Scale

TL;DR

Existing visual preference datasets contain conflicting labels: an image that excels on aesthetics but fails on semantic alignment is still forced into a single binary winner/loser label, which produces contradictory gradients. ViPO builds a dataset of 1M image pairs and 300K video pairs with balanced, reliable preference signals, and pairs it with Poly-DPO, a DPO variant with an added polynomial confidence-calibration term. Key finding: on high-quality data, Poly-DPO collapses to standard DPO, so the added sophistication helps only when the data is noisy.
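The label conflict described above can be made concrete with a small sketch. The `Scores` fields, the helper names, and the numeric values are hypothetical and chosen only for illustration; they are not taken from ViPO or from any existing dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    # Hypothetical per-dimension quality scores; higher is better.
    aesthetics: float
    alignment: float


def binary_label(a: Scores, b: Scores, dimension: str) -> str:
    """Return 'a' or 'b' as the winner when judging on a single dimension."""
    return "a" if getattr(a, dimension) >= getattr(b, dimension) else "b"


def is_conflicting(a: Scores, b: Scores) -> bool:
    """A pair conflicts when the two dimensions disagree about the winner."""
    return binary_label(a, b, "aesthetics") != binary_label(a, b, "alignment")


# One image looks better, the other follows the prompt better.
pretty_but_off_prompt = Scores(aesthetics=0.9, alignment=0.3)
plain_but_on_prompt = Scores(aesthetics=0.5, alignment=0.8)

conflict = is_conflicting(pretty_but_off_prompt, plain_but_on_prompt)
```

Whichever image an annotator picks as the single winner, the label rewards one dimension and penalizes the other. Across a dataset, such pairs push the model in opposite directions.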

Key findings

  • Existing preference datasets contain conflicting preference patterns because multi-dimensional quality is collapsed into binary labels.
  • Poly-DPO adds a polynomial term to DPO that dynamically calibrates model confidence based on data quality.
  • ViPO dataset: 1M image pairs (1024px, 5 categories) + 300K video pairs (720p+, 3 categories).
  • On noisy data: Poly-DPO gains +6.87 (SD1.5) and +2.32 (SDXL) over Diffusion-DPO on GenEval.
  • On ViPO's high-quality data: Poly-DPO reduces to standard DPO. The polynomial term becomes unnecessary when the data is clean.
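To illustrate how a polynomial term can sit on top of DPO and vanish when it is not needed, here is a minimal sketch. This note does not give Poly-DPO's exact formula, so the block assumes a PolyLoss-style first-order term, epsilon * (1 - sigmoid(margin)). That form, the `epsilon` parameter, and all function names are assumptions made for illustration, not the paper's definition. The margin is written with log-probabilities for simplicity; Diffusion-DPO derives it from denoising errors instead.

```python
import math


def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Implicit reward margin between the winner (w) and the loser (l)."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(margin):
    """Standard DPO loss for one preference pair."""
    return -log_sigmoid(margin)


def poly_dpo_loss(margin, epsilon=0.0):
    """ASSUMED form: DPO loss plus a first-order polynomial term.

    epsilon = 0 recovers standard DPO exactly. With epsilon > 0, the extra
    term adds loss in proportion to 1 - sigmoid(margin), i.e. to how far the
    model is from confidently preferring the labeled winner.
    """
    return dpo_loss(margin) + epsilon * (1.0 - sigmoid(margin))


margin = dpo_margin(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5)
plain = dpo_loss(margin)
calibrated = poly_dpo_loss(margin, epsilon=1.0)
```

Under this assumed form, the reported collapse to standard DPO on clean data corresponds to the best-performing coefficient being at or near zero.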

Relation to prior wiki knowledge

N-of-2 with Semi-DPO (same day): Both papers diagnose the same root cause — multi-dimensional preferences collapsed to binary labels create conflicting training signal. ViPO's response: build better data. Semi-DPO's response: treat conflicting pairs as noisy unlabeled data and pseudo-label. Complementary solutions.

The deeper insight from comparing them: ViPO's finding that Poly-DPO reduces to DPO on clean data implies that Semi-DPO's pseudo-labeling should also end up at clean-data DPO once its iterative refinement converges. The two papers predict the same long-run fixed point from different starting points.

Links