vision-audio-video · 2026-04-16 · Tier 3

RationalRewards: Reasoning Rewards Scale Visual Generation at Training and Test Time

TL;DR: Teaching reward models to produce explicit multi-dimensional critiques before scoring, using rationales recovered by PARROT (Preference-Anchored Rationalization), turns them into active optimization tools. At training time they provide structured RL rewards; at test time a Generate-Critique-Refine loop improves outputs without any parameter updates, matching or exceeding RL fine-tuning on several benchmarks.

Key Findings

  • Standard reward models collapse human preference into a single scalar, discarding the reasoning that drives the preference.
  • PARROT recovers high-quality rationales from preference data that has no rationale annotations, via anchored generation, consistency filtering, and distillation.
  • RationalRewards (8B) is state-of-the-art among open-source reward models and competitive with Gemini-2.5-Pro, while using 10–20× less training data.
  • Training time: structured rationales provide fine-grained RL rewards that outperform scalar alternatives.
  • Test time: a Generate-Critique-Refine loop matches or exceeds RL fine-tuning on several benchmarks, with no parameter updates.
  • Key insight: structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
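
The test-time Generate-Critique-Refine loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Critique` type, the `generate_critique_refine` function, the dimension names, the stopping threshold, and the toy generator, critic, and refiner are all hypothetical stand-ins. It assumes the refinement step rewrites the prompt from the critique's rationale, and it leaves the generator's parameters untouched.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Critique:
    """Hypothetical structured critique: per-dimension scores plus a rationale."""
    scores: Dict[str, float]  # dimension names are illustrative assumptions
    rationale: str

    @property
    def overall(self) -> float:
        # Simple mean over dimensions; the real aggregation is not specified here.
        return sum(self.scores.values()) / len(self.scores)


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    refine: Callable[[str, Critique], str],
    max_rounds: int = 3,
    target: float = 0.9,
) -> Tuple[str, Critique]:
    """Improve an output by rewriting the prompt; no model weights change."""
    best_output = generate(prompt)
    best_critique = critique(prompt, best_output)
    current_prompt, current_critique = prompt, best_critique
    for _ in range(max_rounds):
        if best_critique.overall >= target:
            break  # good enough, stop early
        # Use the latest rationale to rewrite the prompt, then regenerate.
        current_prompt = refine(current_prompt, current_critique)
        output = generate(current_prompt)
        # Always judge against the ORIGINAL user prompt.
        current_critique = critique(prompt, output)
        if current_critique.overall > best_critique.overall:
            best_output, best_critique = output, current_critique
    return best_output, best_critique


# --- Toy stand-ins (assumptions for illustration only) ---
KEYWORDS = ("detailed", "sharp")


def toy_generate(prompt: str) -> str:
    return f"image<{prompt}>"


def toy_critique(prompt: str, output: str) -> Critique:
    scores = {
        "fidelity": 1.0 if "detailed" in output else 0.4,
        "quality": 1.0 if "sharp" in output else 0.4,
    }
    lacking = [kw for kw in KEYWORDS if kw not in output]
    return Critique(scores, "lacking: " + ", ".join(lacking))


def toy_refine(prompt: str, crit: Critique) -> str:
    # Address the first issue named in the rationale.
    for kw in KEYWORDS:
        if kw in crit.rationale:
            return f"{prompt}, {kw}"
    return prompt


final_output, final_critique = generate_critique_refine(
    "a cat", toy_generate, toy_critique, toy_refine
)
print(final_output, final_critique.overall)
```

Passing the generator, critic, and refiner as plain callables keeps the loop independent of any particular model. In a real setup the critic would be the reasoning reward model and the generator an image model.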

Related Pages

Raw source: ../../raw/huggingface/2026-04-16-rationalrewards-reasoning-rewards-scale-visual-generation-bo.md