llms-foundation-models · 2026-05-18 · Tier 2

CIPO: Correction-Oriented Policy Optimization with Verifiable Rewards

Date ingested: 2026-05-18
Source: HuggingFace Daily Papers 2026-05-18
arXiv: 2605.14539
Tier: 2 (RLVR, reasoning)
Raw: raw/huggingface/2026-05-18-learning-from-failures-...md

TL;DR

CIPO is a drop-in extension to standard Reinforcement Learning with Verifiable Rewards (RLVR, the paradigm where the reward signal comes from a programmatic verifier, e.g. unit tests or exact match on a math answer) that converts on-policy failed trajectories into correction-oriented supervision. The mechanism is task-agnostic: no process reward model, no LLM critic, no extra human labelling. Training optimizes the standard RLVR objective jointly with a correction objective derived from the model's own failed attempts. Across 11 benchmarks spanning mathematical reasoning and code generation, CIPO outperforms strong RLVR baselines. The headline structural claim is the pass@K gain: CIPO improves pass@K more than it improves pass@1, which the paper takes as evidence that the method expands the model's intrinsic reasoning capacity rather than redistributing probability mass over already-discovered correct answers.

Why it matters

Standard GRPO-family RLVR has a known sparse-credit problem: a failed rollout receives a binary negative signal and the gradient direction is undifferentiated across "wrong because the last step missed a case" and "wrong because the first step misread the problem." CIPO addresses this without adding a reward model. That makes it a cheap surgical patch on the standard RLVR loop rather than a new pipeline.
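The sparse-credit problem can be made concrete with a minimal sketch of group-relative advantages in the GRPO style. This is an illustration, not code from the CIPO paper: the function name, the reward values, and the failure descriptions are hypothetical, and the population standard deviation is one of several normalization choices found in implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Outcome-level advantages: (reward - group mean) / (group std + eps).

    Assumption: population std; some implementations use the sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one prompt.
# 1.0 = verifier pass, 0.0 = verifier fail.
rewards = [1.0, 0.0, 0.0, 0.0]

# Hypothetical failure modes; the verifier never sees these distinctions.
failure_notes = {
    1: "misread the problem in the first step",
    2: "missed a case in the last step",
    3: "arithmetic slip mid-derivation",
}

advantages = group_relative_advantages(rewards)
for idx, note in failure_notes.items():
    # Every failed rollout gets the same scalar, whatever went wrong.
    print(f"rollout {idx}: advantage={advantages[idx]:+.3f} ({note})")
```

All three failed rollouts receive an identical negative advantage, so the gradient cannot tell an early misreading from a late slip. That undifferentiated signal is the gap the correction objective is meant to fill.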

The pass@K-over-pass@1 gain is the load-bearing claim. Prior critiques of RLVR (RLVR Weak Supervision 2026-04-21, the wiki's standing reference that argued RLVR mostly reorders probability mass without expanding the underlying capability) predicted the opposite outcome for naive RLVR: gains concentrated in pass@1 rather than pass@K. CIPO's reported result runs against that prediction, in the direction the field needs.

Method

Three pieces:

  1. Failed-trajectory mining. For each prompt, sample N rollouts. Partition into successes (verifier returns pass) and failures (verifier returns fail). The failures are the supervision substrate that ordinary RLVR throws away.
  2. Correction-oriented supervision. From each failed trajectory, construct a "correction sample" that pairs the failed prefix with a correct continuation, derived either from the same model's own successful rollouts on adjacent prompts or from a constrained re-sampling step. The paper's load-bearing detail is that this supervision uses no external signal: it is derived entirely from the on-policy rollout pool.
  3. Joint objective. The training objective is the standard RLVR objective plus a correction loss term that asks the model to attend to the failed prefix and emit the correction continuation. The weight that balances the RLVR and correction terms is task-agnostic in the paper.
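The three pieces above can be sketched as a small data pipeline. This is a simplified illustration under stated assumptions, not the paper's implementation: the names Rollout, CorrectionSample, build_correction_samples, joint_loss and the weight lam are hypothetical, the correction target is borrowed from a successful rollout in the same group (the note also mentions adjacent prompts), and the constrained re-sampling step is represented only by an optional callback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rollout:
    prompt: str
    completion: str
    passed: bool  # outcome of the programmatic verifier


@dataclass
class CorrectionSample:
    prompt: str
    failed_attempt: str  # the failed trajectory the model conditions on
    correction: str      # the target continuation


def partition(rollouts: List[Rollout]) -> Tuple[List[Rollout], List[Rollout]]:
    """Step 1: split a rollout group into verifier passes and fails."""
    successes = [r for r in rollouts if r.passed]
    failures = [r for r in rollouts if not r.passed]
    return successes, failures


def build_correction_samples(
    rollouts: List[Rollout],
    resample: Optional[Callable[[Rollout], Optional[str]]] = None,
) -> List[CorrectionSample]:
    """Step 2: turn each failure into a correction sample.

    Assumption: prefer an on-policy success from the same group; otherwise
    fall back to a hypothetical constrained re-sampling callback.
    """
    successes, failures = partition(rollouts)
    samples = []
    for fail in failures:
        if successes:
            target = successes[0].completion
        elif resample is not None:
            target = resample(fail)
        else:
            target = None
        if target is not None:
            samples.append(CorrectionSample(fail.prompt, fail.completion, target))
    return samples


def joint_loss(rlvr_loss: float, correction_loss: float, lam: float) -> float:
    """Step 3: RLVR objective plus a weighted correction term (lam is assumed)."""
    return rlvr_loss + lam * correction_loss


group = [
    Rollout("2+2?", "The answer is 5.", passed=False),
    Rollout("2+2?", "The answer is 4.", passed=True),
    Rollout("2+2?", "The answer is 22.", passed=False),
]
samples = build_correction_samples(group)
print(len(samples), "correction samples from", len(group), "rollouts")
```

In this sketch a group with no successes and no re-sampling callback yields no correction samples, so such prompts would fall back to the plain RLVR signal. Whether the paper handles all-fail groups this way is not stated in this note.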

Connection to prior wiki context

NudgeRL on the same day (2026-05-18, arXiv 2605.15726, the paper that conditions each RLVR rollout on a lightweight strategy-level context to induce diverse reasoning trajectories without oracle supervision, reporting up to 8x larger effective rollout budget than vanilla GRPO). CIPO and NudgeRL address the same RLVR weakness from different sides. NudgeRL diversifies what gets explored; CIPO extracts more signal from what was explored and failed. Composing them is the natural next experiment: strategy-nudged exploration plus correction-oriented supervision on the resulting failures.

RLVR Weak Supervision (2026-04-21). That paper argued RLVR mostly redistributes existing capability rather than expanding it. CIPO's pass@K gain is the first concrete counter-evidence in the wiki to that critique. The wiki should track whether the pass@K gain replicates on independent benchmarks.

LongAct (2026-04-18, the paper that found long-context training signal concentrates in the first 5% of high-magnitude Q/K activations and restricted RL gradient updates to those weights, yielding ~8% gain on LongBench v2). LongAct restricts where the gradient flows; CIPO restricts what the gradient supervises. Different layers of the same fix: not all gradients are equally informative, and not all training trajectories are equally informative.

VGF (2026-04-19, value-gradient flow as the distribution-transport formulation of RL post-training). VGF asks where probability mass should flow at the distribution level. CIPO asks how to convert each failure into a finite-sample signal that nudges that flow. The frames are compatible.

LLMs Gaming Verifiers (Kurate cs.LG #10 this week, ai_rating 6.8/10, the paper showing RLVR pipelines can be reward-hacked by the policy learning to game the verifier rather than solve the task). CIPO uses the verifier signal more aggressively, so it inherits the reward-hacking risk. Whether CIPO amplifies or dampens that risk is not addressed in the paper and is worth a follow-up.

Research angle

  1. Pass@K gain stability. The headline claim depends on pass@K rising more than pass@1. Whether that holds at large K (>=32) on independent benchmarks like AIME 2026 is the first easy falsifier.
  2. Composition with NudgeRL. Falsifiable in one experiment: train with strategy-nudged exploration (NudgeRL) plus correction-oriented supervision (CIPO) and compare the combined gain against the sum of their individual gains.
  3. Reward-hacking diagnostic. Run the Kurate cs.LG #10 reward-hacking diagnostic on a CIPO-trained model.
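For the first item, a replication needs a pass@K estimate that stays reliable at large K. A minimal sketch, assuming the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that pass the verifier. Whether the CIPO paper uses this exact estimator is not stated in this note, and the sample counts below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 64 samples, 2 of them correct.
p1 = pass_at_k(n=64, c=2, k=1)
p32 = pass_at_k(n=64, c=2, k=32)
print(f"pass@1={p1:.3f} pass@32={p32:.3f}")
```

With these hypothetical counts, pass@1 is about 0.03 while pass@32 is about 0.75, which shows why a rare correct rollout barely moves pass@1 yet dominates pass@K. Averaging the estimate per problem before and after training gives the two deltas the falsifier compares.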

Links