llms-foundation-models · 2026-05-18 · Tier 2

NudgeRL: Strategy-Guided Exploration for RLVR

Date ingested: 2026-05-18
Source: HuggingFace Daily Papers 2026-05-18
arXiv: 2605.15726
Tier: 2 (RLVR, exploration)
Raw: raw/huggingface/2026-05-18-nudging-beyond-the-comfort-zone-...md

TL;DR

NudgeRL conditions each RLVR (Reinforcement Learning with Verifiable Rewards) rollout on a lightweight strategy-level context to induce diverse reasoning trajectories without relying on expensive oracle supervision. Across five mathematical-reasoning benchmarks the framework outperforms vanilla GRPO (Group Relative Policy Optimization, the lightweight on-policy RL recipe that most reasoning post-training pipelines now use), even when GRPO is given rollout budgets up to 8x larger. The unified training objective decomposes the reward signal into an inter-context component (which strategies are working across rollouts) and an intra-context component (within a fixed strategy, which sampled trajectory is better), and adds a distillation objective that transfers the strategies found to be useful back into the base policy.

Why it matters

The standard fix for sparse-reward RLVR is to increase the rollout budget so the policy stumbles onto more correct trajectories by chance. NudgeRL replaces brute-force rollout scaling with structured exploration. The 8x rollout-budget comparison is the headline efficiency claim: NudgeRL outperforms vanilla GRPO even when GRPO is given up to 8x as many rollouts. Note that the saving is stated in rollout budget, which is not necessarily the same as total training compute.

The framing of strategy-level context is what makes the supervision cheap. Strategies are not oracle hints (which would require domain experts) but lightweight prompt-level scaffolds (e.g. "try a case-by-case approach", "set up a recurrence", "find an invariant") that the policy can be encouraged to explore. The result is exploration that respects the structure of the problem space without needing privileged information.
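
A minimal sketch of what prompt-level strategy conditioning could look like. The pool below contains only the three example hints quoted above, and the prompt template, the function name build_nudged_prompts, and the round-robin assignment rule are assumptions made for illustration, not details taken from the paper.

```python
import random

# Hypothetical strategy pool: the three example hints quoted in the text,
# not the pool actually used in the paper.
STRATEGY_POOL = [
    "try a case-by-case approach",
    "set up a recurrence",
    "find an invariant",
]


def build_nudged_prompts(problem, num_rollouts, rng):
    """Return one (strategy, prompt) pair per rollout.

    Strategies are assigned round-robin from a shuffled copy of the pool,
    so every strategy is used before any is repeated. This assignment rule
    and the prompt template are illustrative assumptions.
    """
    order = STRATEGY_POOL[:]
    rng.shuffle(order)
    pairs = []
    for i in range(num_rollouts):
        strategy = order[i % len(order)]
        prompt = f"Strategy hint: {strategy}.\n\nProblem: {problem}"
        pairs.append((strategy, prompt))
    return pairs


problem = "Count the tilings of a 2 x n board with dominoes."
pairs = build_nudged_prompts(problem, num_rollouts=6, rng=random.Random(0))
for strategy, prompt in pairs:
    print(strategy)
```

With six rollouts and three strategies, each hint conditions exactly two rollouts. The hint carries no answer and no solution path, which is the sense in which the supervision is cheap.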

Method

Three components:

  1. Strategy Nudging. Each rollout is conditioned on a strategy-level context, drawn from a lightweight pool. The policy is encouraged to produce a trajectory consistent with that strategy. The strategy is not a target answer or an oracle path; it is a soft scaffold.
  2. Unified objective. The reward signal is decomposed into inter-context (which strategy is producing better outcomes across rollouts) and intra-context (within a strategy, which rollout is better) components. The decomposition is the mechanism by which the policy learns both which strategies to prefer and how to execute each one well.
  3. Distillation back to base. Once useful strategies are discovered, a distillation objective transfers the strategy-conditioned behaviour back into the unconditional base policy. The deployed model is the distilled base, not the strategy-conditioned model. This is the practical contribution: the inference-time policy never needs to see the strategy context.
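
One way to picture the inter-context / intra-context split in component 2 is as a two-level baseline over a group of rollouts. The sketch below is an assumption made for illustration: this note does not record the paper's exact objective, and the function name decompose_advantages, the mean-based baselines, and the absence of any normalisation are all choices made here.

```python
from collections import defaultdict
from statistics import mean


def decompose_advantages(rollouts):
    """Split each rollout's centred reward into two parts.

    rollouts: list of (strategy, reward) pairs for one problem.
    Returns a list of (inter, intra) pairs in the same order, where
      inter = mean reward of the rollout's strategy - mean reward overall
      intra = the rollout's reward - mean reward of its strategy
    so inter + intra equals reward minus the overall mean, i.e. an
    unnormalised group-relative advantage. Illustrative assumption only.
    """
    by_strategy = defaultdict(list)
    for strategy, reward in rollouts:
        by_strategy[strategy].append(reward)
    global_mean = mean([reward for _, reward in rollouts])
    strategy_mean = {s: mean(rs) for s, rs in by_strategy.items()}
    result = []
    for strategy, reward in rollouts:
        inter = strategy_mean[strategy] - global_mean
        intra = reward - strategy_mean[strategy]
        result.append((inter, intra))
    return result


# Binary verifiable rewards for six rollouts over three hypothetical strategies.
rollouts = [
    ("cases", 1.0), ("cases", 0.0),
    ("recurrence", 1.0), ("recurrence", 1.0),
    ("invariant", 0.0), ("invariant", 0.0),
]
advantages = decompose_advantages(rollouts)
for (strategy, reward), (inter, intra) in zip(rollouts, advantages):
    print(f"{strategy:10s} reward={reward} inter={inter:+.2f} intra={intra:+.2f}")
```

In this toy group, "recurrence" gets a positive inter-context term because both of its rollouts succeed, while the two "cases" rollouts share a zero inter-context term and are separated only by the intra-context term.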

Connection to prior wiki context

CIPO (2026-05-18, the same day; the paper that converts failed on-policy trajectories into correction-oriented supervision via the model's own rollouts). NudgeRL and CIPO attack the same RLVR weakness (sparse reward, weak credit assignment) from opposite ends. NudgeRL changes what gets explored. CIPO changes how the failures from exploration are reused. The natural composition is strategy-nudged rollouts plus correction-oriented supervision on the failed strategies.

HeavySkill (2026-05-11, the paper that trained parallel-deliberation as an inner skill via RLVR, surfaced via DAIR.AI weekly). HeavySkill trains the model to deliberate by branching into parallel attempts and reconciling. NudgeRL's strategy-level conditioning is a related move at a different layer: instead of branching at inference time, it branches at training time and distills the result. Both treat exploration as a controllable training-time substrate.

RLVR Weak Supervision (2026-04-21, the paper arguing RLVR redistributes existing capability rather than expanding it). NudgeRL's structured-exploration claim is a candidate counter-argument: if structured exploration produces capability that simple rollout-budget scaling does not, then RLVR is doing more than redistribution. The pass@K behaviour of NudgeRL versus vanilla GRPO at matched rollout count would be the cleanest empirical test.
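
For the pass@K comparison suggested above, a common choice is the unbiased estimator introduced for code-generation evaluation (Chen et al., 2021): with n samples per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula; the function name and the example counts are illustrative and are not results from the NudgeRL paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number that are correct,
    k: budget being evaluated (1 <= k <= n).
    Returns the probability that at least one of k samples drawn without
    replacement from the n is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up counts for a single problem: 16 samples, 2 of them correct.
print(pass_at_k(16, 2, 1))
print(pass_at_k(16, 2, 8))
```

Averaging this quantity over problems for both methods, at the same n and across a range of k, gives the matched-rollout comparison: if the NudgeRL curve stays above the GRPO curve at large k, that would be evidence of more than redistribution.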

Worth Watching on 2026-05-17 noted cs.LG #10 "LLMs Gaming Verifiers". NudgeRL inherits the standard RLVR reward-hacking risk and adds a new attack surface: a policy could learn to game the strategy context itself. Whether the strategy decomposition makes hacking easier or harder is open.

Research angle

  1. Strategy pool design. The paper uses a fixed lightweight pool. How sensitive is the 8x efficiency claim to the pool composition? Falsifiable: ablation across strategy-pool sizes (10, 50, 200) and types (hand-curated, LLM-generated, mined-from-corpus).
  2. Generalisation beyond math. A suite of five math benchmarks is the right testbed for first-cut RLVR, but the strategy-level abstraction should transfer to code (where strategies are "iterate", "recurse", "use a hashmap") and agentic tasks (where strategies are "explore-then-act", "tool-call-then-verify"). Whether the 8x claim survives outside math is the deployment-relevant test.
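  3. Composition with CIPO. Combine strategy-nudged rollouts with correction-oriented supervision on the failed strategies, as outlined in the CIPO entry under "Connection to prior wiki context".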
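
The ablation proposed in item 1 can be written down as a small experiment grid. This is only a sketch of the sweep: the pool sizes and pool types come from the item above, while the seed list, the field names, and the function ablation_grid are hypothetical.

```python
from itertools import product

# Values taken from research angle 1.
POOL_SIZES = (10, 50, 200)
POOL_TYPES = ("hand-curated", "LLM-generated", "mined-from-corpus")


def ablation_grid(seeds=(0, 1, 2)):
    """Enumerate one config dict per (pool size, pool type, seed) cell.

    The seeds and the dictionary keys are illustrative assumptions.
    """
    return [
        {"pool_size": size, "pool_type": kind, "seed": seed}
        for size, kind, seed in product(POOL_SIZES, POOL_TYPES, seeds)
    ]


grid = ablation_grid()
print(len(grid), "runs")
print(grid[0])
```

Each cell would be trained at a fixed rollout budget and compared against the vanilla GRPO budget needed to reach the same accuracy, so that the efficiency ratio can be reported per pool configuration.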

Links