Value Gradient Flow: RL as Optimal Transport

Date: 2026-04-19
Source: HuggingFace Daily Papers
Paper: arxiv 2604.14265
Raw: raw/huggingface/2026-04-19-reinforcement-learning-via-value-gradient-flow.md

TL;DR

VGF (Value Gradient Flow) reframes behavior-regularized RL, the setting shared by offline RL and LLM fine-tuning, as an optimal transport problem. Instead of learning an explicit policy, it transports samples from the reference distribution (e.g. the base LLM) toward the value-optimal distribution using a gradient flow. There is no policy parameterization, and test-time compute scales adaptively by adjusting the transport budget. Reported results: state-of-the-art on D4RL and OGBench, competitive on LLM RL benchmarks.
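
For orientation, the textbook KL-regularized form of behavior-regularized RL can be written as below. This is standard background rather than the paper's exact formulation; the symbols $Q$ (value function), $\pi_{\mathrm{ref}}$ (reference policy), $\beta$ (regularization strength), and $\mathcal{D}$ (state distribution) are notation assumed here for illustration.

```latex
% KL-regularized objective (assumed notation, standard background)
\max_{\pi}\;
  \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\bigl[ Q(s, a) \bigr]
  \;-\; \beta\, \mathbb{E}_{s \sim \mathcal{D}}\Bigl[
    \mathrm{KL}\bigl( \pi(\cdot \mid s) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid s) \bigr)
  \Bigr]

% Its well-known closed-form maximizer: the reference policy reweighted by exponentiated value
\pi^{*}(a \mid s) \;\propto\; \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\bigl( Q(s, a) / \beta \bigr)
```

The closed form shows what "value-optimal distribution" means in this setting: reference samples tilted toward high-value regions, with $\beta$ controlling how far the tilt goes.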

Key Findings

Reformulation: behavior-regularized RL (KL-constrained policy optimization) is mathematically equivalent to finding an optimal transport map from the reference distribution to the value-optimal policy distribution
Discrete gradient flow: particles (candidate outputs) are initialized from the reference distribution and moved toward high-value regions via value gradients — no explicit policy network required
Implicit regularization: the transport budget controls how far particles move from reference — larger budget = more aggressive optimization = more divergence from reference
Test-time scaling: at inference, you can run more transport steps to get better quality at higher compute cost — analogous to more "thinking time" but geometrically grounded
No reparameterization needed: avoids the reparameterization trick that pathwise (reparameterized) policy-gradient methods use to backpropagate value gradients through sampled actions, a requirement that makes those methods hard to scale to large LLMs
Results: SOTA on D4RL and OGBench (offline RL benchmarks), competitive on LLM RL tasks
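
The particle view in the findings above can be sketched on a toy problem. This is a minimal illustration, not the paper's algorithm: the quadratic `value` function, the `target` point, the Gaussian stand-in for the reference distribution, and the plain explicit-Euler update are all assumptions made for the example.

```python
import numpy as np

def value(a, target):
    """Toy value function (assumed): higher is better, peaked at `target`."""
    return -0.5 * np.sum((a - target) ** 2, axis=-1)

def value_grad(a, target):
    """Analytic gradient of `value` with respect to the action `a`."""
    return target - a

def transport(particles, target, n_steps, step_size):
    """Move particles along the value gradient with explicit Euler steps.

    n_steps * step_size plays the role of the transport budget here.
    """
    a = particles.copy()
    for _ in range(n_steps):
        a = a + step_size * value_grad(a, target)
    return a

rng = np.random.default_rng(0)
reference = rng.normal(size=(256, 2))  # stand-in for samples from the reference policy
target = np.array([3.0, -1.0])         # hypothetical high-value region

for n_steps in (0, 5, 20):
    moved = transport(reference, target, n_steps, step_size=0.1)
    shift = np.linalg.norm(moved - reference, axis=-1).mean()
    print(f"steps={n_steps:2d}  mean value={value(moved, target).mean():.3f}  "
          f"mean shift from reference={shift:.3f}")
```

In this toy, both the mean value and the mean distance from the reference samples grow with the number of steps, which mirrors the trade-off described under implicit regularization: a larger budget buys more value at the cost of more divergence.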

Why It Matters

Standard LLM RL fine-tuning (RLHF, GRPO, DAPO, PPO) requires an explicitly parameterized policy and backpropagation through its log-probabilities. This is computationally heavy and can be numerically unstable at scale. VGF sidesteps this by treating the optimization as moving probability mass rather than tuning policy parameters, a cleaner mathematical framing that may be easier to scale.

The test-time compute dimension is particularly interesting: rather than generating more samples and voting (which AIMO 3 showed is limited by capability, not diversity), VGF can apply more transport steps to a single particle, concentrating compute on refinement rather than sampling.
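
The contrast between sampling more and refining one particle can be made concrete with a deliberately simple 1-D toy. Everything here is an assumption for illustration: the value peak at `PEAK = 4.0` is placed far from the reference mean on purpose, and treating one gradient step as costing the same as one sample is a simplification, not a claim about real LLM compute.

```python
import numpy as np

PEAK = 4.0  # hypothetical location of the best answer, far from the reference mean of 0

def value(a):
    """Toy value (assumed): negative squared distance to PEAK."""
    return -(a - PEAK) ** 2

def best_of_n(rng, n):
    """Draw n samples from the reference N(0, 1) and keep the highest-value one."""
    samples = rng.normal(size=n)
    return samples[np.argmax(value(samples))]

def refine(a, n_steps, step_size=0.1):
    """Apply n_steps of gradient ascent on the toy value to a single particle."""
    for _ in range(n_steps):
        a = a + step_size * (-2.0 * (a - PEAK))  # gradient of value is -2 (a - PEAK)
    return a

rng = np.random.default_rng(0)
for budget in (1, 4, 16, 64):
    picked = best_of_n(rng, budget)         # spend the budget on extra samples
    refined = refine(rng.normal(), budget)  # spend the budget on transport steps
    print(f"budget={budget:2d}  best-of-n value={value(picked):8.3f}  "
          f"refined value={value(refined):8.3f}")
```

Best-of-n can only return points the reference already produces, so its value plateaus when the peak lies in the reference's tail, while the refined particle keeps approaching the peak. Whether this carries over to real tasks depends on how reliable the value gradients are, which the toy does not model.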

Connection to Prior Work

vs. rejection sampling: VGF moves particles smoothly rather than accepting/rejecting; doesn't get stuck at support boundaries
vs. RLHF/PPO: eliminates the explicit policy network (a value function is still needed to supply the gradients); regularization is implicit through the transport budget rather than an explicit KL penalty
vs. GRPO/DAPO: those methods still require explicit policy gradient; VGF uses value gradients on particles without needing a parameterized policy
Optimal transport in ML: OT has been used in image generation (Wasserstein GANs, flow matching); VGF brings it to the RL/RLHF space
