UniDoc-RL: RL-Based Visual RAG with Hierarchical Actions

Date: 2026-04-19
Source: HuggingFace Daily Papers
Paper: arxiv 2604.14967
Raw: raw/huggingface/2026-04-19-unidoc-rl-coarse-to-fine-visual-rag-hierarchical-actions-dense.md

TL;DR

UniDoc-RL trains a large vision-language model (LVLM) as an RL agent that performs visual retrieval, reranking, and active perception in a single hierarchical loop. Instead of retrieving fixed chunks, the agent progressively refines its evidence from coarse document retrieval → fine-grained image selection → active region cropping. Training uses dense, per-level rewards with GRPO, which needs no separate value network. Reported gains are up to 17.7% over prior RL-based visual RAG methods.

Key Findings

Unified agent: one LVLM jointly performs retrieval, reranking, image selection, and region cropping, with no separate module for each step
Hierarchical action space: three tiers of action, coarse (document), medium (image), and fine (region crop), modeled as a sequential decision problem
Dense rewards: instead of a single sparse end-task reward, each action level gets its own reward signal, which addresses credit assignment over long retrieval trajectories
GRPO training: builds on Group Relative Policy Optimization (GRPO), which needs no value network (same training family as LongAct, AIMO 3)
Results: gains of up to 17.7% across three multimodal QA benchmarks vs. prior RL-based visual RAG baselines
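
To make the "dense rewards plus GRPO" finding concrete, here is a minimal sketch of group-relative advantages computed separately per action level. GRPO samples a group of rollouts for the same query and scores each one against the group's mean and spread, which is why no value network is needed. The level names, the example reward values, and the choice of population standard deviation are assumptions for illustration; the paper's actual reward definitions may differ.

```python
from statistics import mean, pstdev

# Hypothetical level names, one per tier of the action hierarchy.
LEVELS = ("document", "image", "region")


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one group of scalar rewards to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_level_advantages(group):
    """group: list of dicts, one per rollout, mapping level name -> reward."""
    return {
        level: group_relative_advantages([rollout[level] for rollout in group])
        for level in LEVELS
    }


# Four rollouts for the same query, with made-up per-level rewards.
group = [
    {"document": 1.0, "image": 1.0, "region": 0.0},
    {"document": 1.0, "image": 0.0, "region": 0.0},
    {"document": 0.0, "image": 1.0, "region": 1.0},
    {"document": 0.0, "image": 0.0, "region": 0.0},
]
advantages = per_level_advantages(group)
```

Because each level is normalized within its own group, a rollout that picked the right document but cropped the wrong region receives a positive signal for the first decision and a negative one for the last, rather than a single blended score.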

Mechanism

Standard visual RAG: retrieve fixed-size chunks → feed to LLM → generate answer. The agent never controls what it sees within a retrieved document.

UniDoc-RL's agent sees a query and decides:

Which documents to retrieve (coarse)
Which images within those documents are relevant (medium)
Which regions of those images to crop and zoom into (fine)

Each of these is a discrete action. GRPO updates the policy using group-relative rewards at each level. The result is an agent that actively looks where the evidence is, rather than passively consuming whatever the retriever returned.
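
The three decisions above can be sketched as a coarse-to-fine selection loop. This is a toy illustration only: the learned LVLM policy is replaced by a simple word-overlap heuristic, regions are represented by the text visible in them rather than by pixels, and every class, function, and field name here is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Image:
    image_id: str
    regions: dict  # region name -> text visible there (toy stand-in for pixels)


@dataclass
class Document:
    doc_id: str
    images: list


def overlap(query, text):
    """Toy relevance score: number of lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def image_text(image):
    return " ".join(image.regions.values())


def doc_text(doc):
    return " ".join(image_text(image) for image in doc.images)


def coarse_to_fine(query, corpus):
    """Pick a document, then an image inside it, then a region inside that image."""
    doc = max(corpus, key=lambda d: overlap(query, doc_text(d)))              # coarse
    image = max(doc.images, key=lambda i: overlap(query, image_text(i)))      # medium
    region = max(image.regions, key=lambda r: overlap(query, image.regions[r]))  # fine
    return doc.doc_id, image.image_id, region


corpus = [
    Document("report_a", [
        Image("a1", {"top": "quarterly revenue chart", "bottom": "footnotes and disclaimers"}),
        Image("a2", {"left": "team photo", "right": "office map"}),
    ]),
    Document("report_b", [
        Image("b1", {"top": "rainfall by month", "bottom": "temperature by month"}),
    ]),
]
selection = coarse_to_fine("quarterly revenue", corpus)
```

Each stage narrows the candidate set for the next one, so an error at the document level cannot be repaired later in the same trajectory. That dependency is the reason per-level reward signals are useful for training.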

Comparison to Related Work

System	Retrieval	Action space	Training
Standard RAG	Dense retrieval	None (passive consumer)	SFT
Corpus2Skill	LLM navigation	Document tree traversal	SFT
UniDoc-RL	RL agent	Hierarchical (doc → image → region)	RL (GRPO)

UniDoc-RL is complementary to Corpus2Skill. Corpus2Skill compiles the corpus offline and navigates statically. UniDoc-RL trains the agent to actively adapt retrieval at query time.
