Reward-Free Self-Evolution: Agents That Learn Without Being Told What to Learn
Date: 2026-04-21
Source: HuggingFace Daily Papers
Paper: arxiv 2604.18131
Raw: raw/huggingface/2026-04-21-training-llm-agents-for-spontaneous-reward-free-self-evolution.md
TL;DR
Trains agents to spontaneously explore unknown environments before task execution, summarize what they learned as "world knowledge," and use that knowledge internally at inference time — with no external rewards or human instructions at deployment. The training signal is outcome-based: how much does the self-generated world knowledge improve task success rate? At inference, the agent just uses its trained instinct to explore and summarize. Applied to Qwen3-30B and Seed-OSS-36B: +20% on WebVoyager and WebWalker. A 14B Qwen3 model with this training beats unassisted Gemini-2.5-Flash.
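The explore, summarize, act sequence described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: in the paper a trained LLM performs all three phases on real websites, whereas this sketch uses a hard-coded link graph (`SITE`) and plain breadth-first search, purely to show how knowledge gathered before the task arrives is reused when acting.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical toy "website": page name -> pages it links to.
SITE: Dict[str, List[str]] = {
    "home": ["products", "about"],
    "products": ["pricing"],
    "about": [],
    "pricing": [],
}


def explore(site: Dict[str, List[str]], start: str, budget: int) -> List[Tuple[str, List[str]]]:
    """Phase 1: visit up to `budget` pages breadth-first, with no task given yet."""
    seen, queue, observations = {start}, deque([start]), []
    while queue and len(observations) < budget:
        page = queue.popleft()
        links = site.get(page, [])
        observations.append((page, links))
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return observations


def summarize(observations: List[Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Phase 2: compress raw observations into reusable world knowledge."""
    return {page: links for page, links in observations}


def act(knowledge: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Phase 3: solve a task using only the summarized knowledge."""
    queue, visited = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in knowledge.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # the knowledge gathered so far does not cover the goal


knowledge = summarize(explore(SITE, "home", budget=10))
route = act(knowledge, "home", "pricing")
print(route)
```

With a generous exploration budget the agent finds a route to `pricing`; with `budget=1` it only observes the home page, so `act` returns `None`. That gap between the two outcomes is the kind of difference an outcome-based training signal can measure.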
Key Findings
- Training signal: outcome-based reward measuring how much the agent's own exploration improves its task success rate — not a hand-crafted reward for exploration quality
- Two stages: (1) training with this reward teaches the model how to explore and summarize; (2) at inference, the model applies this skill spontaneously with no external reward
- Results: +20% on WebVoyager and WebWalker for Qwen3-30B and Seed-OSS-36B
- 14B beats Gemini-2.5-Flash: with self-generated world knowledge, a 14B Qwen3 model outperforms an unassisted frontier model
- The key shift: "native evolution" — the improvement mechanism is baked into the model's weights, not scaffolded by external systems
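The outcome-based training signal in the findings above could be computed along these lines. This is a minimal sketch under a stated assumption: the reward is taken to be the difference between success rate with and without the self-generated knowledge. This note does not give the paper's exact formula, and `toy_solve`, the task tuples, and the fact names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Set, Tuple

Task = Tuple[str, bool]  # hypothetical: (fact the task needs, solvable without knowledge?)
Solver = Callable[[Task, Optional[Set[str]]], bool]


def success_rate(solve: Solver, tasks: Sequence[Task], knowledge: Optional[Set[str]]) -> float:
    """Fraction of tasks solved when the agent is given `knowledge` (or None)."""
    if not tasks:
        raise ValueError("need at least one task to estimate a success rate")
    return sum(1 for task in tasks if solve(task, knowledge)) / len(tasks)


def knowledge_gain_reward(solve: Solver, tasks: Sequence[Task], knowledge: Set[str]) -> float:
    """Assumed reward: success rate with knowledge minus the no-knowledge baseline."""
    return success_rate(solve, tasks, knowledge) - success_rate(solve, tasks, None)


def toy_solve(task: Task, knowledge: Optional[Set[str]]) -> bool:
    """Hypothetical solver: succeeds on easy tasks or when the needed fact is known."""
    needed_fact, easy = task
    return easy or (knowledge is not None and needed_fact in knowledge)


tasks = [
    ("login_url", False),
    ("search_box", False),
    ("home_link", True),
    ("cart_page", False),
]
useful_knowledge = {"login_url", "search_box"}
reward = knowledge_gain_reward(toy_solve, tasks, useful_knowledge)
print(reward)
```

Under this assumption, exploration that yields no task-relevant facts earns zero reward, so the model is never scored on how its exploration looks, only on whether the resulting summary helps.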
Comparison to Related Work
| System | How it improves | External supervision at inference? |
|---|---|---|
| TRACER (04-17) | Accumulates production API logs as training data | No (surrogate learns from its own outputs) |
| AccelOpt (04-20) | Slow-fast kernel memory | No (LLM pattern-matches from its own memory) |
| Self-Evolution (04-21) | Explores the environment and summarizes world knowledge before acting | No (skill is baked into model weights) |
All three systems break the dependency on human-curated data. The progression: TRACER uses implicit labels from production, AccelOpt uses benchmark feedback, and Self-Evolution uses task success as the training signal. The meta-pattern: agents that generate and evaluate their own learning signal.
Open Questions
- At what point does self-generated world knowledge become unreliable? The agent might explore confidently but incorrectly, generating false world models.
- Does this generalize beyond web navigation (WebVoyager/WebWalker) to other agentic tasks (code generation, API use, data analysis)?
- The 14B > Gemini-2.5-Flash result is striking, but how well does it hold up on tasks outside the WebVoyager/WebWalker distribution?