Reward-Free Self-Evolution: Agents That Learn Without Being Told What to Learn

Date: 2026-04-21
Source: HuggingFace Daily Papers
Paper: arxiv 2604.18131
Raw: raw/huggingface/2026-04-21-training-llm-agents-for-spontaneous-reward-free-self-evolution.md

TL;DR

The paper trains agents to spontaneously explore an unknown environment before executing a task, summarize what they learned as "world knowledge," and then use that knowledge at inference time, with no external rewards or human instructions at deployment. The training signal is outcome-based: how much does the self-generated world knowledge improve the task success rate? At inference, the agent relies only on this trained behavior to explore and summarize. Applied to Qwen3-30B and Seed-OSS-36B, the method yields +20% on WebVoyager and WebWalker. A 14B Qwen3 model trained this way beats unassisted Gemini-2.5-Flash.
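
The inference-time flow described above (explore, summarize, then act) can be sketched roughly as follows. This is only an illustration: the class name, the three callables, and the toy environment are hypothetical and do not come from the paper. In the paper a single trained model performs all three steps; they are split into separate functions here only to make the order of operations visible.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces, assumed for illustration only.
Explore = Callable[[str], List[str]]      # environment id -> raw observations
Summarize = Callable[[List[str]], str]    # observations -> "world knowledge" text
Act = Callable[[str, str], str]           # (task, world knowledge) -> answer


@dataclass
class SelfEvolvingAgent:
    explore: Explore
    summarize: Summarize
    act: Act

    def run(self, environment: str, task: str) -> str:
        # 1. Explore the unfamiliar environment before attempting the task.
        observations = self.explore(environment)
        # 2. Compress the observations into world knowledge.
        knowledge = self.summarize(observations)
        # 3. Solve the task with that knowledge in context.
        #    No reward or human instruction is consulted at this point.
        return self.act(task, knowledge)


# Toy stand-ins so the sketch runs end to end.
agent = SelfEvolvingAgent(
    explore=lambda env: [
        f"{env}: search box on the home page",
        f"{env}: filters in the left sidebar",
    ],
    summarize=lambda obs: " | ".join(obs),
    act=lambda task, knowledge: f"task={task}; using knowledge=[{knowledge}]",
)
result = agent.run("shop.example", "find the cheapest laptop")
print(result)
```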

Key Findings

Training signal: outcome-based reward measuring how much the agent's own exploration improves its task success rate — not a hand-crafted reward for exploration quality
Two stages: (1) training with this reward teaches the model how to explore and summarize; (2) at inference, the model applies this skill spontaneously with no external reward
Results: +20% on WebVoyager and WebWalker for Qwen3-30B and Seed-OSS-36B
14B beats Gemini-2.5-Flash: the world knowledge generated by exploration lets a 14B Qwen3 model outperform an unassisted frontier model
The key shift: "native evolution" — the improvement mechanism is baked into the model's weights, not scaffolded by external systems
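
The outcome-based training signal from the first finding can be illustrated with a small sketch. The source only says that the reward measures how much the agent's own world knowledge improves its task success rate; the exact form is not given here. The version below assumes a plain difference in success rates between rollouts with and without the knowledge, and the function names are hypothetical.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of rollouts that solved the task."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return sum(outcomes) / len(outcomes)


def knowledge_gain_reward(
    with_knowledge: Sequence[bool],
    without_knowledge: Sequence[bool],
) -> float:
    """Assumed reward: success-rate gain attributable to the world knowledge.

    Positive when the self-generated knowledge helps, negative when it hurts.
    """
    return success_rate(with_knowledge) - success_rate(without_knowledge)


# Four rollouts of the same task, with and without the generated knowledge.
reward = knowledge_gain_reward(
    with_knowledge=[True, True, True, False],      # 3 of 4 succeed
    without_knowledge=[True, False, False, False], # 1 of 4 succeed
)
print(reward)
```

Under this assumption, the reward never scores the exploration itself. It only credits knowledge that changes downstream outcomes, which matches the "not a hand-crafted reward for exploration quality" point above.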

Comparison to Related Work

System	How it improves	External supervision at inference?
TRACER (04-17)	Accumulates production API logs as training data	No (surrogate learns from its own outputs)
AccelOpt (04-20)	Slow-fast kernel memory	No (LLM pattern-matches from its own memory)
Self-Evolution (04-21)	World knowledge exploration	No (baked into model weights)

All three are systems that break the dependency on human-curated data. The progression: TRACER uses implicit labels from production, AccelOpt uses benchmark feedback, Self-Evolution uses task success as the training signal. The meta-pattern: agents that generate and evaluate their own learning signal.

Open Questions

At what point does self-generated world knowledge become unreliable? The agent might explore confidently but incorrectly, generating false world models.
Does this generalize beyond web navigation (WebVoyager/WebWalker) to other agentic tasks (code generation, API use, data analysis)?
The 14B > Gemini-2.5-Flash result is striking — but how does it hold on tasks outside the WebVoyager/WebWalker distribution?
