agentic-systems · 2026-04-21 · Tier 2

Reward-Free Self-Evolution: Agents That Learn Without Being Told What to Learn

Date: 2026-04-21
Source: HuggingFace Daily Papers
Paper: arxiv 2604.18131
Raw: raw/huggingface/2026-04-21-training-llm-agents-for-spontaneous-reward-free-self-evolution.md


TL;DR

The paper trains agents to explore an unknown environment on their own before executing a task, summarize what they find as "world knowledge," and condition on that knowledge at inference time, with no external rewards or human instructions at deployment. The training signal is outcome-based: how much does the self-generated world knowledge improve the task success rate? At inference, the agent relies on this trained behavior to explore and summarize without prompting. Applied to Qwen3-30B and Seed-OSS-36B, the method yields +20% on WebVoyager and WebWalker. A 14B Qwen3 model trained this way beats unassisted Gemini-2.5-Flash.
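The explore, summarize, then act flow described above can be sketched as a three-step episode. This is only an illustration of the control flow, not the paper's implementation: the function name `run_episode`, the three callables, and the toy "website" are all hypothetical, and in the paper a single trained model plays all three roles.

```python
from typing import Callable, Dict, List


def run_episode(
    explore: Callable[[], List[str]],
    summarize: Callable[[List[str]], str],
    act: Callable[[str, str], str],
    task: str,
) -> str:
    """Hypothetical inference-time flow: explore, summarize, then solve the task."""
    observations = explore()                    # 1. task-free exploration, no reward
    world_knowledge = summarize(observations)   # 2. self-written summary of the environment
    return act(task, world_knowledge)           # 3. task execution conditioned on the summary


# Toy stand-ins (assumptions for illustration only, not from the paper).
TOY_PAGES: Dict[str, str] = {
    "home": "links to docs and pricing",
    "pricing": "Pro plan is billed monthly",
}


def toy_explore() -> List[str]:
    # Visit every page and record what it says.
    return [f"{name}: {text}" for name, text in TOY_PAGES.items()]


def toy_summarize(observations: List[str]) -> str:
    # A real agent would compress; here the notes are simply joined.
    return " | ".join(observations)


def toy_act(task: str, knowledge: str) -> str:
    # Look the task keyword up in the notes; a trained model would reason instead.
    for fact in knowledge.split(" | "):
        if fact.startswith(task + ":"):
            return fact.split(": ", 1)[1]
    return "unknown"


answer = run_episode(toy_explore, toy_summarize, toy_act, "pricing")
print(answer)
```

The point of the sketch is the ordering: exploration and summarization happen before the task is attempted, and nothing outside the agent scores the summary at deployment.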


Key Findings

  • Training signal: outcome-based reward measuring how much the agent's own exploration improves its task success rate — not a hand-crafted reward for exploration quality
  • Two stages: (1) training with this reward teaches the model how to explore and summarize; (2) at inference, the model applies this skill spontaneously with no external reward
  • Results: +20% on WebVoyager and WebWalker for Qwen3-30B and Seed-OSS-36B
  • 14B beats Gemini-2.5-Flash: the knowledge generated by exploration enables a compact model to outperform an unassisted frontier model
  • The key shift: "native evolution" — the improvement mechanism is baked into the model's weights, not scaffolded by external systems
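One way to read the outcome-based training signal is as the change in task success rate when the agent is given its own world knowledge versus none. The sketch below assumes exactly that difference; the source summary does not state the paper's exact formula, and `Rollout`, `success_rate`, `knowledge_reward`, and the toy rollout are hypothetical names used only for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical rollout signature: (task, optional world knowledge) -> task succeeded?
Rollout = Callable[[str, Optional[str]], bool]


def success_rate(rollout: Rollout, tasks: Sequence[str], knowledge: Optional[str]) -> float:
    """Fraction of tasks solved when the agent is given `knowledge` (or none)."""
    if not tasks:
        raise ValueError("tasks must be non-empty")
    return sum(rollout(task, knowledge) for task in tasks) / len(tasks)


def knowledge_reward(rollout: Rollout, tasks: Sequence[str], knowledge: str) -> float:
    """Assumed reward: success rate with the knowledge minus success rate without it."""
    with_knowledge = success_rate(rollout, tasks, knowledge)
    without_knowledge = success_rate(rollout, tasks, None)
    return with_knowledge - without_knowledge


def toy_rollout(task: str, knowledge: Optional[str]) -> bool:
    # Toy stand-in for a real agent rollout: "easy" always succeeds; any other
    # task succeeds only if the knowledge text mentions the task keyword.
    return task == "easy" or (knowledge is not None and task in knowledge)


tasks = ["easy", "search", "checkout", "login"]
knowledge = "The site has a search box and a login form."
reward = knowledge_reward(toy_rollout, tasks, knowledge)
print(reward)  # 0.75 with knowledge - 0.25 without = 0.5
```

Under this reading, a summary that is fluent but does not help the downstream task earns zero reward, which is why no separate hand-crafted score for exploration quality is needed.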

Comparison to Related Work

System | How it improves | External supervision at inference?
--- | --- | ---
TRACER (04-17) | Accumulates production API logs as training data | No (surrogate learns from its own outputs)
AccelOpt (04-20) | Slow-fast kernel memory | No (LLM pattern-matches from its own memory)
Self-Evolution (today) | World knowledge exploration | No (baked into model weights)

All three systems break the dependency on human-curated data. The progression: TRACER uses implicit labels from production API logs, AccelOpt uses benchmark feedback, and Self-Evolution uses task success as the training signal. The meta-pattern: agents that generate and evaluate their own learning signal.


Open Questions

  • At what point does self-generated world knowledge become unreliable? The agent might explore confidently but incorrectly, generating false world models.
  • Does this generalize beyond web navigation (WebVoyager/WebWalker) to other agentic tasks (code generation, API use, data analysis)?
  • The 14B > Gemini-2.5-Flash result is striking, but does it hold on tasks outside the WebVoyager/WebWalker distribution?

Related Pages