Exploration and Exploitation Errors Are Measurable for Language Model Agents
TL;DR: This paper introduces a framework for quantifying exploration and exploitation errors in LM agents, using policy-agnostic metrics on controllable 2D grid environments. Even frontier models make substantial errors; reasoning models perform best, and both kinds of error can be reduced through harness engineering.
Key Findings
- Designed controllable 2D grid environments with partially observable maps and unknown task DAGs (directed acyclic graphs); map generation can be tuned to emphasize either exploration or exploitation difficulty.
- A policy-agnostic metric quantifies errors purely from observed actions, without access to the agent's internal policy.
- All frontier LMs struggle; different models show distinct failure modes (some over-explore, some over-exploit).
- Reasoning models (chain-of-thought, o-series style) solve the task more effectively.
- Both exploration and exploitation can be significantly improved with minimal harness engineering, suggesting that current failures stem from the environment and scaffolding rather than from fundamental capability gaps.
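To make the idea of a policy-agnostic metric concrete, here is a minimal sketch of how errors could be counted from an observed action trace alone. It is not the paper's metric: the `Step` fields, the error rules, and the function name `count_errors` are all assumptions made for illustration. The only property it shares with the findings above is that it reads nothing but what the agent could see and what it did.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Step:
    """One observed agent step. Every field is hypothetical, for illustration."""
    known_ready_task: bool       # before acting: was an unlocked task already visible?
    unseen_cells_remain: bool    # before acting: were parts of the map still unobserved?
    progressed_known_task: bool  # did the action advance a visible, unlocked task?
    revealed_new_cells: bool     # did the action uncover previously unseen cells?


def count_errors(trace: List[Step]) -> Dict[str, float]:
    """Classify each step from observations only (assumed rules, not the paper's)."""
    explore_err = 0
    exploit_err = 0
    for s in trace:
        if s.known_ready_task:
            # Something exploitable was visible; not progressing on it
            # is counted here as an exploitation error.
            if not s.progressed_known_task:
                exploit_err += 1
        elif s.unseen_cells_remain:
            # Nothing exploitable was visible; not uncovering new cells
            # is counted here as an exploration error.
            if not s.revealed_new_cells:
                explore_err += 1
    n = len(trace)
    return {
        "exploration_error_rate": explore_err / n if n else 0.0,
        "exploitation_error_rate": exploit_err / n if n else 0.0,
    }


if __name__ == "__main__":
    trace = [
        Step(False, True, False, False),  # wandered over already-seen cells
        Step(False, True, False, True),   # uncovered new cells
        Step(True, True, False, False),   # ignored a visible, unlocked task
        Step(True, False, True, False),   # advanced the visible task
    ]
    print(count_errors(trace))
```

Because the rules depend only on per-step observations and actions, the same function can score any agent, whether an LM, a scripted baseline, or a human, without inspecting how the action was chosen. Under these assumed rules, an over-exploring agent would show a high exploitation error rate and an over-exploiting agent a high exploration error rate.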
Related Pages
Raw source: ../../raw/huggingface/2026-04-16-exploration-and-exploitation-errors-are-measurable-for-langu.md