Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks

Date: 2026-05-13
Source: arXiv 2605.11436 · HuggingFace Daily Papers
Tier: 2. Agentic systems, long-horizon partial observability, belief-state representation
Raw: raw/huggingface/2026-05-13-agent-brace-decoupling-beliefs-from-actions-in-long-horizon.md

TL;DR

Long-horizon agents in partially observable environments face two problems: they must maintain uncertainty over the hidden world state, and their context grows without bound. Agent-BRACE splits the agent into a belief-state model and a policy model, trained jointly via RL. The belief-state model outputs a structured approximation of the belief: a set of atomic natural-language claims about the environment, each tagged with an ordinal verbalized certainty label (certain ... unknown). The policy conditions on this compact belief representation rather than the raw history. Result: +14.5% absolute on Qwen2.5-3B-Instruct and +5.3% on Qwen3-4B-Instruct over RL baselines on embodied long-horizon tasks, with a near-constant context window. The learned belief becomes increasingly calibrated as evidence accumulates within an episode.

Why it matters

The standard POMDP solution is a posterior belief state, which has not had a clean LLM-native instantiation. Most agent frameworks just dump the full interaction history into the context window and let the model figure out what's relevant. That dilutes attention (see FocuSFT today on the training-side cause of dilution) and makes context grow linearly with episode length. Agent-BRACE proposes a structured, compact, learned belief representation. The verbalized certainty labels make the belief inspectable and analyzable.
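For reference, the posterior belief that a structured representation like this stands in for is the textbook POMDP belief update. The notation below (transition model T, observation model O, belief b_t over hidden states s) is standard POMDP notation assumed here, not taken from the paper:

```latex
\[
b_{t+1}(s') =
\frac{O(o_{t+1} \mid s', a_t) \sum_{s} T(s' \mid s, a_t)\, b_t(s)}
     {\sum_{s''} O(o_{t+1} \mid s'', a_t) \sum_{s} T(s'' \mid s, a_t)\, b_t(s)}
\]
```

Evaluating this exactly needs an enumerable state space and known T and O. A text-based agent has neither, which is why an approximate, learned belief is the practical option.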

Mechanism

Belief-state model: takes the past observations and actions and produces a structured belief: a list of atomic claims about world attributes, each with a certainty tag. Trained jointly with the policy via RL on episode outcomes.
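A minimal sketch of what such a structured belief could look like as a data structure. Only the endpoints "certain" and "unknown" (and "likely", mentioned later in this note) come from the source; the intermediate label, the class names, and the rendering format are assumptions made for illustration, not the paper's schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class Certainty(IntEnum):
    # Ordinal scale. UNLIKELY is an assumed intermediate label;
    # the note only names "certain", "likely", and "unknown".
    UNKNOWN = 0
    UNLIKELY = 1
    LIKELY = 2
    CERTAIN = 3


@dataclass(frozen=True)
class Claim:
    text: str             # one atomic natural-language claim
    certainty: Certainty  # ordinal verbalized certainty tag


def render_belief(claims):
    """Serialize claims one per line, most certain first."""
    ordered = sorted(claims, key=lambda c: c.certainty, reverse=True)
    return "\n".join(f"[{c.certainty.name.lower()}] {c.text}" for c in ordered)


# Hypothetical example claims.
belief = [
    Claim("the key is in the drawer", Certainty.LIKELY),
    Claim("the door is locked", Certainty.CERTAIN),
    Claim("the kitchen contains a knife", Certainty.UNKNOWN),
]
print(render_belief(belief))
```

Using an ordered enum rather than free-form strings keeps the labels comparable, which is what makes the belief easy to sort, filter, and audit.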

Policy model: conditions only on the current observation plus the belief representation, not on full history. Selects actions under the explicit uncertainty encoded in the belief.
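The policy's input can be sketched as a prompt built from the current observation and the belief alone. The function name, the prompt template, and the example strings are hypothetical; the paper's actual prompt format is not given in this note.

```python
def build_policy_prompt(task, belief, observation):
    """Assemble the policy input from task, belief, and current observation.

    `belief` is a list of (claim_text, certainty_label) pairs.
    No past observations or actions are included.
    """
    belief_lines = "\n".join(f"- ({label}) {text}" for text, label in belief)
    return (
        f"Task: {task}\n"
        f"Belief:\n{belief_lines}\n"
        f"Current observation: {observation}\n"
        "Next action:"
    )


# Hypothetical example inputs.
prompt = build_policy_prompt(
    task="unlock the door",
    belief=[
        ("the door is locked", "certain"),
        ("the key is in the drawer", "likely"),
    ],
    observation="You are standing next to a closed drawer.",
)
print(prompt)
```

Because the interaction history never enters the prompt, everything the policy knows about the past has to pass through the belief, which is the decoupling the title refers to.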

The "near-constant context window" claim is the load-bearing engineering result. Standard long-horizon agents have context length proportional to episode length. Agent-BRACE keeps context length near-constant because the belief representation is a compact structured summary that gets updated in place, not appended to.
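The difference between appending and updating can be shown with a toy simulation. The rule-based `update_belief` below is a stand-in for the learned belief-state model, and the environment (three attributes revisited over twelve steps) is invented for illustration.

```python
def update_belief(belief, observation):
    """Toy stand-in for the learned belief model: one slot per attribute."""
    attribute, value = observation
    new_belief = dict(belief)
    new_belief[attribute] = (value, "certain")  # overwrite, do not append
    return new_belief


def simulate(observations):
    history = []  # append-only baseline context
    belief = {}   # belief keyed by attribute, updated in place
    history_sizes, belief_sizes = [], []
    for obs in observations:
        history.append(obs)
        belief = update_belief(belief, obs)
        history_sizes.append(len(history))
        belief_sizes.append(len(belief))
    return history_sizes, belief_sizes


# Twelve steps that keep revisiting the same three attributes.
attributes = ["door", "drawer", "lamp"]
states = ["open", "closed"]
observations = [(attributes[t % 3], states[t % 2]) for t in range(12)]

history_sizes, belief_sizes = simulate(observations)
print(history_sizes[-1], belief_sizes[-1])  # prints "12 3"
```

The history grows by one entry per step, while the belief is bounded by the number of distinct attributes being tracked. In a real environment that number is not strictly fixed, which is consistent with the "near-constant" wording.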

Relation to prior wiki

EvoChamber / Test-time co-evolution (today HF) — long-horizon multi-agent co-evolution. Agent-BRACE adds the belief-state representation that EvoChamber's agents could use.
On-Policy Self-Evolution via Failure Trajectories (today HF) — self-improvement on long-horizon tasks. Both papers attack the long-horizon agent training problem from different angles; Agent-BRACE adds the state-abstraction primitive that failure-trajectory learning could compose with.
Useful Memories Become Faulty (today) — LLM-rewritten consolidation degrades agent memory. Agent-BRACE's belief representation is a different form of state compression, structured ordinal certainty rather than narrative rewrite. The verbalized-certainty design may be exactly the consolidation gating mechanism the Useful Memories paper called for.
VAKRA Agent Reasoning Failure Modes (2026-04-16) — early paper identifying that agents fail in named ways on long-horizon tasks. Agent-BRACE addresses one of those modes (state-tracking failures) with an explicit belief representation.

Research angle

Two open questions. (1) Calibration of the verbalized certainty labels. The paper reports that beliefs become increasingly calibrated within an episode; whether that holds across episodes (does the agent generalize what "certain" vs "likely" means?) is open. (2) Composition with AgentRunbook-C from LongMemEval-V2 (today). Agent-BRACE handles within-episode belief; AgentRunbook-C handles cross-episode environment knowledge. The joint design (belief-state for current episode, file-system memory for environment experience) is the natural next architectural object.
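One way to probe question (1) is to measure, per certainty label, how often claims carrying that label turn out to be true, and to compare those numbers across episodes. The sketch below assumes ground-truth claim correctness is available from the simulator; the label names, the records, and both helper functions are hypothetical and are not the paper's evaluation protocol.

```python
from collections import defaultdict


def accuracy_by_label(records):
    """records: iterable of (label, is_true) pairs -> {label: fraction true}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for label, is_true in records:
        totals[label] += 1
        correct[label] += int(is_true)
    return {label: correct[label] / totals[label] for label in totals}


def is_monotone(accuracy, ordered_labels):
    """True if accuracy never drops as the label gets more confident."""
    values = [accuracy[l] for l in ordered_labels if l in accuracy]
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical records: (certainty label, was the claim actually true?).
# "unknown" is left out because it asserts nothing that can be scored.
records = (
    [("certain", True)] * 4
    + [("likely", True)] * 3 + [("likely", False)]
    + [("unlikely", True)] + [("unlikely", False)] * 3
)

accuracy = accuracy_by_label(records)
print(accuracy)
print(is_monotone(accuracy, ["unlikely", "likely", "certain"]))
```

If the per-label accuracies computed on separate episodes stay close to each other, the labels carry a stable meaning. Large drift between episodes would suggest the calibration is only local to an episode.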

Why Tier 2

Compact belief-state representations with verbalized certainty labels are a small architectural primitive that could be widely adopted. The +14.5% / +5.3% absolute gains are real but modest. The interesting move is the structural one: decouple belief from action, train both, keep context near-constant.