SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

Source: HuggingFace Daily Papers · arXiv 2605.14051
Raw: farmer file
Tier: 2 (agent planning, DAG validation, cost control)

TL;DR

LLM planners frequently produce structurally invalid or unnecessarily long workflows, causing brittle failures and avoidable tool and API cost. SPIN is a planning wrapper that enforces a strict DAG contract through validation and repair prompting: the plan must parse as a valid directed acyclic graph, and if it does not, it is repaired before execution. SPIN then evaluates DAG prefixes incrementally and stops execution as soon as the current prefix already answers the query. On AssetOpsBench (261 scenarios), executed tasks drop from 1061 to 623, Accomplished rises from 0.638 to 0.706, and tool calls per run drop from 11.81 to 6.82. On MCP Bench the same wrapper improves planning, grounding, and dependency-related scores for GPT OSS1 and Llama 4 Maverick.

Why it matters

Two days ago, AssetOpsBench was the benchmark that exposed a measurement crisis: a Spearman correlation of -0.13 between standard accuracy and the industrial Accomplished metric (2026-05-14 digest). SPIN is the first published wrapper that improves both execution cost and Accomplished on that benchmark, and it arrived in the same week. The Accomplished improvement is modest but real (0.638 → 0.706), and it comes from a structural rather than a learned mechanism: parse, validate, repair, then short-circuit when the prefix is sufficient.

The deeper read is that DAG planning has been the obvious solution for two years, and nobody pulled it through to production. SPIN is what it looks like when the obvious wrapper finally ships with empirical numbers on the right benchmark.

Connections to prior wiki state

AssetOpsBench's -0.13 correlation finding (2026-05-14 digest) — SPIN moves Accomplished from 0.638 to 0.706 on the same benchmark. Whether the Accomplished improvement corresponds to a closure of the accuracy-Accomplished gap, or whether standard accuracy went up too, is the load-bearing measurement question. AssetOpsBench is the right place to ask it.
WildClawBench harness sensitivity (2026-05-15) — 18-point harness spread. SPIN is a harness modification. Its effect would be expected to be at least that large under WildClawBench's native-runtime evaluation. Untested.
Map-Then-Act paradigm (2026-05-14) — separates exploration / mapping from execution. SPIN's prefix-evaluation is structurally similar: evaluate the plan incrementally, stop when sufficient. Two papers on the same mechanism at different abstraction levels in 48 hours.
Agent-Brace (2026-05-13) and AgentLens (2026-05-14) — both about wrapping or measuring agent execution. SPIN is the closest the wiki has seen to an execution-time wrapper that directly improves cost / reliability without retraining.
MCP Bench — SPIN's improvements on MCP Bench fit the agentic-tooling thread (MCP server selection as routing problem, Ken Huang Ch 13).

How it works

Stage 1 — DAG contract enforcement. The LLM planner produces a workflow proposal. _validate_plan_text parses it as a DAG, checking acyclicity, well-formed dependencies, and node legality. Invalid plans trigger repair prompting: the planner is shown the validation error and asked to fix it. This loop continues until the plan parses or budget is exhausted.
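The validate-and-repair loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the JSON plan format (node id mapped to a "tool" and a list of "deps"), the names validate_plan and plan_with_repair, the planner callable, and the max_repairs budget are all hypothetical stand-ins for whatever _validate_plan_text and the real repair prompt actually use.

```python
import json
from graphlib import CycleError, TopologicalSorter


def validate_plan(plan_text, legal_tools):
    """Return (plan, None) if the text is a valid DAG plan, else (None, error)."""
    try:
        plan = json.loads(plan_text)
    except json.JSONDecodeError as exc:
        return None, f"plan is not valid JSON: {exc}"
    if not isinstance(plan, dict):
        return None, "plan must be a JSON object mapping node ids to nodes"

    for node_id, node in plan.items():
        if not isinstance(node, dict):
            return None, f"node {node_id!r} must be an object"
        # Node legality: every node must call a known tool.
        if node.get("tool") not in legal_tools:
            return None, f"node {node_id!r} uses unknown tool {node.get('tool')!r}"
        # Well-formed dependencies: every dependency must be a defined node.
        for dep in node.get("deps", []):
            if dep not in plan:
                return None, f"node {node_id!r} depends on undefined node {dep!r}"

    # Acyclicity: prepare() raises CycleError if the graph has a cycle.
    graph = {node_id: node.get("deps", []) for node_id, node in plan.items()}
    try:
        TopologicalSorter(graph).prepare()
    except CycleError as exc:
        return None, f"plan contains a cycle: {exc.args[1]}"
    return plan, None


def plan_with_repair(planner, query, legal_tools, max_repairs=3):
    """Ask the planner for a plan; on failure, show it the error and retry."""
    prompt = query
    for _ in range(max_repairs + 1):
        plan, error = validate_plan(planner(prompt), legal_tools)
        if error is None:
            return plan
        # Repair prompt: original query plus the concrete validation error.
        prompt = (
            f"{query}\n\nYour previous plan was rejected: {error}\n"
            "Return a corrected plan."
        )
    return None  # repair budget exhausted
```

Returning the first failing check as a plain-text message keeps the repair prompt short and specific; whether SPIN reports one error or all of them per iteration is not stated in this summary.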

Stage 2 — Prefix-stop execution. Instead of executing the full DAG and answering at the end, SPIN evaluates DAG prefixes incrementally. After each layer of executed nodes, SPIN asks whether the current state already answers the query. If yes, execution stops. This is what produces the 1061 → 623 executed-task reduction.
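A layer-by-layer executor with an early-stop check might look like the sketch below. The function name run_with_prefix_stop and the run_node and is_sufficient callbacks are assumptions made for illustration; in SPIN the sufficiency check is presumably an LLM judgment, which is replaced here by an arbitrary callable.

```python
from graphlib import TopologicalSorter


def run_with_prefix_stop(plan, run_node, is_sufficient):
    """Execute a DAG one layer at a time, stopping once the prefix suffices.

    plan:          {node_id: [ids of nodes it depends on]}
    run_node:      callable(node_id, results) -> output of that node
    is_sufficient: callable(results) -> True if the query is already answered
    """
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    results = {}
    while sorter.is_active():
        # All nodes whose dependencies have completed form the next layer.
        layer = sorter.get_ready()
        for node_id in layer:
            results[node_id] = run_node(node_id, results)
            sorter.done(node_id)
        # Prefix check: skip the remaining layers if we already have the answer.
        if is_sufficient(results):
            break
    return results
```

For a chain a → b → c where the answer is available once b has run, the call executes a and b and never touches c, which is the kind of saving behind the executed-task reduction.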

The combined effect on AssetOpsBench: shorter, structurally valid plans that stop as soon as they have enough. The Accomplished improvement (0.638 → 0.706) suggests that prefix-stop is not just cutting cost: by not over-executing, it also appears to select better workflows.
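To put the reported AssetOpsBench numbers on a common scale, the relative changes can be computed directly. The snippet uses only the before/after figures quoted in this note; the helper name relative_change is illustrative.

```python
def relative_change(before, after):
    """Fractional change from before to after (negative means a reduction)."""
    return (after - before) / before


# Before/after values as reported for AssetOpsBench in this note.
metrics = {
    "executed tasks": (1061, 623),
    "Accomplished": (0.638, 0.706),
    "tool calls per run": (11.81, 6.82),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_change(before, after):+.1%}")
# executed tasks: -41.3%
# Accomplished: +10.7%
# tool calls per run: -42.3%
```

Both cost measures fall by roughly 41 to 42 percent, while Accomplished rises by about 11 percent in relative terms (6.8 points absolute).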

Open problems / Research angle

SPIN under WildClawBench. WildClawBench's native runtime is the cleanest way to measure how much of SPIN's gain is harness-specific. Falsifiable: a 60-day run of SPIN-wrapped models on WildClawBench, with reported Accomplished or equivalent.
DAG repair quality. The paper does not report what fraction of plans need repair, how many repair iterations are typical, or what plans the repair loop fails on. The repair loop is the load-bearing part of the wrapper; ablation would be informative.
Prefix-stop reliability. Stopping early can be wrong. SPIN reports improved Accomplished, but the failure mode (stopping before the actual answer is reached) is not characterized. A confusion matrix of stop decisions against task outcomes would close the loop.
Composition with Orchard / SDAR. Orchard ships training-side infrastructure; SDAR ships a training-side recipe. SPIN is the deployment-side wrapper. The natural composition is to train with Orchard + SDAR and deploy with SPIN. None of the three papers evaluates this composition.
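For the prefix-stop reliability question above, one possible shape for the missing confusion matrix is a tally of stop decision against task outcome. The record fields stopped_early and accomplished, the function name, and the four toy records are all hypothetical; they are placeholders showing the bookkeeping, not results from the paper.

```python
from collections import Counter


def stop_confusion(records):
    """Tally (stopped early?, accomplished?) over per-scenario run records."""
    counts = Counter((r["stopped_early"], r["accomplished"]) for r in records)
    return {
        "early_stop_accomplished": counts[(True, True)],
        "early_stop_failed": counts[(True, False)],  # candidate premature stops
        "full_run_accomplished": counts[(False, True)],
        "full_run_failed": counts[(False, False)],
    }


# Toy records for illustration only; not data from the paper.
toy_records = [
    {"stopped_early": True, "accomplished": True},
    {"stopped_early": True, "accomplished": False},
    {"stopped_early": False, "accomplished": True},
    {"stopped_early": True, "accomplished": True},
]
print(stop_confusion(toy_records))
```

The early_stop_failed cell only bounds the premature-stop rate from above, since a run can stop early and still fail for unrelated reasons; separating the two would need a full-execution rerun of those scenarios.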

Concept tags

agent-planning · dag-validation · prefix-stop · tool-cost-reduction · assetopsbench