STOP: Super Token for Path Pruning in Parallel Reasoning
TL;DR
Parallel reasoning in Large Reasoning Models (LRMs) wastes compute on paths that go wrong early, because early errors compound over the rest of the path. STOP (Super TOken for Pruning) adds a learnable special token at the prefix level that is trained to predict whether a reasoning path is worth continuing. At a fixed compute budget, it lifts GPT-OSS-20B accuracy on AIME25 from 84% to ~90%. The approach is reported to work across models from 1.5B to 20B parameters.
Key Findings
The taxonomy — first systematic framework for path pruning:
Method        │ Internal signal │ External signal
──────────────┼─────────────────┼────────────────
Non-learnable │ entropy         │ verifier
Learnable     │ STOP ★          │ trained judge
Prior work is scattered across the other three quadrants; learnable internal methods (bottom-left) were unexplored before STOP. The insight: internal signals (from the model itself, not an external oracle) are cheaper to obtain, and learnable signals generalize better than hand-crafted heuristics.
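For contrast with the learnable quadrant, the non-learnable internal baseline (entropy) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the choice of mean per-token entropy as the score are assumptions, and the distributions are toy values.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_prefix_entropy(step_distributions):
    """Average per-token entropy over a prefix; higher means less confident.

    Assumption: averaging is one plausible aggregation, chosen for illustration.
    """
    if not step_distributions:
        raise ValueError("prefix must contain at least one step")
    total = sum(token_entropy(d) for d in step_distributions)
    return total / len(step_distributions)

# Toy next-token distributions for two prefixes (made-up numbers).
confident = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
uncertain = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]

scores = {
    "confident": mean_prefix_entropy(confident),
    "uncertain": mean_prefix_entropy(uncertain),
}
```

A heuristic like this needs no training, which is what places it in the non-learnable row; the taxonomy's claim is that a learned signal generalizes better than such hand-crafted scores.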
STOP mechanism:
- A learned super-token is prepended to each parallel path at the prefix level
- The token learns to represent "should this path continue?" based on the prefix so far
- Pruning happens before the path is fully generated — saving computation proportional to the path's remaining length
- "Prefix-level" matters: early pruning saves more compute than late pruning
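The gating step described in the list above can be sketched as follows. This is a hedged illustration under stated assumptions: `score_fn` is a hypothetical stand-in for whatever score the super-token produces, and keeping a fixed top fraction of paths is one possible policy, not necessarily the one used in the paper.

```python
import math
from typing import Callable, List, Sequence

def prune_paths(
    prefixes: Sequence[str],
    score_fn: Callable[[str], float],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return indices of the prefixes worth continuing.

    score_fn is a hypothetical placeholder for the learned
    "should this path continue?" score; higher means more promising.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not prefixes:
        return []
    n_keep = math.ceil(keep_ratio * len(prefixes))
    scores = [score_fn(p) for p in prefixes]
    # Rank path indices by score, best first, then keep the top n_keep.
    ranked = sorted(range(len(prefixes)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy scores standing in for the model's output (made-up numbers).
toy_scores = {"path-a": 0.9, "path-b": 0.2, "path-c": 0.6, "path-d": 0.1}
prefixes = list(toy_scores)
kept = prune_paths(prefixes, lambda p: toy_scores[p], keep_ratio=0.5)
```

Only the paths in `kept` would be decoded to completion; the others stop at the prefix, which is where the compute saving comes from.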
Scalability validation: Results are reported across LRMs from 1.5B to 20B parameters. The AIME25 gain of roughly 6 percentage points (84% → ~90% for GPT-OSS-20B) holds at a fixed compute budget, meaning STOP doesn't just reduce cost at fixed quality; it improves quality at fixed cost.
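The fixed-budget framing can be made concrete with a small token-accounting sketch. The cost model is an assumption for illustration: every started path pays for its prefix, only kept paths pay for the remainder, and the cost of computing the pruning score is ignored. The numbers are made up and are not from the paper.

```python
import math

def paths_under_budget(budget, full_len, prefix_len, keep_ratio):
    """Return (started, kept) path counts that fit in a token budget.

    Assumed cost model: started * prefix_len + kept * (full_len - prefix_len),
    with kept = ceil(keep_ratio * started).
    """
    if not 0 < prefix_len <= full_len:
        raise ValueError("need 0 < prefix_len <= full_len")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    started = 0
    while True:
        n = started + 1
        kept = math.ceil(keep_ratio * n)
        cost = n * prefix_len + kept * (full_len - prefix_len)
        if cost > budget:
            break
        started = n
    kept = math.ceil(keep_ratio * started) if started else 0
    return started, kept

# Illustrative numbers: 64k-token budget, 8k-token paths, 1k-token prefixes.
no_pruning = paths_under_budget(64_000, 8_000, 1_000, keep_ratio=1.0)
with_pruning = paths_under_budget(64_000, 8_000, 1_000, keep_ratio=0.25)
```

Under these assumptions the same budget explores 22 prefixes instead of 8 and completes the 6 highest-scoring ones, so any quality gain depends on the pruning score selecting well among the wider candidate set.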
Empirical guidelines distilled: The paper includes deployment guidance on (1) when path pruning helps most (when a large share of paths go wrong early), (2) how to set the pruning threshold, and (3) how to combine STOP with external verifiers.
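One generic way to set a pruning threshold is to calibrate it on held-out paths with known outcomes. The sketch below is not the paper's guidance; it assumes a rule of "prune when score < threshold" and a hypothetical tolerance `max_false_prune` on the fraction of correct paths that may be discarded. All data are toy values.

```python
def calibrate_threshold(scores, labels, max_false_prune=0.05):
    """Pick the largest threshold that prunes at most a set share of correct paths.

    scores: pruning scores for held-out prefixes (higher = more promising).
    labels: True if the completed path was correct.
    Rule assumed here: a path is pruned when score < threshold.
    """
    correct = sorted(s for s, ok in zip(scores, labels) if ok)
    if not correct:
        raise ValueError("need at least one correct path to calibrate")
    # Number of correct paths we are allowed to lose.
    allowed = int(max_false_prune * len(correct))
    return correct[allowed]

# Toy held-out set (made-up numbers).
val_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
val_labels = [False, False, True, False, True, True, True, True, True, True]

threshold = calibrate_threshold(val_scores, val_labels, max_false_prune=0.2)
pruned = [s for s in val_scores if s < threshold]
lost_correct = [s for s, ok in zip(val_scores, val_labels) if ok and s < threshold]
```

On this toy set the threshold prunes four of ten paths while losing one of the seven correct ones, which stays within the 20% tolerance.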
Connection to the Broader Selective-Compute Pattern
STOP is one of five consecutive papers in the wiki (since 04-16) that instantiate the same core insight: identify and discard uninformative computation.
- TIP (04-16): discard uninformative tokens in distillation (10% signal tokens)
- LongAct (04-18): discard uninformative gradient positions (sparse RL on high-saliency)
- TESSY (04-18): discard stylistically-divergent teacher tokens
- STOP (04-20): discard futile reasoning paths at the prefix (before spending compute)
- AVR (04-20): discard unnecessary reasoning format (pick perception-only or direct-answer)
The granularity is shifting: TIP operates at the token level, STOP at the path level, AVR at the reasoning format level. These are not competing methods — they can stack.
Relations to Prior Wiki Pages
- AIMO 3 / model-capability-dominates (04-17): AIMO 3 showed that prompt-level diversity can't close the pass@20 gap. STOP doesn't try to close the diversity gap — it makes each parallel path more cost-efficient. This is orthogonal to but composable with capability improvements.
- VGF (04-19): VGF offered targeted particle refinement as an alternative to wider sampling. STOP is a middle path — sample widely, but prune failing paths early. Together they bracket the design space: refine-one (VGF) vs. prune-many (STOP).
- Knowledge Distillation: STOP could be applied to distillation: prune candidate rollouts that are clearly off-track before the student completes them.
Open Questions
- Does STOP generalize to open-ended reasoning (code, math proofs) or only constrained tasks with identifiable early failure signals?
- Can STOP be combined with VGF's particle-transport approach? (Transport only paths that pass STOP's gate)
- What does the super-token learn? Is it identifying syntactic errors, semantic inconsistencies, or something more abstract?
Raw Source
→ raw/huggingface/2026-04-20-cut-your-losses-learning-to-prune-paths-early-for-efficient.md