TIP: Token Importance in On-Policy Distillation

TL;DR: TIP identifies two regions of high learning signal in on-policy knowledge distillation: high-entropy tokens (the student is uncertain) and low-entropy, high-divergence tokens (the student is overconfident but wrong). Entropy-based selection that retains 50% of tokens matches all-token training while cutting peak memory by 47%; targeting the overconfident tokens gives near-full performance with <10% of tokens.

Key Findings

Two-axis taxonomy (TIP): student entropy × teacher-student divergence. High learning signal comes from two regions of this plane.
High-entropy tokens: the student is uncertain, so the learning signal is obvious.
Low-entropy, high-divergence tokens: the student is overconfident and wrong. These tokens carry dense corrective signal but are nearly invisible to entropy-only rules.
Retaining 50% of tokens via entropy-based sampling matches all-token training and reduces peak memory by 47%.
Training on <10% of tokens, chosen to target overconfident tokens, nearly matches full-token baselines.
Validated with Qwen3, Llama, and Qwen2.5 models on MATH-500 and AIME 2024/2025, and on DeepPlanning for agentic long-horizon tasks.
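
The two-axis taxonomy can be illustrated with a minimal sketch. This is not the paper's selection rule, which this summary does not specify. It assumes that divergence is measured as KL(student || teacher) and that each region is picked by a fixed threshold. The function names and the thresholds `h_thresh` and `d_thresh` are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero in q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def select_tokens(student, teacher, h_thresh, d_thresh):
    """Return (position, region) pairs for tokens in either high-signal region.

    student, teacher: per-position next-token distributions (lists of lists).
    h_thresh, d_thresh: hypothetical thresholds, not values from the paper.
    """
    selected = []
    for i, (ps, pt) in enumerate(zip(student, teacher)):
        if entropy(ps) >= h_thresh:
            # Region 1: the student is uncertain.
            selected.append((i, "high_entropy"))
        elif kl(ps, pt) >= d_thresh:
            # Region 2: the student is confident but disagrees with the teacher.
            selected.append((i, "overconfident"))
    return selected

# Toy example: three positions over a 4-token vocabulary.
student = [
    [0.25, 0.25, 0.25, 0.25],  # uncertain student
    [0.97, 0.01, 0.01, 0.01],  # confident, teacher agrees
    [0.97, 0.01, 0.01, 0.01],  # confident, teacher disagrees
]
teacher = [
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
]
picked = select_tokens(student, teacher, h_thresh=1.0, d_thresh=1.0)
print(picked)
```

In this toy case an entropy-only rule would keep position 0 and skip position 2, because the student's entropy there is as low as at position 1. The divergence axis is what separates the confident-and-correct token from the confident-and-wrong one.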

Knowledge Distillation

Raw source: ../../raw/huggingface/2026-04-16-tip-token-importance-in-on-policy-distillation.md

Related Pages