ZEDA: Post-Trained MoE Can Skip Half Experts via Self-Distillation
arXiv: 2605.18643 · HF: paper page · Tier: 1 (MoE, dynamic routing, post-training compression)
TL;DR
Existing dynamic MoE methods (input-dependent expert activation) usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained static MoE underexplored. ZEDA (Zero-Expert Self-Distillation Adaptation) is a low-cost framework that converts post-trained static MoE models into dynamic ones. It injects parameter-free zero-output experts into each MoE layer and adapts via two-stage self-distillation, using the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks (math, code, instruction following), ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss, outperforms the strongest dynamic-MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20x end-to-end inference speedup.
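As a rough illustration of the zero-expert idea described above, the sketch below runs ordinary top-K routing over an expert set augmented with parameter-free zero-output experts. It is not the paper's implementation: the function name `moe_forward_with_zero_experts`, the weight layout (real experts first, zero experts last), the number of zero experts, and the choice not to renormalise the top-K weights are all assumptions made for the example.

```python
import numpy as np


def moe_forward_with_zero_experts(x, w_router, experts, n_zero, k):
    """Toy MoE layer whose router may select parameter-free zero experts.

    x:        (tokens, d_model) activations
    w_router: (d_model, n_real + n_zero) router weights
              (hypothetical layout: real experts first, zero experts last)
    experts:  list of n_real callables mapping (m, d_model) -> (m, d_model)
    Returns the layer output and the number of real experts run per token.
    """
    n_real = len(experts)
    assert w_router.shape[1] == n_real + n_zero

    # Softmax over the augmented expert set (real + zero experts).
    logits = x @ w_router
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Ordinary top-k selection; some of the k slots may be zero experts.
    top_idx = np.argsort(-probs, axis=-1)[:, :k]
    top_w = np.take_along_axis(probs, top_idx, axis=-1)

    out = np.zeros_like(x, dtype=float)
    real_per_token = np.zeros(x.shape[0], dtype=int)
    for e in range(n_real):
        tok, slot = np.nonzero(top_idx == e)  # tokens that picked expert e
        if tok.size == 0:
            continue
        out[tok] += top_w[tok, slot][:, None] * experts[e](x[tok])
        real_per_token[tok] += 1

    # Slots routed to a zero expert add nothing to `out` and run no expert,
    # so the number of real experts per token varies between 0 and k.
    return out, real_per_token


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, n_real, n_zero, k = 8, 4, 2, 2
    weights = [rng.normal(size=(d_model, d_model)) for _ in range(n_real)]
    experts = [lambda h, W=W: h @ W for W in weights]
    x = rng.normal(size=(5, d_model))
    w_router = rng.normal(size=(d_model, n_real + n_zero))
    _, real_per_token = moe_forward_with_zero_experts(x, w_router, experts, n_zero, k)
    print("real experts run per token:", real_per_token)
```

The point of the design is that skipping needs no new routing mechanism: the router stays a plain top-K softmax, and compute drops only because some selected slots have nothing to execute.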
Key findings
- A static MoE activates a fixed top-K set of experts for every token. A dynamic MoE (variable number of active experts per token) reduces compute on easy tokens by letting them skip experts entirely. Prior dynamic-MoE methods usually require pre-training from scratch or task-specific adaptation, so they do not transfer directly to frontier post-trained static MoEs.
- ZEDA solves the architectural conversion by injecting parameter-free "zero experts" (experts that always output zero) into each MoE layer. The augmented model has K+1 routing slots per token, where the zero expert is the explicit option to skip computation.
- Two-stage self-distillation stabilises the conversion. The original frozen static MoE is the teacher. A group-level balancing loss prevents the router from collapsing all tokens onto the zero experts.
- On Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic-MoE baseline by 6.1 and 4.0 points respectively.
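- End-to-end inference speedup is ~1.20x. The gap between the FLOP reduction (over 50%) and the wall-clock gain (1.20x) has two sources: the reduction applies to expert FLOPs only, so non-expert compute such as attention is unchanged, and real hardware still pays the standard MoE overhead from routing logic, expert dispatch, and load balancing.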
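The relation between the expert-FLOP reduction and the end-to-end speedup in the last finding can be checked with an Amdahl-style estimate. This is a back-of-the-envelope sketch, not an analysis from the paper: it assumes that removed expert FLOPs translate one-to-one into removed expert runtime and that all other runtime is unchanged. Both function names are hypothetical.

```python
def end_to_end_speedup(expert_time_fraction, expert_flops_removed):
    """Amdahl-style estimate: only the expert share of runtime shrinks.

    expert_time_fraction: share of baseline wall-clock spent in expert FFNs
    expert_flops_removed: fraction of expert FLOPs eliminated (0.5 for 50%)
    """
    if not 0.0 <= expert_time_fraction <= 1.0:
        raise ValueError("expert_time_fraction must be in [0, 1]")
    if not 0.0 <= expert_flops_removed < 1.0:
        raise ValueError("expert_flops_removed must be in [0, 1)")
    remaining = 1.0 - expert_time_fraction * expert_flops_removed
    return 1.0 / remaining


def implied_expert_time_fraction(speedup, expert_flops_removed):
    """Invert the estimate: the expert runtime share that explains a speedup."""
    return (1.0 - 1.0 / speedup) / expert_flops_removed


if __name__ == "__main__":
    print(implied_expert_time_fraction(1.20, 0.5))
    print(end_to_end_speedup(1.0, 0.5))
```

Under these idealised assumptions, a 1.20x speedup from a 50% expert-FLOP cut corresponds to experts taking about one third of baseline wall-clock time, and even a model whose runtime was entirely expert compute would top out at 2x.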
Relationship to prior wiki entries
ZEDA is the fourth distinct MoE compression / routing direction the wiki has tracked in the last two weeks. The four directions are now:
- Pre-training scaling (MoE-muP, 2026-05-17, Vankadara et al., Kurate cs.LG #13). Closed-form Maximally Scale-Stable Parameterization across the five MoE axes M, Ne, K, N, L. Answers: how to scale a new MoE.
- Per-token activation (BEAM, 2026-05-16). Binary expert-activation masks trained end-to-end via straight-through estimator. 98%+ retention at 85% FLOP reduction. Answers: which experts to run for this token.
- Post-training resident-count compression (HodgeCover, 2026-05-18). Harmonic-kernel obstruction in the simplicial Laplacian on the expert 2-complex. Answers: which experts to keep at aggressive compression.
- Post-training static-to-dynamic conversion (ZEDA, today). Zero experts plus two-stage self-distillation. Answers: how to skip experts per token without retraining from scratch.
ZEDA and BEAM both decide which experts run per token, but they are not redundant. BEAM is a from-scratch training intervention; ZEDA is a post-training conversion of a frozen MoE. Frontier labs that have already shipped Gemma 4, DeepSeek V4, Kimi K2.6, and Qwen3.5 do not retrain those models from scratch. ZEDA is the post-hoc retrofit path BEAM cannot offer.
The composition the wiki has been pointing toward is now closer: a future MoE pre-trained under MoE-muP MSSP, compressed post-training via HodgeCover (resident experts), then dynamically routed via ZEDA (active experts per token) on top of BEAM-style per-token activation. The full stack would replace three independent ad-hoc choices in frontier MoE serving (how to scale, which experts to keep resident, which experts to run per token) with principled methods.
Why it matters
The post-training conversion path is the deployable one. Frontier open MoE releases ship as static top-K models. Every operator running them in production wants the dynamic-MoE win without paying for a from-scratch retrain. This is the first wiki entry demonstrating that the conversion is cheap enough (two-stage self-distillation, no external teacher, no task data) to be applied as a routine post-release step. Eliminating over 50% of expert FLOPs at marginal accuracy loss on two distinct frontier open MoEs is the bar that makes this a candidate operational practice rather than a research curiosity.
Research angle
- ZEDA on the full open-MoE wave. Apply to Gemma 4 26B-A4B, DeepSeek V4 Flash, Kimi K2.6, MiMo-V2.5-Pro. If FLOP-reduction-at-marginal-loss holds on three of four, the conversion is robust to MoE design choice. If it collapses on one specific architecture, the failure mode identifies a fragility in that MoE's expert specialisation.
- Compose with HodgeCover. HodgeCover removes experts at the layer level. ZEDA skips experts at the token level. The composition's diagnostic: do the two methods select overlapping experts to suppress, or orthogonal ones? If overlapping, HodgeCover's chosen-to-remove experts are also the ones ZEDA's zero-expert is replacing, and the two are partially redundant. If orthogonal, they are complementary and the stack is multiplicative.
- Zero-expert as an interpretability signal. When the router prefers the zero expert, the model is implicitly saying that this token does not need this layer's expert computation. The distribution of zero-expert routing across token types is therefore an interpretability signal. The wiki has no prior entry where the routing distribution carried interpretability content directly.
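The HodgeCover composition diagnostic in the list above (overlapping versus orthogonal expert selection) could be measured per layer roughly as follows. This is a hypothetical sketch: it defines the experts ZEDA "displaces" as those whose activation count drops most, in relative terms, between the static teacher and the converted model on the same tokens, and compares that set with HodgeCover's removal set using Jaccard overlap. Neither the definition nor the function names come from either paper.

```python
import numpy as np


def zeda_displaced_experts(teacher_counts, student_counts, top_m):
    """Return the top_m experts with the largest relative activation drop.

    teacher_counts / student_counts: per-expert activation counts for one
    layer, measured on the same tokens before and after conversion.
    """
    teacher = np.asarray(teacher_counts, dtype=float)
    student = np.asarray(student_counts, dtype=float)
    # Relative drop; experts the teacher never used are treated as no drop.
    safe_teacher = np.where(teacher > 0, teacher, 1.0)
    drop = np.where(teacher > 0, 1.0 - student / safe_teacher, 0.0)
    return {int(i) for i in np.argsort(-drop)[:top_m]}


def jaccard(a, b):
    """Jaccard overlap of two expert-index sets (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Made-up counts for a four-expert layer, purely for illustration.
    teacher_counts = [100, 80, 60, 40]
    student_counts = [90, 20, 55, 5]
    hodgecover_removed = {1, 2}  # hypothetical removal set
    displaced = zeda_displaced_experts(teacher_counts, student_counts, top_m=2)
    print(displaced, jaccard(displaced, hodgecover_removed))
```

An overlap near 1 would point to the partially redundant case, and an overlap near 0 to the complementary case where the two reductions multiply.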
Source
raw/huggingface/2026-05-19-post-trained-moe-can-skip-half-experts-via-self-distillation.md