EMO: Pretraining Mixture of Experts for Emergent Modularity

TL;DR

LLMs are deployed as monolithic blobs even when the application needs only a slice of capability. EMO pretrains an MoE so that experts naturally cluster by domain without human-defined priors. The trick is at the document level: tokens within a document share an expert pool, while different documents may use different pools. At 1B active / 14B total parameters trained on 1T tokens, EMO can keep only 25% of its experts at roughly a 1% absolute drop, or 12.5% at roughly a 3% drop. Standard MoEs break completely under the same regime. From Allen Institute for AI and UC Berkeley.

Why this matters

The monolith-deployment problem is a real constraint. A code agent does not need the entire 14B. A medical Q&A agent does not need the code-generation slice. Standard MoE in principle gives you sparse experts. In practice, expert specialization has been measured as token-level (punctuation, prepositions, lexical patterns) rather than domain-level (medicine, code, math). EMO is the first MoE pretraining recipe to produce domain-level specialization that survives subset deployment.

Mechanism

Standard MoE pretraining:
  every token routes independently
  → experts specialize on lexical surface
  → drop 75% of experts → catastrophic loss

EMO pretraining:
  document boundary defines a pool
  every token in document D draws from pool(D) ⊂ all experts
  pool(D) for different documents is allowed to differ
  → tokens that share a domain share experts
  → drop 75% of experts (keep top 25%) → ~1% loss

The architectural change is small: a per-document expert-pool selector that gates which experts are visible to the router. The pretraining loss is unchanged. Modularity is emergent, not forced. The bet is that documents themselves carry domain coherence, so document-level expert sharing produces domain-level expert clustering as a byproduct.
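The gating described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the source does not say how the pool selector chooses pool(D), so `select_pool` (average the router logits over the document, keep the highest-scoring experts) is a hypothetical stand-in, and the names `top_k_route`, `route_document`, `k`, and `pool_size` are invented for this example.

```python
import math
import random

def top_k_route(logits, k, visible=None):
    """Pick the top-k experts for one token, restricted to `visible` if given."""
    candidates = range(len(logits)) if visible is None else sorted(visible)
    chosen = sorted(candidates, key=lambda e: logits[e], reverse=True)[:k]
    # Softmax over the chosen experts only, shifted by the max for stability.
    m = max(logits[e] for e in chosen)
    exps = [math.exp(logits[e] - m) for e in chosen]
    total = sum(exps)
    return chosen, [x / total for x in exps]

def select_pool(doc_token_logits, pool_size):
    """ASSUMPTION: pool(D) = experts with the highest mean router logit over D."""
    n_experts = len(doc_token_logits[0])
    n_tokens = len(doc_token_logits)
    mean = [sum(tok[e] for tok in doc_token_logits) / n_tokens
            for e in range(n_experts)]
    ranked = sorted(range(n_experts), key=lambda e: mean[e], reverse=True)
    return set(ranked[:pool_size])

def route_document(doc_token_logits, k, pool_size):
    """Route every token of one document through that document's shared pool."""
    pool = select_pool(doc_token_logits, pool_size)
    return pool, [top_k_route(tok, k, visible=pool) for tok in doc_token_logits]

# Toy document: 10 tokens, 16 experts, random router logits.
random.seed(0)
doc = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(10)]
pool, routes = route_document(doc, k=2, pool_size=4)
print(sorted(pool))   # the 4 experts visible to this document
print(routes[0])      # (expert ids, weights) for the first token
```

Passing `visible=None` recovers the standard per-token routing from the first schematic, which makes the difference between the two regimes a single argument.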

Connections to prior wiki

Cross-paper convergence on the same day. EMO and UniPool (2605.06665) both attack standard MoE's per-layer expert ownership but from opposite directions. UniPool generalizes: pool the experts globally across layers. EMO restricts: pool the experts within a document. Same failure mode (lexical-not-domain specialization, subset-deploy collapse), two complementary fixes. Both ship empirical wins. This is the second architectural-modularity day in two weeks (Nemotron-3 Super on 04-21 was the first), and the first that explicitly targets deployment-time slicing.

Connection to MedSkillAudit (05-07) and SkillRepo work. EMO's 25%-retention number is the missing primitive that makes domain-specific agent skills cheap to deploy. Today's three skill-curation papers (StraTA, Skill1, SkillOS) all assume the underlying LLM is monolithic. EMO is the architecture that lets the skill layer actually slice the model.

Refines a prior open question. The MoE concept page (inference-efficiency/) has flagged the gap between MoE's theoretical sparsity and practical deployability. Standard MoE gives sparse activation but not sparse deployment. EMO ships the first deployment-friendly sparsity, with a hard number: 12.5% retention at 3% loss.
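To make "sparse deployment" concrete, here is one way an expert subset might be chosen for a target domain. This is a sketch under stated assumptions: the source only says the top experts are kept, so ranking experts by how often the router picks them on a domain sample is a hypothetical criterion, and `select_experts`, `keep_fraction`, and the toy trace are illustrative names and data, not anything from the paper.

```python
import math
from collections import Counter

def select_experts(routed_expert_ids, n_experts, keep_fraction):
    """Keep the most frequently routed experts on a domain sample.

    ASSUMPTION: routing frequency is the ranking criterion; ties are
    broken by the lower expert id so the result is deterministic.
    """
    n_keep = max(1, math.ceil(n_experts * keep_fraction))
    counts = Counter(routed_expert_ids)
    ranked = sorted(range(n_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:n_keep])

# Toy routing trace: expert ids chosen by the router on in-domain tokens.
trace = [3, 3, 5, 3, 1, 5, 3, 7, 5, 1]
kept_25 = select_experts(trace, n_experts=8, keep_fraction=0.25)
kept_12 = select_experts(trace, n_experts=8, keep_fraction=0.125)
print(kept_25)  # [3, 5]
print(kept_12)  # [3]
```

Experts outside the returned list would simply not be loaded at serving time. Whether that costs 1% or everything depends on how well the kept experts cover the domain, which is the property EMO claims to train for.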

Research angle

What does the expert clustering actually look like? Allen AI typically open-sources their pretraining and data, so the cluster structure should be inspectable. The interesting empirical question is whether the emergent clusters track human-recognizable domains (code, math, biomedical) or some lower-level semantic axis. If they track domains, deploying EMO reduces to cherry-picking the experts for the target domain. If they track something orthogonal, the deployment story becomes harder.
Composition with Direct Corpus Interaction (DCI, 2605.05242, also today). DCI replaces the retrieval pipeline with grep on raw corpus. EMO replaces the model deployment with an expert subset. Both move complexity out of dense modules into sparse interfaces. The pair is a candidate for a single deployment stack.
Where is the cliff? EMO drops 1% at 25% retention and 3% at 12.5%. Does performance collapse at 5%? At 1%? The slope of the retention curve determines whether this is a 4x compression story or a 100x compression story.
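The arithmetic behind the 4x and 100x figures is just the reciprocal of the retained fraction. The sketch below works it out for the two reported retention levels and the two hypothetical ones raised in the question. It counts experts only: the source gives no split between expert and shared parameters, so this is an upper bound on whole-model compression, and `expert_compression` is a name made up for this example.

```python
def expert_compression(keep_fraction):
    """Reduction factor in expert count when keeping `keep_fraction` of experts."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    return 1.0 / keep_fraction

# 25% and 12.5% are the reported retention levels; 5% and 1% are the
# untested levels the question above asks about.
for frac in (0.25, 0.125, 0.05, 0.01):
    factor = expert_compression(frac)
    print(f"keep {frac:.1%} of experts -> {factor:.0f}x fewer experts")
```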

Source

Paper: https://arxiv.org/abs/2605.06663
HuggingFace blog: https://huggingface.co/blog/allenai/emo
HuggingFace paper page: https://huggingface.co/papers/2605.06663
Raw: raw/huggingface/2026-05-09-emo-pretraining-mixture-of-experts-for-emergent-modularity.md