Token Superposition Training (TST): Efficient Pre-Training with Token Superposition

Date: 2026-05-13
Source: arXiv 2605.06546 · HuggingFace Daily Papers
Tier: 1. Pre-training efficiency, asymmetric-training-identical-inference pattern
Raw: raw/huggingface/2026-05-13-efficient-pre-training-with-token-superposition.md

TL;DR

Standard pre-training assigns one token to each sequence position and trains with next-token prediction (NTP). TST argues this is wasteful in the early phase of training, when the model is still learning crude statistics. In the superposition phase, contiguous tokens are bundled into a single position and the model predicts the whole bag via a multi-hot cross-entropy (MCE) objective. A recovery phase then reverts to standard NTP. TST requires no changes to parallelism, optimizer, tokenizer, data, or architecture. It is validated at 270M, 600M, 3B, and 10B-A1B MoE scales. At matched loss, TST cuts total pre-training time by up to 2.5x at the 10B-A1B scale. The deployed model has the same architecture and inference path as a standard NTP-trained model.

Why it matters

This is the cleanest example yet of the "asymmetric training, identical inference" pattern that has been emerging across the wiki. The training-time structure (token bags, multi-hot prediction) is removed at inference, so the cost is paid once during pre-training and never again. With a training-time reduction of up to 2.5x at 10B-A1B, this is the kind of result that any lab without competing incentives is likely to adopt.

Mechanism

The superposition phase combines K contiguous tokens into one position by summing their embeddings (or a related aggregation). The model's prediction at that position is scored against a multi-hot vector over the K target tokens using multi-hot cross-entropy. This raises data throughput per FLOP: each position carries K tokens, so a forward pass over the same number of positions consumes and predicts K times as many tokens. The recovery phase switches back to standard NTP for the last fraction of training, which re-aligns the model with deployment.
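
The two operations in the superposition phase can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the function names (`bundle_tokens`, `superpose`, `multi_hot_cross_entropy`) are hypothetical, aggregation is a plain sum of embeddings, and the multi-hot target is normalized to sum to 1 (each of the K targets gets weight 1/K, with repeats counted). The paper may use a different aggregation or normalization.

```python
import math
from collections import Counter


def bundle_tokens(token_ids, k):
    """Group contiguous tokens into bags of size k (a ragged tail is dropped)."""
    usable = len(token_ids) - len(token_ids) % k
    return [token_ids[i:i + k] for i in range(0, usable, k)]


def superpose(bag, embedding_table):
    """Sum the embeddings of the tokens in one bag into a single vector.

    Assumption: plain summation; the source also allows a related aggregation.
    """
    dim = len(embedding_table[0])
    out = [0.0] * dim
    for tok in bag:
        for d in range(dim):
            out[d] += embedding_table[tok][d]
    return out


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def multi_hot_cross_entropy(logits, target_bag):
    """Cross-entropy between softmax(logits) and a normalized multi-hot target.

    Assumption: each target token is weighted by count / K so the target
    sums to 1. With K = 1 this is ordinary next-token cross-entropy.
    """
    logp = log_softmax(logits)
    counts = Counter(target_bag)
    k = len(target_bag)
    return -sum((c / k) * logp[tok] for tok, c in counts.items())
```

In a training step under this sketch, `bags = bundle_tokens(tokens, k)` would shorten the sequence by a factor of k, `superpose` would build the input vector for each of `bags[:-1]`, and the logits at each position would be scored against the following bag in `bags[1:]`. Setting k to 1 recovers the standard NTP setup, which is what the recovery phase returns to.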

The crucial claim: TST consistently beats the baseline on both loss and downstream evaluations at equal compute, not just at equal wall-clock time. This rules out the obvious failure mode in which bag-of-tokens prediction is a weaker objective that merely looks fast.
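
To make the equal-compute comparison concrete, the sketch below counts how many tokens a two-phase run sees under a fixed step budget. Everything here is an illustrative assumption: the function names, the `recovery_fraction` parameter, and the premise that per-step compute is roughly proportional to the number of positions processed. The numbers in the usage note are made up for arithmetic only and are not taken from the paper.

```python
def bag_size_at(step, total_steps, k, recovery_fraction):
    """Return the bag size for a step: k during superposition, 1 during recovery.

    Assumption: the recovery phase occupies the final `recovery_fraction`
    of the step budget.
    """
    switch_step = int(total_steps * (1.0 - recovery_fraction))
    return k if step < switch_step else 1


def tokens_consumed(total_steps, positions_per_step, k, recovery_fraction):
    """Count tokens seen over a run with a fixed number of positions per step.

    Assumption: every step processes the same number of positions, so the
    step count stands in for compute. Each superposition position carries
    k tokens; each recovery position carries one.
    """
    switch_step = int(total_steps * (1.0 - recovery_fraction))
    superposition_tokens = switch_step * positions_per_step * k
    recovery_tokens = (total_steps - switch_step) * positions_per_step
    return superposition_tokens + recovery_tokens
```

For example, with hypothetical values of 1000 steps, 100 positions per step, k = 4, and a recovery fraction of 0.25, `tokens_consumed` gives 325000 tokens, against 100000 for the same budget with k = 1. The claim in the text is that this extra data exposure translates into lower loss at the same compute, rather than being offset by a weaker training signal.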

Relation to prior wiki

Lighthouse Attention (2026-05-12 retweet): trains with a removable subquadratic wrapper, deploys without it. Same asymmetric-training pattern.
MDN (2026-05-11) and UniPrefill (2026-05-11): hybrid architectures train with one structure, deploy with cheaper inference. Same family.
D-OPSD (2026-05-07): same model as teacher and student under different conditioning. Asymmetric conditioning at training, symmetric at inference.

The pattern is now four papers strong: train the model with a structure the deployed model does not have, in cases where removing that structure does not change the output distribution. TST is the first to apply it to pre-training itself.

Research angle

Three open problems remain. (1) Does the recovery phase need to scale with model size? At 10B-A1B the recovery is small; at frontier scale it may need to be larger. (2) The MCE objective is naturally compatible with mixture-of-experts routing (multiple tokens predicted per position). Whether TST composes with MTP-style multi-token prediction is the cleanest near-term experiment. (3) The chain-of-thought distillation literature uses dense token-level signal. If you pre-train with TST and then distill with TIP, do the two efficiency techniques compose multiplicatively, or does one of them eat the other's gain? This is still open.

Why Tier 1

A training-time reduction of up to 2.5x with no inference cost is the kind of headline that gets reproduced fast. If it holds at frontier scale, it changes pre-training economics for everyone. The asymmetric-training pattern is also becoming a load-bearing architectural primitive across the wiki.