Ken Huang — World Models, Architectures, and the Next Phase of AI
Source: Ken Huang / Agentic AI Substack
Raw: raw/rss/2026-05-03-agentic-ai-world-models-architectures-and-the-next-phase-of-ai.md
URL: https://kenhuangus.substack.com/p/world-models-architectures-and-the
Date: 2026-05-03
Tier: 1/2, architecture survey with routing/efficiency intersections
TL;DR
A long-form survey framed around the LeCun (JEPA) vs Xing (PAN/GLP) March-2026 debate at the Spring School AI For Impact. JEPA: prediction in latent space without a decoder; the encoder must learn to discard what cannot be predicted. PAN/GLP: latent prediction with a generative decoder as a guardrail against representation collapse. The essay maps five "schools" of world modeling — agent-centric (Dreamer line), foundation-world-model (Genie/Cosmos), spatial-intelligence (World Labs/Marble), video-as-simulation (Sora), and abstraction-with-validation (JEPA + GLP) — and walks through transformers, diffusion, state-space, and recurrent architectures as competing substrates.
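The difference between the two objectives can be made concrete with a toy sketch. This is not the PAN or JEPA training code: the linear encoder, predictor, and decoder, the additive `decoder_weight` mixing, and all dimensions are assumptions chosen for illustration. It shows one thing only: a collapsed encoder scores a perfect latent loss, while a decoder term still penalizes it.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def latent_loss(enc, pred, x_t, x_next):
    """JEPA-style term: predict the next latent and compare in latent space."""
    z_pred = matvec(pred, matvec(enc, x_t))
    z_target = matvec(enc, x_next)  # real systems use a stop-gradient/EMA target
    return mse(z_pred, z_target)


def generative_loss(enc, pred, dec, x_t, x_next):
    """Decoder term: decode the predicted latent and compare to the observation."""
    x_pred = matvec(dec, matvec(pred, matvec(enc, x_t)))
    return mse(x_pred, x_next)


def combined_loss(enc, pred, dec, x_t, x_next, decoder_weight):
    """Hypothetical mixing: decoder_weight = 0 leaves only the latent term."""
    return (latent_loss(enc, pred, x_t, x_next)
            + decoder_weight * generative_loss(enc, pred, dec, x_t, x_next))


rng = random.Random(0)
OBS_DIM, LATENT_DIM = 16, 4  # hypothetical toy sizes
enc = random_matrix(LATENT_DIM, OBS_DIM, rng)
pred = random_matrix(LATENT_DIM, LATENT_DIM, rng)
dec = random_matrix(OBS_DIM, LATENT_DIM, rng)
x_t = [rng.gauss(0.0, 1.0) for _ in range(OBS_DIM)]
x_next = [rng.gauss(0.0, 1.0) for _ in range(OBS_DIM)]

# A collapsed encoder maps every observation to the zero latent.
collapsed_enc = [[0.0] * OBS_DIM for _ in range(LATENT_DIM)]
collapsed_latent = latent_loss(collapsed_enc, pred, x_t, x_next)
collapsed_generative = generative_loss(collapsed_enc, pred, dec, x_t, x_next)
print("collapsed encoder, latent loss:", collapsed_latent)
print("collapsed encoder, generative loss:", collapsed_generative)
```

In this sketch, setting `decoder_weight` to zero recovers the latent-only objective. Actual JEPA variants avoid collapse through other means (for example target encoders and regularizers) that the toy omits.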
Key claims
- The debate is rationalism vs empiricism. LeCun's bet: prediction must throw away the unpredictable. Xing's bet: discarded information turns out to matter on downstream tasks; reconstruction is a guardrail.
- PAN's key technical claim: their generative loss is a strict upper bound on latent loss — minimizing the generative loss minimizes the latent loss, but not vice versa. Reconstruction prevents the encoder/predictor from quietly degenerating.
- V-JEPA 2 (2025) is the strongest empirical case for JEPA: pre-trained on ~1M hours of internet video; 77.3 top-1 on Something-Something v2, 39.7 R@5 on Epic-Kitchens-100, and, after alignment with an LLM, 84.0 on PerceptionTest at the 8B scale. V-JEPA 2-AC was post-trained on <62 hours of unlabeled robot video and deployed zero-shot on Franka arms.
- Mamba/SSM is quietly excellent for world models. DRAMA (Mamba-2-based) reached a 105% human-normalized score on Atari100k with a 7M-parameter world model, versus DreamerV3's 200M parameters. S4WM beat transformers and RNNs on long-term memory tasks at lower training cost.
- Hybrids are the empirical default. Examples: Jamba (attention-to-Mamba layer ratio of 1:7) and Nemotron-H (92% Mamba2 layers, 3× LLaMA-3.1 throughput at parity accuracy). For world models specifically, the pattern is autoregressive temporal modeling plus diffusion spatial modeling (Genie 2/3, Astra).
- The LeCun-Xing dispute may be less load-bearing than it appears. A weak-decoder GLP approximates JEPA; a strong-regularizer JEPA approximates a generative model. PAN authors: "GLP strictly subsumes JEPA — turn off the generative function, and you have a JEPA."
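To make the hybrid layer mix above concrete, here is a minimal sketch that builds a layer schedule with one attention layer for every seven Mamba layers, the ratio cited for Jamba. The function names and the position of the attention layer inside each group of eight are hypothetical; real models choose their own placement and interleave other components such as MLP or MoE blocks.

```python
def hybrid_schedule(num_layers, attention_every):
    """Return layer types with one attention layer per `attention_every` layers.

    The attention layer is placed last in each group; this placement is an
    illustrative assumption, not taken from any specific model.
    """
    if attention_every < 1:
        raise ValueError("attention_every must be at least 1")
    return [
        "attention" if (index + 1) % attention_every == 0 else "mamba"
        for index in range(num_layers)
    ]


def attention_fraction(schedule):
    """Share of layers in the schedule that are attention layers."""
    return schedule.count("attention") / len(schedule)


# A 1:7 attention-to-Mamba ratio means one attention layer in every eight.
jamba_like = hybrid_schedule(num_layers=32, attention_every=8)
print(jamba_like[:8])
print("attention layers:", jamba_like.count("attention"))
print("mamba layers:", jamba_like.count("mamba"))
print("attention fraction:", attention_fraction(jamba_like))
```

Only the attention layers in such a schedule carry a KV cache that grows with sequence length, which is why the ratio matters for the efficiency claims.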
Why this matters for cere-bro
This is a Tier 1/2 reference because it gives the architectural map under which routing, KV cache, and compression decisions live. Three intersections:
- TransDreamer's KV-cache constraint is the routing-relevant bottleneck. A transformer state-space model could only imagine from 3 states per replay sample because KV cache makes longer rollouts infeasible. This is the same long-context bottleneck that FlashRT (05-02) attacks for red-teaming and that NeMo-RL speculative decoding (04-30) attacks for RL rollouts. World-model rollouts are another long-context iterative-gradient workload.
- Mamba/SSM as the world-model substrate. If DRAMA-style results (105% human-normalized at roughly 1/30 the parameters of DreamerV3) generalize, the next world models may not be transformers. Composability with the ViPO/Semi-DPO (05-02) preference-data lessons and CoPD (05-01) policy-distillation lessons is uncharted.
- Spatial intelligence vs simulation. World Labs' Marble (Tier 4 normally) is a billion-dollar bet that 3D reconstruction is the foundation, with physics on top. Xing's critique: static 3D ≠ world model. The wiki should track whether World Labs ships physics integration in 2026.
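The KV-cache point in the first intersection comes down to simple arithmetic: attention stores keys and values for every past step, while a state-space layer carries a fixed-size state. The sketch below is a back-of-envelope estimate only. The model shape is hypothetical, one token per imagined step is assumed, and the state-space formula ignores convolution buffers and other implementation details.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, rollout_len, bytes_per_value=2):
    """Keys plus values for every layer, head, and past step (one sequence)."""
    return 2 * num_layers * num_heads * head_dim * rollout_len * bytes_per_value


def recurrent_state_bytes(num_layers, inner_dim, state_dim, bytes_per_value=2):
    """Fixed-size recurrent state per layer; independent of rollout length."""
    return num_layers * inner_dim * state_dim * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
LAYERS, HEADS, HEAD_DIM = 12, 8, 64
INNER_DIM, STATE_DIM = 1024, 16

for steps in (16, 256, 4096):
    kv = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, steps)
    ssm = recurrent_state_bytes(LAYERS, INNER_DIM, STATE_DIM)
    print(f"{steps:>5} steps: KV cache {kv / 2**20:8.2f} MiB, "
          f"recurrent state {ssm / 2**20:6.2f} MiB")
```

The per-sequence numbers are small for this toy shape; the cost compounds when many imagined trajectories are rolled out in parallel from each replay sample.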
Connections to prior wiki pages
- HOPE Nested Learning (04-28) — biological hippocampal architecture; world-model substrates with multi-time-scale memory parallel HOPE's nested-cycle inspiration. The architectural-bet space is converging from two directions.
- PRL-Bench physics benchmark (04-20) — physics evaluation for agents. World-model evaluation needs the same kind of structured benchmark; pure video-fidelity metrics (FID/FVD) miss the physical-causality dimension Sora fails on.
- AVR Adaptive Visual Reasoning (04-20) — visual reasoning under uncertainty. World models that handle multimodal uncertainty (diffusion family) vs those that compress it away (JEPA) face the same tradeoff this paper raises.
- CoT degrades spatial reasoning (04-22) — text-trained CoT hurts spatial tasks. World-model arguments from Fei-Fei Li / World Labs cite exactly this evidence: language-only training cannot build true 3D understanding.
Research angles
- State-space world models for trajectory routing. DRAMA gets 105% normalized human at 7M params. If composed with Step-level Optimization (05-02), the small policy could itself be a Mamba world-model — making "imagined rollouts" available to the Stuck/Milestone monitors. No paper has tried this.
- Long-context world-model rollouts. The TransDreamer KV-cache wall is the same wall as long-context RL rollouts. FlashRT-style efficiency primitives should transfer directly.
- PAN-style decoder-as-guardrail in language modeling. Reasoning-mode steering (Compliance vs Sensibility, 05-02) is a kind of latent-prediction problem. Adding a "generative validator" — train the model to also reconstruct the conventional output as a check — is structurally analogous to GLP. Worth testing.