HERMES++: Unified Driving World Model for 3D Scene Understanding and Generation
Source: HuggingFace Daily Papers (2026-05-07) · Paper: arXiv 2604.28196
TL;DR
HERMES++ unifies 3D scene understanding and future geometry prediction in a single driving world model. It has four main components: a BEV (bird's-eye-view) representation that gives the scene an LLM-compatible spatial structure; LLM-enhanced world queries that transfer knowledge from the understanding branch; a Current-to-Future Link that conditions geometric evolution on semantic context; and Joint Geometric Optimisation, which enforces structural integrity through explicit constraints plus implicit latent regularisation. The paper reports that it outperforms specialist baselines on both future point cloud prediction and 3D scene understanding.
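To make the BEV idea concrete, here is a minimal sketch of how a LiDAR-style point cloud can be flattened into a top-down grid. It is not the HERMES++ encoder, which the summary above does not describe. The function name `points_to_bev`, the 100 m x 100 m extent, the 0.5 m cell size, and the per-cell features (point count and maximum height) are all assumptions chosen for illustration.

```python
import math


def points_to_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), cell=0.5):
    """Rasterise (x, y, z) points into a sparse bird's-eye-view grid.

    Returns a dict mapping (ix, iy) cell indices to (point_count, max_height).
    Hypothetical helper for illustration only; extent and cell size are assumed.
    """
    cells = {}
    for x, y, z in points:
        # Drop points that fall outside the grid extent.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Convert metric coordinates to integer cell indices.
        ix = math.floor((x - x_range[0]) / cell)
        iy = math.floor((y - y_range[0]) / cell)
        # Accumulate a point count and the tallest point seen in this cell.
        count, max_z = cells.get((ix, iy), (0, -math.inf))
        cells[(ix, iy)] = (count + 1, max(max_z, z))
    return cells


demo_points = [
    (0.1, 0.1, 1.0),      # near the ego vehicle
    (0.2, 0.2, 2.0),      # same cell, taller point
    (100.0, 0.0, 0.0),    # outside the grid, ignored
    (-49.9, -49.9, 0.5),  # corner of the grid
]
bev = points_to_bev(demo_points)
```

The vertical axis is collapsed into per-cell features, so the scene becomes a regular 2D layout that can be cut into a fixed number of tokens. That is one plausible reading of "LLM-compatible spatial structure", though the actual tokenisation in the paper may differ.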
Tier note
Tier 4 (3D mapping / driving). Listed for completeness. The interesting structural claim, that semantic-understanding queries can guide geometric prediction within a unified world model, could generalise beyond driving, but it is not tested outside that domain.