vision-audio-video · 2026-05-07 · Tier 3

HERMES++: Unified Driving World Model for 3D Scene Understanding and Generation

Source: HuggingFace Daily Papers (2026-05-07) · Paper: arXiv 2604.28196

TL;DR

HERMES++ unifies 3D scene understanding and future geometry prediction in a single driving world model. It has four main components: a BEV (bird's-eye-view) representation that gives the scene an LLM-compatible spatial structure; LLM-enhanced world queries that transfer knowledge from the understanding branch; a Current-to-Future Link that conditions geometric evolution on semantic context; and Joint Geometric Optimisation, which enforces structural integrity through explicit constraints plus implicit latent regularisation. The paper reports that it outperforms specialist baselines on both future point cloud prediction and 3D scene understanding.
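
For readers unfamiliar with the term, the sketch below shows what a bird's-eye-view layout means in its simplest form: 3D points are binned into a top-down 2D grid and the height axis is dropped. This is a generic illustration only and not the HERMES++ encoder, which is learned. The function name `bev_occupancy`, the grid extents, and the cell size are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres, ego-centred (assumed convention)


def bev_occupancy(
    points: Sequence[Point],
    x_range: Tuple[float, float] = (-4.0, 4.0),  # illustrative extent, not from the paper
    y_range: Tuple[float, float] = (-4.0, 4.0),  # illustrative extent, not from the paper
    cell: float = 1.0,                           # illustrative cell size in metres
) -> List[List[int]]:
    """Count points per top-down grid cell, ignoring height (z)."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = [[0] * nx for _ in range(ny)]
    for x, y, _z in points:
        # Skip points that fall outside the grid extent.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        col = int((x - x_range[0]) / cell)
        row = int((y - y_range[0]) / cell)
        grid[row][col] += 1
    return grid


points = [
    (0.5, 0.5, 1.0),   # same cell as the next point, different height
    (0.7, 0.2, 3.0),
    (-3.5, 3.5, 0.0),  # corner cell
    (10.0, 0.0, 0.0),  # outside the grid, dropped
]
grid = bev_occupancy(points)
```

The two points that differ only in height land in the same cell, which is the property that makes a BEV grid a regular 2D structure a downstream model can consume as a fixed-size map.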

Tier note

Tier 4 (3D mapping / driving). Listed for completeness. The interesting structural claim, that semantic-understanding queries can guide geometric prediction within a unified world model, could plausibly generalise beyond driving, but it is not tested outside that domain.

Related