The prevailing view has been that autonomous‑driving world models must choose between two extremes: a perception‑only pipeline that reconstructs the current bird’s‑eye‑view (BEV) layout, or a generative model that rolls future geometry forward without a semantic grasp of the scene. HERMES++ demonstrates that a single network can fill both roles, answering natural‑language queries while extrapolating the road ahead. Earlier scene‑understanding systems relied on dense BEV encoders tuned for detection and segmentation, whereas future‑prediction work such as point‑cloud roll‑outs treated the problem as a purely geometric sequence, often ignoring high‑level intent. Large language models, meanwhile, excel at reasoning over text but have no built‑in notion of spatial dynamics, which leaves a gap between semantic instruction and physical simulation. HERMES++ closes that gap with three key mechanisms.…