Diffusion as a unifying backbone for multimodal generation

Latent diffusion now drives both image synthesis and video creation. Continuous‑time distribution matching reduces diffusion steps to a few while retaining fidelity [1]. Segment‑wise video diffusion extends the same idea to image‑to‑video tasks, cutting inference cost [2]. The gap is conditioning: current models still lack native text or segmentation prompts, limiting end‑to‑end multimodal pipelines.

Modular expert routing and adaptive compute

UniPool replaces per‑layer mixtures of experts with a single shared pool and a pooling loss, shrinking the expert parameter budget without hurting performance [3]. NormRouter further stabilises routing decisions across layers. In sequential decision‑making, FFDC’s verification module compares imagined and observed futures, then shortens or lengthens action chunks on the fly, slashing forward passes while keeping success rates high [4].…
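
The adaptive chunking idea described for FFDC can be illustrated with a small sketch. The text above does not give the actual verification rule, so everything here is an assumption made for illustration: the mean-squared-error discrepancy, the tolerance `tol`, the halve-or-double update, and the function name `adapt_chunk_length` are hypothetical and are not taken from [4].

```python
import numpy as np


def adapt_chunk_length(imagined, observed, chunk_len, *, tol=0.1, min_len=1, max_len=16):
    """Hypothetical rule: shrink the action chunk when the imagined future
    diverges from the observed one, grow it when they agree.

    The error measure (MSE) and the halve/double update are assumptions,
    not the method from the cited work.
    """
    imagined = np.asarray(imagined, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # Discrepancy between the predicted and the actually observed future.
    error = float(np.mean((imagined - observed) ** 2))
    if error > tol:
        # Prediction was unreliable: replan sooner with a shorter chunk.
        new_len = max(min_len, chunk_len // 2)
    else:
        # Prediction held up: commit to a longer chunk, saving forward passes.
        new_len = min(max_len, chunk_len * 2)
    return new_len, error


if __name__ == "__main__":
    print(adapt_chunk_length([0.0, 0.0], [0.0, 0.0], 4))
    print(adapt_chunk_length([0.0, 0.0], [1.0, 1.0], 4))
```

Under these assumptions, a longer chunk means fewer policy forward passes per episode, while the shrink branch limits how long the agent acts on a future that has already proved wrong.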