Diffusion research has long treated image synthesis and video synthesis as separate engineering problems, each with its own heavyweight model and multi‑step inference pipeline. Recent work shows that a single latent diffusion backbone can be conditioned for both text‑to‑image and high‑resolution video generation while still sampling in just a handful of steps. Historically, image diffusion required dozens of denoising steps, and video diffusion compounded the cost with per‑frame processing or multi‑stage cascades. Acceleration techniques fell into two camps: consistency distillation, which enforces self‑consistency along the entire probability‑flow ODE trajectory, and discrete distribution‑matching distillation, which anchors supervision at a few fixed timesteps. Both approaches either traded fidelity for speed or introduced auxiliary adversarial modules to patch visual artifacts.…