Recent advances in large-scale diffusion models have revolutionized generative AI across multiple domains, from image synthesis to audio generation, 3D asset creation, molecular design, and beyond. These models have demonstrated unprecedented capabilities in producing high-quality, diverse outputs across a wide range of conditional generation tasks. Despite these successes, sampling inefficiency remains a fundamental bottleneck. Standard diffusion models require tens to hundreds of iterative denoising steps, leading to high inference latency and substantial computational cost. This limits practical deployment in interactive applications, on edge devices, and in large-scale production systems. Video generation faces an especially critical challenge. Open-source models such as NVIDIA Cosmos, along with commercial text-to-video (T2V) systems, have shown remarkable generation capabilities. However, video diffusion models are orders of magnitude more computationally demanding than their image counterparts due to the added temporal dimension.…