Image-to-video generation is one of the most fascinating applications of generative AI. Let me break down how it works under the hood.

The Core Architecture

Many modern image-to-video models are built on diffusion transformers (DiT). Unlike the U-Net backbones used in earlier image diffusion models, DiTs use transformer blocks to process video latents.

Latent Space

Videos are first encoded into a compressed latent space using a VAE (Variational Autoencoder). A 512x512 video at 24fps becomes a much smaller tensor that the model can work with efficiently. In the shapes below, the time axis shrinks by 4x and each spatial axis by 8x:

    Original: [frames × height × width × channels]
    Latent:   [frames/4 × height/8 × width/8 × latent_dim]

Temporal Attention

The key innovation is temporal attention layers.…
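To make the latent shape arithmetic from the Latent Space section concrete, here is a minimal Python sketch. The 4x temporal and 8x spatial factors come from the shapes given there; the 48-frame clip length, the 3 RGB channels, and `latent_dim=16` are assumptions picked for illustration, and the helper names `latent_shape` and `num_elements` are hypothetical. Real video VAEs may round differently or treat the first frame specially, which this sketch ignores.

```python
import math


def latent_shape(frames, height, width, latent_dim=16, t_factor=4, s_factor=8):
    """Shape of the latent tensor as (frames, height, width, channels).

    t_factor and s_factor follow the frames/4, height/8, width/8 shapes
    in the article; latent_dim=16 is an assumed value, not a fixed standard.
    """
    return (
        frames // t_factor,   # temporal compression
        height // s_factor,   # spatial compression (rows)
        width // s_factor,    # spatial compression (columns)
        latent_dim,           # latent channels
    )


def num_elements(shape):
    """Total number of values in a tensor of the given shape."""
    return math.prod(shape)


# Assumed example: a 2-second 512x512 RGB clip at 24 fps (48 frames).
pixel_shape = (48, 512, 512, 3)
latent = latent_shape(*pixel_shape[:3])

print("pixel shape :", pixel_shape)
print("latent shape:", latent)  # (12, 64, 64, 16)
print("compression :", num_elements(pixel_shape) / num_elements(latent))  # 48.0
```

Under these assumptions the model works on about 48 times fewer values than the raw pixels, which is why running diffusion in latent space is so much cheaper than running it on frames directly.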