Why Your Diffusion Model Is Slow at Inference (And It's Not the UNet)


DEV Community·Elise Moreau·about 1 month ago

#machinelearning #pytorch #computervision #mlops #torch #first


TL;DR: Most inference bottlenecks in diffusion pipelines are not in the UNet denoising loop. They are in the VAE decoder, the text encoder on first call, and CPU-GPU synchronization between steps. Profile before you optimize. A 30% speedup often comes from fixing the 5% of the code nobody looks at.

I spent three weeks last month trying to make a Stable Diffusion XL variant run faster on an A10G. The model was trained in-house for product photography. Inference was around 4.2 seconds per image at 1024x1024, 30 steps. The target was under 2 seconds.

My first instinct was wrong. I went straight to the UNet. I compiled it with torch.compile, tried different attention implementations, and looked at FlashAttention-3. I got it from 3.1s to 2.7s on the UNet alone. Nice. But total pipeline time barely moved.

Then I actually profiled.

What the profile showed

    import torch
    from torch.profiler import profile, ProfilerActivity

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.…
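The article's own profiler snippet is cut off above, so what follows is not the author's code. It is a minimal sketch of the "profile before you optimize" idea: time each pipeline stage separately and look at its share of the total. The `StageTimer` class, the stage names, and the `time.sleep` stand-ins are all hypothetical. The one real detail it relies on is that CUDA kernels launch asynchronously, so a wall-clock timer needs a blocking call such as `torch.cuda.synchronize` on both sides of a stage, otherwise the time gets attributed to whichever later stage happens to wait on the GPU.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self, sync=None):
        # sync: optional callable that blocks until queued GPU work is done.
        # On a CUDA machine you would pass torch.cuda.synchronize here;
        # the default is a no-op so the sketch also runs on CPU only.
        self.sync = sync if sync is not None else (lambda: None)
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        self.sync()  # drain earlier work so it is not billed to this stage
        start = time.perf_counter()
        try:
            yield
        finally:
            self.sync()  # wait for this stage's own work to finish
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


def demo():
    # Stand-in workloads: the sleeps are placeholders, not real measurements.
    timer = StageTimer()
    with timer.stage("text_encoder"):
        time.sleep(0.01)
    for _ in range(3):  # placeholder for the denoising loop
        with timer.stage("unet_step"):
            time.sleep(0.01)
    with timer.stage("vae_decode"):
        time.sleep(0.02)
    for name, share in sorted(timer.shares().items(), key=lambda kv: -kv[1]):
        print(f"{name:>14}: {share:6.1%}")


if __name__ == "__main__":
    demo()
```

The synchronization calls themselves stall the pipeline, so a timer like this is for measurement runs only. For kernel-level detail, `torch.profiler` is the better tool.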
