TL;DR: torch.compile gave us a 2.3x speedup on our SDXL pipeline in benchmarks, then quietly recompiled 38 times across the first 100 production requests because every customer uploads a product photo at a different resolution. The fix wasn't turning compile off. It was understanding what counts as a guard, bucketing inputs to fixed shapes, and reading the recompilation logs PyTorch 2.3 gives you for free.

The benchmark that lied to me

At Photoroom we run diffusion models for product photography. Someone uploads a sneaker on a kitchen table, and the model gives it a clean studio background. The UNet is the heavy part, so when PyTorch 2.3 promised free speedups through torch.compile, I spent a week wiring it in.

The benchmark looked great. Fixed 1024x1024 input, batch size 4, an A10G. 2.3x faster than eager mode after warmup.

I shipped it to a 5% canary. p99 latency went up. Not by a little. Some requests took 70 seconds longer than before the change.…
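
To make the "bucketing inputs to fixed shapes" idea from the TL;DR concrete, here is a minimal sketch in plain Python. It is not our production code: the bucket list and the names `BUCKETS`, `pick_bucket`, and `padding_for` are hypothetical, chosen only for illustration. The point is that if every request is snapped to one of a handful of shapes, the compiled model sees at most that many distinct input shapes, however varied the uploads are.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, int]  # (height, width)

# Hypothetical bucket list. A real one should be derived from observed traffic.
BUCKETS: Sequence[Shape] = [
    (512, 512),
    (768, 768),
    (1024, 768),
    (768, 1024),
    (1024, 1024),
]


def pick_bucket(height: int, width: int, buckets: Sequence[Shape] = BUCKETS) -> Shape:
    """Return the smallest-area bucket that fully contains (height, width)."""
    fitting = [b for b in buckets if b[0] >= height and b[1] >= width]
    if not fitting:
        # Assumption: oversized inputs are rejected here; downscaling first
        # would be an equally reasonable policy.
        raise ValueError(f"no bucket fits {height}x{width}")
    return min(fitting, key=lambda b: b[0] * b[1])


def padding_for(height: int, width: int, bucket: Shape) -> Shape:
    """Rows and columns of padding needed to grow the input to the bucket."""
    return bucket[0] - height, bucket[1] - width


if __name__ == "__main__":
    for h, w in [(900, 700), (513, 480), (1024, 1024)]:
        bucket = pick_bucket(h, w)
        print((h, w), "->", bucket, "pad", padding_for(h, w, bucket))
```

Padding (or resizing) the image to the chosen bucket before it reaches the compiled UNet is what keeps the set of shapes small. How many buckets to use is a trade-off between wasted compute on padding and the number of compilations paid at warmup.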