TL;DR: A .cpu().numpy() call buried inside a forward pass was forcing a full CPU-GPU synchronization on every batch, every loop iteration. The GPU would finish its work in milliseconds, then sit idle for ~2 seconds waiting for Python and NumPy to catch…