Introduction

You've written model.to('cuda') a hundred times. You've celebrated when training loss went down. You've cursed when CUDA out of memory killed your run at 3am.

But here's a question: do you actually know what happened inside that GPU? Not vaguely. Not "it's parallel" as a hand-wave. Do you know why a 4096×4096 matrix multiply finishes in 12 milliseconds on a GPU but takes 800 milliseconds on a CPU, with the same math, the same numbers, and the same code structure?

If not, you're driving a Formula 1 car using only first gear. And that's exactly what most ML engineers do.

This article is the foundation. Everything else in GPU optimization (mixed precision, FlashAttention, quantization, vLLM) is just a clever trick that exploits something about how GPUs physically work. If you understand the machine, the tricks become obvious. If you don't, they're just magic spells you copy from blog posts.…
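To put the 12 ms and 800 ms figures in perspective, here is a rough back-of-the-envelope sketch of the throughput they imply. It assumes the usual convention of counting a dense n×n matrix multiply as about 2n³ floating-point operations (n³ multiplies plus roughly n³ adds). The helper names `matmul_flops` and `implied_tflops` are made up for illustration, and the timings are the ones quoted in the text, not measurements.

```python
def matmul_flops(n: int) -> int:
    """Approximate FLOPs for an n x n by n x n dense matmul (2 * n^3 convention)."""
    return 2 * n ** 3


def implied_tflops(n: int, seconds: float) -> float:
    """Throughput in TFLOP/s implied by finishing the matmul in `seconds`."""
    return matmul_flops(n) / seconds / 1e12


n = 4096
gpu_seconds = 12e-3   # figure quoted in the text, not measured here
cpu_seconds = 800e-3  # figure quoted in the text, not measured here

print(f"work:    {matmul_flops(n) / 1e9:.1f} GFLOP")
print(f"GPU:     {implied_tflops(n, gpu_seconds):.2f} TFLOP/s")
print(f"CPU:     {implied_tflops(n, cpu_seconds):.2f} TFLOP/s")
print(f"speedup: {cpu_seconds / gpu_seconds:.0f}x")
```

Under these assumptions the same roughly 137 GFLOP of work comes out at about 11.5 TFLOP/s on the GPU versus about 0.17 TFLOP/s on the CPU, a gap of roughly 67x. The arithmetic only restates the quoted timings; real numbers will vary with hardware, data type, and library.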