TL;DR

A PyTorch training job crashes with "CUDA error: out of memory" at 60-70% GPU memory utilization. nvidia-smi reports free memory, and torch.cuda.memory_summary() shows fragmented blocks, but neither tool explains why the fragmentation happened or when it started. Tracing every cudaMalloc and cudaFree call at the kernel level via eBPF uprobes reveals the exact allocation pattern that caused the fragmentation and which code path triggered it.

The Problem

A model trains fine for hours, then suddenly fails:

    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.90 GiB total capacity; 10.24 GiB already allocated; 1.89 GiB free; 11.52 GiB reserved)

Wait. 1.89 GiB is free, but a 256 MiB allocation fails? That's memory fragmentation. The free memory exists, but it is scattered across hundreds of small non-contiguous blocks, and no single block is large enough to satisfy the request.

This is one of the most common GPU debugging pain points for ML engineers. Everyone hits it.…
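To make the arithmetic behind that error message concrete, here is a minimal toy model of a fragmented free list. It is only a sketch: the block layout (121 free blocks of 16 MiB each) is an illustrative assumption chosen to total roughly 1.89 GiB, and the helper names `can_allocate` and `fragmentation_ratio` are hypothetical, not part of PyTorch or CUDA.

```python
# Toy model of a fragmented memory pool. All sizes are in MiB.
# The layout below is an illustrative assumption, not a real GPU measurement.

def can_allocate(free_blocks_mib, request_mib):
    """A contiguous request succeeds only if a single free block is big enough."""
    return any(block >= request_mib for block in free_blocks_mib)

def fragmentation_ratio(free_blocks_mib):
    """Return 1 - largest_free_block / total_free (0.0 means one contiguous region)."""
    total = sum(free_blocks_mib)
    if total == 0:
        return 0.0
    return 1.0 - max(free_blocks_mib) / total

# Hypothetical layout: about 1.89 GiB free, split into 121 blocks of 16 MiB.
free_blocks = [16] * 121
request = 256  # MiB, the size from the error message

total_free = sum(free_blocks)
print(f"total free: {total_free} MiB, largest block: {max(free_blocks)} MiB")
print(f"can allocate {request} MiB: {can_allocate(free_blocks, request)}")
print(f"fragmentation ratio: {fragmentation_ratio(free_blocks):.3f}")
```

In this model the pool holds far more than 256 MiB in total, yet the request fails because the largest single block is only 16 MiB. On a real job, `torch.cuda.memory_reserved()` minus `torch.cuda.memory_allocated()` gives the bytes PyTorch has cached but is not currently using, which is a rough first signal of the same condition.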