CUDA Out of Memory at 60% Utilization: Tracing PyTorch GPU Memory Fragmentation

DEV Community·Ingero Team·28 days ago

TL;DR

A PyTorch training job crashes with CUDA error: out of memory at 60-70% GPU memory utilization. nvidia-smi says there is free memory. torch.cuda.memory_summary() shows fragmented blocks. But neither tool explains why it happened or when it started. Tracing every cudaMalloc and cudaFree call at the kernel level via eBPF uprobes reveals the exact allocation pattern that caused fragmentation and which code path triggered it.

The Problem

A model trains fine for hours, then suddenly:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.90 GiB total capacity; 10.24 GiB already allocated; 1.89 GiB free; 11.52 GiB reserved)

Wait. 1.89 GiB free, but can't allocate 256 MiB?

That's memory fragmentation. The free memory exists, but it's scattered across hundreds of small non-contiguous blocks. No single block is large enough.

This is the #1 GPU debugging pain point for ML engineers. Everyone hits it.…
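The fragmentation described above can be illustrated with a toy model. This is a minimal sketch in pure Python, not PyTorch's caching allocator and not the tracing approach from the article: the pool size, the layout of live blocks, and the helper names `free_gaps` and `can_allocate` are all hypothetical, chosen only to show how total free memory can exceed a request while no single contiguous gap is large enough to satisfy it.

```python
# Toy model of a fragmented memory pool (illustrative only; this is NOT
# how PyTorch's caching allocator is implemented). All sizes are in MiB.

def free_gaps(total_mib, live):
    """Return the sizes of the free gaps between live allocations.

    `live` is a list of (offset_mib, size_mib) tuples, assumed non-overlapping.
    """
    gaps = []
    cursor = 0
    for offset, size in sorted(live):
        if offset > cursor:
            gaps.append(offset - cursor)  # free space before this block
        cursor = offset + size
    if cursor < total_mib:
        gaps.append(total_mib - cursor)   # free space after the last block
    return gaps


def can_allocate(gaps, request_mib):
    """A contiguous request succeeds only if one gap is large enough."""
    return any(gap >= request_mib for gap in gaps)


# Hypothetical layout: a 4096 MiB pool in which every other 128 MiB slot
# is still occupied by a live tensor.
total = 4096
live = [(offset, 128) for offset in range(0, total, 256)]

gaps = free_gaps(total, live)
print(sum(gaps), max(gaps), can_allocate(gaps, 256))
# prints: 2048 128 False
```

In this made-up layout half the pool (2048 MiB) is free, yet the largest contiguous gap is 128 MiB, so a 256 MiB request fails. The error message above has the same shape: the failure depends on the largest contiguous block, not on the free total.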
