Tracing torch.cuda.empty_cache() on an RTX 4090 - Where Do the 53 MB Go?

DEV Community: pytorch·Ingero Team·3 days ago

TL;DR: After del tensor; torch.cuda.empty_cache(), PyTorch's caching allocator still holds 53.7 MB that it won't release. We traced the CUDA Runtime and Driver APIs with eBPF uprobes to see exactly what happens at the kernel level during the free path. The trace showed cudaFree calls hitting p99 = 1.9 ms (4.6x their median) because the process keeps getting descheduled mid-free. The allocator isn't broken - the OS is interrupting it.

The Issue

pytorch/pytorch#173382 - a user calls torch.cuda.empty_cache() after deleting tensors, but GPU memory stays allocated. The caching allocator's empty_cache() only releases blocks it has marked as free, yet the user sees a persistent gap between "allocated" and "reserved" memory.

We traced what happens when torch.cuda.empty_cache() runs on an RTX 4090 and measured exactly how much GPU memory it reclaims. The docs say it releases "unoccupied cached memory." But how do you tell which blocks are occupied, which are free, and what's holding them?…
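The gap between "allocated" and "reserved" memory can be observed from Python before reaching for a tracer. The following is a minimal sketch, not the setup used for the trace: the helper names cached_but_unused and snapshot and the 256 MiB buffer size are hypothetical choices for illustration, and the numbers it prints depend on the GPU, driver, and PyTorch version, so it is not expected to reproduce the 53.7 MB figure.

```python
from typing import Dict, Optional

MIB = 1024 * 1024


def cached_but_unused(allocated_bytes: int, reserved_bytes: int) -> int:
    """Bytes the caching allocator has reserved but not handed to live tensors."""
    return reserved_bytes - allocated_bytes


def snapshot() -> Optional[Dict[str, int]]:
    """Read the allocator counters, or return None when torch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    return {
        "allocated": allocated,
        "reserved": reserved,
        "gap": cached_but_unused(allocated, reserved),
    }


def show(label: str) -> None:
    """Print one line of counters in MiB for the given step."""
    stats = snapshot()
    if stats is None:
        print(f"{label}: CUDA not available")
        return
    print(
        f"{label}: allocated={stats['allocated'] / MIB:.1f} MiB, "
        f"reserved={stats['reserved'] / MIB:.1f} MiB, "
        f"gap={stats['gap'] / MIB:.1f} MiB"
    )


if __name__ == "__main__":
    if snapshot() is None:
        print("CUDA not available; nothing to measure")
    else:
        import torch

        # Hypothetical workload: one 256 MiB buffer, size chosen arbitrarily.
        tensor = torch.empty(256 * MIB, dtype=torch.uint8, device="cuda")
        show("after alloc")
        del tensor
        show("after del")  # block returns to the cache, not to the driver
        torch.cuda.empty_cache()
        show("after empty_cache")
```

If the last line still reports a nonzero reserved value, some cached memory was not eligible for release, for example because other live tensors still occupy part of it. For a per-block view, torch.cuda.memory_summary() prints the allocator's own breakdown.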
