PyTorch NaNs Are Silent Killers — So I Built a 3ms Hook to Catch Them at the Exact Layer | Towards Data Science

Towards Data Science·Emmimal P Alexander·about 1 month ago

NaNs don’t originate where they appear: they silently propagate across layers.
torch.autograd.set_detect_anomaly is too slow and often misleading for real debugging.
A forward hook–based detector can catch NaNs at the exact layer and batch where they first occur.
Overhead is ~3–4 ms per forward pass, far lower than anomaly detection (especially on GPU).
Gradient explosion is the real root cause in most cases; catching it early prevents NaNs entirely.
The system logs structured events (layer, batch, stats) for precise debugging.
Designed for production: thread-safe, memory-bounded, and scalable.

It was batch 47,000. A ResNet variant I had been training for six hours on a custom medical imaging dataset. The loss was converging cleanly (1.4, 1.1, 0.87, 0.73) and then, nothing. Not an error. Not a crash. Just nan.

I added torch.autograd.set_detect_anomaly(True) and restarted.…
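
The forward-hook idea from the takeaways above can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the names NaNDetector, UnsafeSqrt, and first_event are hypothetical, and the sketch is neither thread-safe nor memory-bounded. It relies only on nn.Module.register_forward_hook, torch.isfinite, torch.isnan, and torch.isinf, and records the first leaf module whose output contains a non-finite value, together with the batch index.

```python
import torch
import torch.nn as nn


class NaNDetector:
    """Hypothetical helper: records the first layer that emits NaN/Inf."""

    def __init__(self, model: nn.Module):
        self.batch_idx = 0        # set by the training loop before each forward pass
        self.first_event = None   # dict describing the first non-finite output seen
        self._handles = []
        for name, module in model.named_modules():
            # Hook leaf modules only, so the reported name is the innermost layer.
            if len(list(module.children())) == 0:
                handle = module.register_forward_hook(self._make_hook(name))
                self._handles.append(handle)

    def _make_hook(self, name):
        def hook(module, inputs, output):
            # Only the first occurrence matters; later layers just inherit it.
            if self.first_event is not None:
                return
            if not isinstance(output, torch.Tensor):
                return
            # Note: evaluating this in Python forces a device sync on GPU.
            if not torch.isfinite(output).all():
                self.first_event = {
                    "layer": name,
                    "batch": self.batch_idx,
                    "nan_count": int(torch.isnan(output).sum()),
                    "inf_count": int(torch.isinf(output).sum()),
                }
        return hook

    def remove(self):
        """Detach all hooks from the model."""
        for handle in self._handles:
            handle.remove()
        self._handles.clear()


class UnsafeSqrt(nn.Module):
    """Toy layer (assumption for the demo): sqrt yields NaN for negative inputs."""

    def forward(self, x):
        return torch.sqrt(x)


model = nn.Sequential(nn.Identity(), UnsafeSqrt(), nn.Linear(4, 2))
detector = NaNDetector(model)

# Two harmless batches, then one with negative values that breaks UnsafeSqrt.
batches = [torch.ones(2, 4), torch.ones(2, 4), -torch.ones(2, 4)]
with torch.no_grad():
    for batch_idx, batch in enumerate(batches):
        detector.batch_idx = batch_idx
        model(batch)

detector.remove()
print(detector.first_event)
```

In this toy run the final Linear layer also produces NaNs on the last batch, but the detector reports the UnsafeSqrt layer (named "1" inside the Sequential), which is where the values first went bad. A production version along the lines the article describes would presumably also need locking and a bounded event log; those are left out here.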
