PyTorch NaNs Are Silent Killers — So I Built a 3ms Hook to Catch Them at the Exact Layer


Towards Data Science · Emmimal P Alexander · about 1 month ago




NaNs don’t originate where they appear: they silently propagate across layers.
torch.autograd.set_detect_anomaly is too slow and often misleading for real debugging.
A forward hook–based detector can catch NaNs at the exact layer and batch where they first occur.
Overhead is ~3–4 ms per forward pass, far lower than anomaly detection (especially on GPU).
Gradient explosion is the real root cause in most cases; catching it early prevents NaNs entirely.
The system logs structured events (layer, batch, stats) for precise debugging.
Designed for production: thread-safe, memory-bounded, and scalable.

It was batch 47,000. A ResNet variant I had been training for six hours on a custom medical imaging dataset. The loss was converging cleanly (1.4, 1.1, 0.87, 0.73) and then, nothing. Not an error. Not a crash. Just nan. I added torch.autograd.set_detect_anomaly(True) and restarted.…
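The forward-hook idea described above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the class name `NaNDetector`, its attributes, and the event fields are all hypothetical, and the toy model has a NaN deliberately injected into its last layer so the hook has something to catch. It relies only on the standard `register_forward_hook` API of `torch.nn.Module`.

```python
import threading
from collections import deque

import torch
import torch.nn as nn


class NaNDetector:
    """Hypothetical helper: logs leaf modules whose output contains NaN/Inf."""

    def __init__(self, model: nn.Module, max_events: int = 100):
        self.events = deque(maxlen=max_events)  # memory-bounded event log
        self.batch_idx = 0                      # caller updates this per batch
        self._lock = threading.Lock()           # guards the shared event log
        self._handles = []
        for name, module in model.named_modules():
            # Hook only leaf modules so each event names a concrete layer.
            if len(list(module.children())) == 0:
                handle = module.register_forward_hook(self._make_hook(name))
                self._handles.append(handle)

    def _make_hook(self, name: str):
        def hook(module, inputs, output):
            # Skip non-tensor or integer outputs; this sketch ignores tuples.
            if not isinstance(output, torch.Tensor):
                return
            if not output.is_floating_point():
                return
            if bool(torch.isfinite(output).all()):
                return
            event = {
                "layer": name,
                "batch": self.batch_idx,
                "nan_count": int(torch.isnan(output).sum()),
                "inf_count": int(torch.isinf(output).sum()),
            }
            with self._lock:
                self.events.append(event)
        return hook

    def remove(self):
        """Detach all hooks from the model."""
        for handle in self._handles:
            handle.remove()
        self._handles.clear()


# Toy model (assumption for illustration): poison the last layer's weights
# so the first non-finite output appears at layer "2", not earlier.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
with torch.no_grad():
    model[2].weight.fill_(float("nan"))

detector = NaNDetector(model)
detector.batch_idx = 0
model(torch.randn(3, 4))
detector.remove()

for event in detector.events:
    print(event)
```

Because hooks fire in forward order, the first event in the log identifies the layer where non-finite values first appeared in that batch. Note that `bool(...)` on the finiteness check forces a host-device synchronization on GPU, so the per-pass overhead of this sketch will not necessarily match the figure quoted above.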
