NaNs don’t originate where they appear — they silently propagate across layers torch.autograd.set_detect_anomaly is too slow and often misleading for real debugging A forward hook–based detector can catch NaNs at the exact layer and batch they first occur Overhead is ~3–4 ms per forward pass, far lower than anomaly detection (especially on GPU) Gradient explosion is the real root cause in most cases — catching it early prevents NaNs entirely The system logs structured events (layer, batch, stats) for precise debugging Designed for production: thread-safe, memory-bounded, and scalable It was batch 47,000. A ResNet variant I had been training for six hours on a custom medical imaging dataset. The loss was converging cleanly — 1.4, 1.1, 0.87, 0.73 — and then, nothing. Not an error. Not a crash. Just nan . I added torch.autograd.set_detect_anomaly(True) and restarted.…