Hey r/ML, I spent the last few months building a tool that hooks into PyTorch training loops to automatically detect and localize failures (vanishing gradients, exploding gradients, data anomalies). Along the way, I learned some things about training failure diagnosis that might be useful even if you never use the tool.

The key insight: most training failures are local, not global

When your loss spikes or vanishes, the natural instinct is to look at the loss curve. But the loss is a global aggregate: it tells you something went wrong, but not where. In my testing across hundreds of synthetic failure scenarios, the actual root cause is almost always localized to a specific layer at a specific step:

- Vanishing gradients: the failure starts at the deepest layer with saturated activations, then propagates backward
- Exploding gradients: the failure starts at the layer with the highest gradient norm, then propagates forward
- Data anomalies: the failure starts at the input layer, then corrupts everything downstream…
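
To make the "specific layer at a specific step" idea concrete, here's a rough sketch in plain Python. This is not how the tool itself works, just an illustration: the function name `localize_failure` and both thresholds are made up for the example, and it only looks at gradient norms (no activation saturation or input checks). It assumes you already log one gradient norm per layer per step; in PyTorch you could collect those after `loss.backward()` with something like `p.grad.norm().item()` over `model.named_parameters()`.

```python
import math

def localize_failure(history, vanish_below=1e-7, explode_above=1e3):
    """Return (step, layer, kind) for the earliest anomalous step, or None.

    history: list of (step, {layer_name: grad_norm}) in training order.
    Thresholds are illustrative assumptions, not recommended values.
    """
    for step, norms in history:
        # Non-finite norms (NaN/inf) are treated as exploding.
        exploding = {name: n for name, n in norms.items()
                     if not math.isfinite(n) or n > explode_above}
        if exploding:
            # Rank non-finite values as +inf so NaN cannot confuse max().
            layer = max(exploding,
                        key=lambda k: exploding[k] if math.isfinite(exploding[k]) else math.inf)
            return step, layer, "exploding"

        vanishing = {name: n for name, n in norms.items() if n < vanish_below}
        if vanishing:
            # Report the layer with the smallest gradient norm.
            layer = min(vanishing, key=vanishing.get)
            return step, layer, "vanishing"
    return None

# Hypothetical logged norms for a 3-layer MLP (values invented for the example).
history = [
    (0, {"fc1": 0.8, "fc2": 1.1, "fc3": 0.9}),
    (1, {"fc1": 0.9, "fc2": 1.3, "fc3": 1.0}),
    (2, {"fc1": 40.0, "fc2": 5200.0, "fc3": 310.0}),
    (3, {"fc1": float("nan"), "fc2": float("nan"), "fc3": float("nan")}),
]

print(localize_failure(history))  # (2, 'fc2', 'exploding')
```

Note that by step 3 every layer is NaN and the loss curve tells you nothing about where it started, while the per-layer log still points at `fc2` at step 2.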