What You'll Build

A complete training loop that processes documents, computes loss, backpropagates gradients, and updates parameters using the Adam optimiser.

Depends On

All previous chapters.

The Training Loop

A training step is just five things in a row:

1. Pick a document and tokenize it
2. Forward pass for each token, building up the loss
3. Backward pass to fill in every gradient
4. Nudge the parameters using those gradients
5. Zero the gradients out before the next step

Step 4 is where Adam lives. Before we look at the code, it's worth slowing down on what Adam actually does and why we use it.

Understanding Adam

You could update parameters with simple gradient descent: p.Data -= learningRate * p.Grad. Adam is smarter in two ways.

Momentum (momentum). Instead of reacting to each individual gradient, Adam tracks a running average of recent gradients. This smooths out noisy updates, like a rolling ball that doesn't reverse direction every time it hits a bump.

Squared gradient average (squaredGradAvg). …