In the previous article , we explored an example where reinforcement learning is required and standard methods do not work. In this article, we will understand why policy gradients are needed, and why the standard backpropagation method does not work in certain situations. How Backpropagation Normally Works Assume we have the following training data, where the desired outputs are already known: Input (Hunger) Output p(B) 0.0 0 1.0 1 0.1 0 0.9 1 With this data, we can feed the input values into the neural network one at a time. The neural network produces an output, and we compare it with the ideal output value from the training data. Using this difference, we can measure how wrong the network is. Using Derivatives to Update the Bias We can calculate these differences for different values of the bias and visualize how the error changes as the bias changes. From this graph, we can calculate the derivative .…