series about Reinforcement Learning (RL), following Sutton and Barto’s famous book “Reinforcement Learning” [1]. In the previous posts we finished dissecting Part I of the book, which introduces the fundamental solution techniques that form the basis of many RL methods: Dynamic Programming (DP), Monte Carlo methods (MC) and Temporal Difference Learning (TD). What separates Part I from Part II of Sutton’s book, and justifies the distinction, is a constraint on the problem size: Part I covered tabular solution methods, whereas we now dive deeper into this fascinating topic and include function approximation. Specifically, in Part I we assumed that the state space of the problems under investigation was small enough that we could represent it, along with the solutions we found, in a simple table (imagine a table storing a certain “goodness”, a value, for each state). Now, in Part II, we drop this assumption and can thus tackle problems whose state spaces are far too large to fit in a table.…
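To make the contrast between the two parts concrete, here is a minimal sketch in Python. Everything in it is an assumption chosen for illustration: the five-state problem, the `features` map and the hand-picked `weights` do not come from the book, and no learning takes place.

```python
import numpy as np

# Tabular case (Part I): one stored value per state.
n_states = 5  # hypothetical, tiny state space
value_table = np.zeros(n_states)
value_table[2] = 1.0  # an update touches exactly one table entry


# Function approximation (Part II): values are computed from a small
# weight vector instead of being stored per state.
def features(state: float) -> np.ndarray:
    """Hypothetical feature map: a bias term plus the raw state."""
    return np.array([1.0, state])


weights = np.array([0.5, 2.0])  # illustrative weights, not learned here


def approx_value(state: float) -> float:
    """Linear value estimate: dot product of weights and features."""
    return float(weights @ features(state))
```

The table needs one entry per state, so its size grows with the state space. The approximator only stores two weights, and it can return an estimate for states it has never seen, such as `approx_value(3.0)`.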