Reinforcement Learning (RL) is the driving algorithm behind AlphaGo, the machine that beat a Go master. In this article, we explore how the components of an RL system come together in an algorithm that is able to learn.
Our goal in this series is to gain a better understanding of how DeepMind constructed a learning machine, AlphaGo, that was able to beat a world-class Go master. In the first article, we discussed why AlphaGo’s victory represents a breakthrough in computer science. In the second article, we attempted to demystify machine learning (ML) in general, and reinforcement learning (RL) in particular, by providing a 10,000-foot view of traditional ML and unpacking the main components of an RL system. We discussed how RL agents operate in a flowchart-like world represented by a Markov Decision Process (MDP), and how they seek to optimize their decisions by determining which action in any given state yields the most cumulative future reward. We also defined two important functions that RL agents use to guide their actions: the state-value function (represented mathematically as V) and the action-value function (represented as Q). In this article, we’ll put all the pieces together to explain how a self-learning algorithm works.
The state-value and action-value functions are the critical bits that make RL tick. These functions quantify how much each state or action is estimated to be worth in terms of its anticipated, cumulative, future reward. Choosing an action that leads the agent to a state with a high state-value is tantamount to making a decision that maximizes long-term reward, so getting these functions right is essential. The challenge, however, is that figuring out V and Q is difficult. In fact, one of the main areas of focus in the field of reinforcement learning is finding better and faster ways to accomplish this.
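To make the role of Q concrete, here is a minimal Python sketch of how an agent might use action-value estimates to pick its next move. The state name, the action names, and the numbers in `q_values` are all made up for illustration, and `greedy_action` is a hypothetical helper, not something taken from AlphaGo.

```python
# Hypothetical action-value estimates Q(s, a) for a single state "A".
# The numbers are invented purely for illustration.
q_values = {
    ("A", "left"): 1.5,
    ("A", "right"): 4.0,
    ("A", "stay"): 0.2,
}


def greedy_action(q, state):
    """Return the action with the highest estimated value in `state`."""
    # Keep only the entries that belong to the requested state.
    candidates = {a: v for (s, a), v in q.items() if s == state}
    # Pick the action whose estimated cumulative future reward is largest.
    return max(candidates, key=candidates.get)


best = greedy_action(q_values, "A")
print(best)
```

The selection step itself is trivial. All of the difficulty lies in obtaining trustworthy numbers to put in the table in the first place.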
One challenge faced when calculating V and Q is that the value of a given state, let’s say state A, depends on the value of other states, and the values of these other states in turn depend on the value of state A. This results in a classic chicken-or-the-egg problem: The value of state A depends on the value of state B, but the value of state B depends on the value of state A. The definitions are circular.
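The circularity is easier to see in a toy example. The sketch below assumes a made-up world with just two states, A and B, where each state always leads to the other, the agent collects a fixed reward on leaving each state, and future reward is discounted by a factor `GAMMA`. All of these names and numbers are assumptions chosen for illustration. One simple way around the chicken-or-the-egg problem, shown here, is to start from arbitrary guesses and repeatedly refresh each value from the current guess of the other.

```python
GAMMA = 0.9                          # assumed discount factor
REWARD = {"A": 1.0, "B": 2.0}        # assumed reward for leaving each state
NEXT_STATE = {"A": "B", "B": "A"}    # A always leads to B, and B back to A


def evaluate(sweeps=200):
    """Estimate V(A) and V(B) by repeated substitution."""
    # Neither value is known up front, so start from a guess of zero.
    values = {"A": 0.0, "B": 0.0}
    for _ in range(sweeps):
        # V(s) = reward(s) + GAMMA * V(next state), using the previous guesses.
        values = {
            s: REWARD[s] + GAMMA * values[NEXT_STATE[s]]
            for s in values
        }
    return values


values = evaluate()
print(values)
```

In this toy setup the guesses settle down because each pass shrinks the remaining error by the discount factor, so the circular definition still pins down a single pair of values.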