Proximal Policy Optimization (PPO)

The next step after implementing A2C is to move on to Proximal Policy Optimization (PPO). Introduced in a paper by OpenAI researchers in 2017, it has since become a very popular RL algorithm. It can be understood as a simplified variant of Trust Region Policy Optimization (TRPO), and one of its main advantages is improved sample efficiency: although it is an on-policy algorithm, it can robustly learn multiple times from a batch of generated samples, unlike A2C and REINFORCE. ...
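For reference, the clipped surrogate objective from the 2017 PPO paper, which replaces TRPO's explicit trust-region constraint and is what makes repeated updates on one batch of samples safe, reads

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat A_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat A_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\]

where \(\hat A_t\) is the advantage estimate and \(\epsilon\) is the clipping parameter (typically around 0.2).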

May 25, 2025 · cfh

Multi-Step Bootstrapping

Until now, we’ve done one-step lookahead for the TD bootstrapping in the A2C algorithm. We can significantly improve upon this by looking further ahead. Bootstrapping with one step Looking back at the states-values-rewards diagram in Implementing A2C, we had state \(s_i\) transitioning into state \(s_{i+1}\) with an immediate reward \(R_i\). How we actually implemented bootstrapping was subtly different and better described by this diagram: [Figure: states and rewards diagram for our own states \(s_i\) and the opponent states \(s'_i\).] ...
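For reference, the generic one-step bootstrapped target behind this (glossing over the own/opponent-state bookkeeping in the diagram, and writing \(\gamma\) for an optional discount factor) is \(R_i + \gamma\, v(s_{i+1})\); looking \(n\) steps ahead generalizes it to the standard n-step return

\[
G_i^{(n)} = R_i + \gamma R_{i+1} + \cdots + \gamma^{n-1} R_{i+n-1} + \gamma^n\, v(s_{i+n}),
\]

which trades a little extra variance for less bias from the bootstrapped value estimate.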

May 11, 2025 · cfh

Evaluating A2C versus REINFORCE with Baseline

With our implementation of A2C ready to go, let’s see it in action. Runnable Example connect-zero/train/example4-a2c.py The setup Let’s set the ground rules: ...

May 11, 2025 · cfh

Implementing A2C

In the previous post, we outlined the general concept of Actor-Critic algorithms and A2C in particular; it’s time to implement a simple version of A2C in PyTorch. Changing the reward function As we noted, our model class doesn’t need to change at all: it already has the requisite value head we introduced when we implemented REINFORCE with baseline. First off, we need to change the way the rewards are computed. We introduce a flag BOOTSTRAP_VALUE which is on when we use A2C. Based on this, we compute the rewards vector for a game with an outcome of +1 for a win, 0 for a draw, and -1 for a loss like this: ...
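A minimal sketch of what such a switch could look like (the flag name BOOTSTRAP_VALUE comes from the post; GAMMA, the tensor layout, and the handling of opponent moves are illustrative assumptions, not the actual implementation):

```python
import torch

BOOTSTRAP_VALUE = True   # flag from the post; True selects the A2C targets
GAMMA = 0.95             # assumed discount factor

def reward_targets(outcome: float, values: torch.Tensor) -> torch.Tensor:
    """Per-move reward targets for one game.

    outcome: +1 for a win, 0 for a draw, -1 for a loss (credited to the last move).
    values:  detached critic estimates v(s_i), one per move made in the game.
    """
    n = values.shape[0]
    targets = torch.zeros(n)
    targets[-1] = outcome
    if BOOTSTRAP_VALUE:
        # A2C: non-final moves bootstrap from the critic, R_i + gamma * v(s_{i+1})
        targets[:-1] = GAMMA * values[1:]
    else:
        # REINFORCE-style: discounted final outcome propagated back to every move
        for i in reversed(range(n - 1)):
            targets[i] = GAMMA * targets[i + 1]
    return targets
```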

May 10, 2025 · cfh

Actor-Critic Algorithms

After implementing and evaluating REINFORCE with baseline, we found that it can produce strong models, but takes a long time to learn an accurate value function due to the high variance of the Monte Carlo samples. In this post, we’ll look at Actor-Critic methods, and in particular the Advantage Actor-Critic (A2C) algorithm, a synchronous version of the earlier Asynchronous Advantage Actor-Critic (A3C) method, as a way to remedy this. Before we start, recall that we introduced a value network as a component of our model; this remains the same for A2C, and in fact we don’t need to modify the network architecture at all to use this newer algorithm. Our model still consists of a residual CNN backbone, a policy head and a value head. The value head serves as the “critic,” whereas the policy head is the “actor”. ...
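The interface that matters for A2C is simply that a single forward pass yields both outputs; roughly (illustrative names, not the actual Connect4CNN code):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Illustrative shape of the shared-backbone actor-critic model."""

    def __init__(self, backbone: nn.Module, policy_head: nn.Module, value_head: nn.Module):
        super().__init__()
        self.backbone = backbone        # residual CNN backbone, shared by both heads
        self.policy_head = policy_head  # "actor": logits over the 7 columns
        self.value_head = value_head    # "critic": scalar value estimate v(s)

    def forward(self, board: torch.Tensor):
        features = self.backbone(board)
        return self.policy_head(features), self.value_head(features)
```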

May 8, 2025 · cfh

Implementing and Evaluating REINFORCE with Baseline

Having introduced REINFORCE with baseline on a conceptual level, let’s implement it for our Connect 4-playing CNN model. Runnable Example connect-zero/train/example3-rwb.py Adding the value head In the constructor of the Connect4CNN model class, we set up the new network for estimating the board state value \(v(s)\) which will consume the same 448 downsampled features that the policy head receives: ...
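A minimal sketch of such a value head, assuming the 448 shared features mentioned in the post; the hidden width is an illustrative guess, and only the single tanh-squashed output is dictated by the \(v(s) \in (-1, 1)\) convention:

```python
import torch.nn as nn

value_head = nn.Sequential(
    nn.Linear(448, 64),   # 448 downsampled features shared with the policy head
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Tanh(),            # squash the estimate v(s) into (-1, 1)
)
```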

May 1, 2025 · cfh

REINFORCE with Baseline

In the previous post, we introduced a stronger model but observed that it’s quite challenging to achieve a high level of play with basic REINFORCE, due to the high variance and noisy gradients of the algorithm, which often lead to unstable learning and slow convergence. Our first step towards more advanced algorithms is a modification called “REINFORCE with baseline” (see, e.g., Sutton et al. (2000)). The value network Given a board state \(s\), recall that our model currently outputs seven raw logits which are then transformed via softmax into the probability distribution \(p(s)\) over the seven possible moves. Many advanced algorithms in RL assume that our network also outputs a second piece of information: the value \(v(s)\), a number between -1 and 1 which, roughly speaking, gives an estimate of how confident the model is in winning from the current position. ...
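For reference, the textbook policy-gradient update this leads to subtracts the learned value as a baseline from the Monte Carlo return \(G_i\):

\[
\nabla_\theta J(\theta) \;\approx\; \sum_i \big(G_i - v(s_i)\big)\, \nabla_\theta \log \pi_\theta(a_i \mid s_i),
\]

so moves that turned out better than the critic expected get reinforced, and worse-than-expected moves get discouraged.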

April 29, 2025 · cfh

Introducing a Benchmark Opponent

Last time we saw how the entropy bonus enables self-play training without running into policy collapse. However, the model we trained was quite small and probably not capable of very strong play. Before we dive into the details of an improved model architecture, it would be very helpful to have a decent, fixed benchmark to gauge our progress. A benchmark opponent The only model with fixed performance we have right now is the RandomPlayer from the basic setup post. Obviously, that’s not a challenging bar to clear. But it turns out that with some small tweaks, we can turn the fully random player into a formidable opponent for our starter models. ...

April 26, 2025 · cfh

Entropy Regularization

Based on our discussion of entropy, our plan is to implement entropy regularization via an entropy bonus in our loss function. Runnable Example connect-zero/train/example2-entropy.py Implementing the entropy bonus The formula for entropy which we have to implement, ...
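A minimal sketch of such an entropy bonus in PyTorch (ENTROPY_BETA and the exact way it is folded into the loss are illustrative assumptions, not the post's actual code):

```python
import torch

ENTROPY_BETA = 0.01  # assumed weight of the entropy bonus

def add_entropy_bonus(policy_loss: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """logits: raw move logits of shape (batch, 7), one row per board state."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()  # H(p) = -sum_i p_i log p_i
    # Subtracting the scaled entropy rewards the policy for staying spread out
    return policy_loss - ENTROPY_BETA * entropy
```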

April 24, 2025 · cfh

On Entropy

Last time, we ran our first self-play training loop on a simple MLP model and observed catastrophic policy collapse. Let’s first understand some of the math behind what happened, and then how to combat it. What is entropy? Given a probability distribution \(p=(p_1,\ldots,p_C)\) over a number of categories \(i=1,\ldots,C\), such as the distribution over the columns our Connect 4 model outputs for a given board state, entropy measures the “amount of randomness” and is defined as ...
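The definition the excerpt cuts off at is the standard one (up to the choice of logarithm base):

\[
H(p) = -\sum_{i=1}^{C} p_i \log p_i .
\]

It is maximal for the uniform distribution and zero when all probability mass sits on a single move.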

April 23, 2025 · cfh