Proximal Policy Optimization (PPO)

The next step after implementing A2C is to move on to Proximal Policy Optimization (PPO). Introduced in a paper by OpenAI researchers in 2017, it has since become a very popular RL algorithm. It can be understood as a simplified variant of Trust Region Policy Optimization (TRPO), and one of its main advantages is improved sample efficiency: although it is an on-policy algorithm, it can robustly learn multiple times from a batch of generated samples, unlike A2C and REINFORCE. ...
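
For reference, a minimal sketch of PPO’s clipped surrogate objective, the piece that makes repeated updates on the same batch safe (the function and variable names below are assumptions, not the post’s actual code):

import torch

CLIP_EPS = 0.2  # assumed clipping parameter

def ppo_policy_loss(logp_new, logp_old, adv):
    # logp_new / logp_old: log-probabilities of the taken actions under the
    # current policy and the data-collecting policy; adv: advantage estimates
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS) * adv
    return -torch.min(unclipped, clipped).mean()                  # maximize the surrogate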

May 25, 2025 · cfh

Multi-Step Bootstrapping

Until now, we’ve done one-step lookahead for the TD bootstrapping in the A2C algorithm. We can significantly improve upon this by looking further ahead. Bootstrapping with one step: Looking back at the states-values-rewards diagram in Implementing A2C, we had state \(s_i\) transitioning into state \(s_{i+1}\) with an immediate reward \(R_i\). How we actually implemented bootstrapping was subtly different and better described by this diagram: (States and rewards diagram for own states \(s_i\) and opponent states \(s'_i\).) ...
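
For reference, a generic sketch of the multi-step idea: accumulate discounted rewards for n steps and then bootstrap from the value of the state reached. This ignores the own/opponent sign handling the post deals with; the names are assumptions, not the post’s actual code.

def n_step_target(rewards, values, i, n, gamma=1.0):
    # rewards[k] is R_k, values[k] is v(s_k), aligned per step
    T = len(rewards)
    target, discount = 0.0, 1.0
    for k in range(i, min(i + n, T)):
        target += discount * rewards[k]
        discount *= gamma
    if i + n < T:                       # bootstrap from the value n steps ahead
        target += discount * values[i + n]
    return target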

May 11, 2025 · cfh

Implementing A2C

In the previous post, we outlined the general concept of Actor-Critic algorithms and A2C in particular; it’s time to implement a simple version of A2C in PyTorch. Changing the reward function: As we noted, our model class doesn’t need to change at all: it already has the requisite value head we introduced when we implemented REINFORCE with baseline. First off, we need to change the way the rewards are computed. We introduce a flag BOOTSTRAP_VALUE which is on when we use A2C. Based on this, we compute the rewards vector for a game with an outcome of +1 for a win, 0 for a draw, and -1 for a loss like this: ...
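
For reference, a sketch of the one-step bootstrapped targets and advantages this sets up (assumed details, not the post’s actual code):

import torch

def a2c_targets(rewards, values, gamma=1.0):
    # rewards, values: 1-D tensors aligned per move; values[i] = v(s_i) from the critic
    next_values = torch.cat([values[1:], values.new_zeros(1)]).detach()
    targets = rewards + gamma * next_values    # terminal state bootstraps to 0
    advantages = targets - values              # critic error drives the actor update
    return targets, advantages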

May 10, 2025 · cfh

Implementing and Evaluating REINFORCE with Baseline

Having introduced REINFORCE with baseline on a conceptual level, let’s implement it for our Connect 4-playing CNN model. Runnable Example: connect-zero/train/example3-rwb.py. Adding the value head: In the constructor of the Connect4CNN model class, we set up the new network for estimating the board state value \(v(s)\), which will consume the same 448 downsampled features that the policy head receives: ...
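
For reference, a minimal sketch of such a value head on top of the 448 shared features (the hidden size and activations are assumptions, not the post’s actual architecture):

import torch.nn as nn

value_head = nn.Sequential(
    nn.Linear(448, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Tanh(),   # squashes v(s) into [-1, 1], matching win/draw/loss outcomes
)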

May 1, 2025 · cfh

Model Design for Connect 4

With the fearsome RandomPunisher putting our first Connect 4 toy model in its place, it’s time to design something that stands a chance. A design based on CNNs: It’s standard practice for board-game playing neural networks to have at least a few convolutional neural network (CNN) layers at the initial inputs. This shouldn’t come as a surprise: the board is a regular grid, much like an image, and CNNs are strong performers in image processing. In our case, it will allow the model to learn features like “here are three of my pieces in a diagonal downward row” which are then automatically applied to every position on the board, rather than having to re-learn these features individually at each board position. ...
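
For reference, a minimal sketch of a CNN front end over the 6x7 board (channel counts and kernel sizes are assumptions, not the post’s actual architecture):

import torch
import torch.nn as nn

conv_stack = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),  # board treated as a 1-channel "image"
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.zeros(1, 1, 6, 7)   # batch of one empty board
features = conv_stack(x)      # shape (1, 64, 6, 7); the same filters apply at every position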

April 28, 2025 · cfh

Introducing a Benchmark Opponent

Last time we saw how the entropy bonus enables self-play training without running into policy collapse. However, the model we trained was quite small and probably not capable of very strong play. Before we dive into the details of an improved model architecture, it would be very helpful to have a decent, fixed benchmark to gauge our progress. A benchmark opponent: The only model with fixed performance we have right now is the RandomPlayer from the basic setup post. Obviously, that’s not a challenging bar to clear. But it turns out that with some small tweaks, we can turn the fully random player into a formidable opponent for our starter models. ...

April 26, 2025 · cfh

Entropy Regularization

Based on our discussion of entropy, our plan is to implement entropy regularization via an entropy bonus in our loss function. Runnable Example: connect-zero/train/example2-entropy.py. Implementing the entropy bonus: The formula for entropy which we have to implement, ...
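
The standard entropy formula is \(H(p) = -\sum_a p(a)\,\log p(a)\). For reference, a sketch of how an entropy bonus can be computed from raw policy logits (the coefficient and names below are assumptions, not the post’s actual code):

import torch
import torch.nn.functional as F

ENTROPY_COEF = 0.01   # assumed bonus coefficient

def entropy_bonus(logits):
    # logits: (batch, 7) raw policy outputs
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    return ENTROPY_COEF * entropy   # add to the objective, i.e. subtract from the loss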

April 24, 2025 · cfh

A First Training Run and Policy Collapse

With the REINFORCE algorithm under our belt, we can finally attempt to start training some models for Connect 4. However, as we’ll see, there are still some hurdles in our way before we get anywhere. It’s good to set your expectations accordingly, because rarely if ever do things go smoothly the first time in RL. Runnable Example: connect-zero/train/example1-collapse.py. A simple MLP model: As the fruit fly of Connect 4-playing models, let’s start with a simple multilayer perceptron (MLP) model that follows the model protocol we outlined earlier: it has an input layer taking a 6x7 int8 board state tensor, a few simple hidden layers each consisting of just a linear layer and a ReLU activation function, and an output layer of 7 neurons without any activation function. That’s exactly what we meant earlier when we said that the model should output raw logits. ...
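
For reference, a minimal sketch of the kind of MLP described (hidden sizes are assumptions, not the post’s actual model):

import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(6 * 7, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 7),     # 7 raw logits, no activation
        )

    def forward(self, board):
        # board: (batch, 6, 7) int8 tensor, converted to float for the linear layers
        return self.net(board.float())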

April 21, 2025 · cfh

The REINFORCE Algorithm

Let’s say we have a Connect 4-playing model and we let it play a couple of games. (We haven’t really talked about model architecture yet, so for now just imagine a simple multilayer perceptron with a few hidden layers which outputs 7 raw logits, as discussed in the previous post.) As it goes in life, our model wins some and loses some. How do we make it actually learn from its experiences? How does the magic happen? ...
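
For reference, a generic sketch of the REINFORCE policy-gradient loss the post builds up to (not the post’s actual code): push up the log-probability of each chosen move, weighted by the return the game eventually produced.

import torch
import torch.nn.functional as F

def reinforce_loss(logits, actions, returns):
    # logits: (N, 7) model outputs for N recorded positions
    # actions: (N,) long tensor of column indices actually played
    # returns: (N,) outcome-based returns credited to each move
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(chosen * returns).mean()   # gradient ascent on expected return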

April 20, 2025 · cfh

Basic Setup and Play

Let’s get into a bit more technical detail on how our Connect 4-playing model will be set up, and how a basic game loop works. Throughout all code samples we’ll always assume the standard PyTorch imports:

import torch
import torch.nn as nn
import torch.nn.functional as F

Board state: The current board state will be represented by a 6x7 PyTorch int8 tensor, initially filled with zeros.

board = torch.zeros((ROWS, COLS), dtype=torch.int8, device=DEVICE)

The board is ordered such that board[0, :] is the top row. A non-empty cell is represented by +1 or -1. To simplify things, we always represent the player whose move it currently is by +1, and the opponent by -1. This way we don’t need any separate state to keep track of whose move it is. After a move has been made, we simply flip the board by doing ...
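
For reference, a small sketch of this representation; flipping the board by negating it is an assumption implied by the +1/-1 convention, not necessarily the post’s exact code:

import torch

ROWS, COLS, DEVICE = 6, 7, "cpu"   # assumed constants for this sketch

board = torch.zeros((ROWS, COLS), dtype=torch.int8, device=DEVICE)
board[5, 3] = 1    # current player drops a piece in column 3 (board[5, :] is the bottom row)
board = -board     # assumed flip: switch perspective so the player to move is again +1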

April 20, 2025 · cfh