Proximal Policy Optimization (PPO)

The next step after implementing A2C is to move on to Proximal Policy Optimization (PPO). Introduced by OpenAI researchers in 2017, it has since become a very popular RL algorithm. It can be understood as a simplified variant of Trust Region Policy Optimization (TRPO), and one of its main advantages is improved sample efficiency: although it is an on-policy algorithm, it can robustly learn multiple times from a batch of generated samples, unlike A2C and REINFORCE. ...
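At the heart of PPO is the clipped surrogate objective from the original paper. A minimal PyTorch sketch of the policy loss, with illustrative variable names and a typical clip range of 0.2 (the post's actual hyperparameters may differ):

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # probability ratio between the current policy and the policy that generated the data
    ratio = torch.exp(new_log_probs - old_log_probs)
    # clipped surrogate objective: take the pessimistic (minimum) of the two terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```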

May 25, 2025 · cfh

Multi-Step Bootstrapping

Until now, we’ve done one-step lookahead for the TD bootstrapping in the A2C algorithm. We can significantly improve upon this by looking further ahead. Bootstrapping with one step: looking back at the states-values-rewards diagram in Implementing A2C, we had state \(s_i\) transitioning into state \(s_{i+1}\) with an immediate reward \(R_i\). How we actually implemented bootstrapping was subtly different and better described by this diagram: [Figure: states and rewards diagram for own states \(s_i\) and opponent states \(s'_i\).] ...
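For reference, a minimal sketch of n-step bootstrapped returns, \(G_t = \sum_{k=0}^{n-1} \gamma^k R_{t+k} + \gamma^n v(s_{t+n})\); it ignores the own/opponent-state alternation the post deals with, and the variable names are illustrative:

```python
import torch

def n_step_returns(rewards, values, gamma=0.99, n=4):
    """rewards[t] and values[t] are aligned with state s_t; values[t] is v(s_t)."""
    T = len(rewards)
    returns = torch.zeros(T)
    for t in range(T):
        G, discount = 0.0, 1.0
        for k in range(n):
            if t + k >= T:          # episode ended before n steps
                break
            G += discount * rewards[t + k]
            discount *= gamma
        # bootstrap from the value of the state n steps ahead, if it exists
        if t + n < T:
            G += discount * values[t + n]
        returns[t] = G
    return returns
```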

May 11, 2025 · cfh

Actor-Critic Algorithms

After implementing and evaluating REINFORCE with baseline, we found that it can produce strong models, but takes a long time to learn an accurate value function due to the high variance of the Monte Carlo samples. In this post, we’ll look at Actor-Critic methods, and in particular the Advantage Actor-Critic (A2C) algorithm, a synchronous version of the earlier Asynchronous Advantage Actor-Critic (A3C) method, as a way to remedy this. Before we start, recall that we introduced a value network as a component of our model; this remains the same for A2C, and in fact we don’t need to modify the network architecture at all to use this newer algorithm. Our model still consists of a residual CNN backbone, a policy head and a value head. The value head serves as the “critic,” whereas the policy head is the “actor”. ...
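Actor and critic are trained jointly from the same batch. A minimal sketch of a typical A2C loss; the weighting coefficients are illustrative assumptions, not the post's actual values:

```python
import torch
import torch.nn.functional as F

def a2c_loss(log_probs, values, returns, entropy, value_coef=0.5, entropy_coef=0.01):
    # advantage: how much better the taken action turned out than the critic expected
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()   # actor term
    value_loss = F.mse_loss(values, returns)         # critic term
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```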

May 8, 2025 · cfh

REINFORCE with Baseline

In the previous post, we introduced a stronger model but observed that it’s quite challenging to achieve a high level of play with basic REINFORCE, due to the high variance and noisy gradients of the algorithm, which often lead to unstable learning and slow convergence. Our first step towards more advanced algorithms is a modification called “REINFORCE with baseline” (see, e.g., Sutton et al. (2000)). The value network: given a board state \(s\), recall that our model currently outputs seven raw logits which are then transformed via softmax into the probability distribution \(p(s)\) over the seven possible moves. Many advanced algorithms in RL assume that our network also outputs a second piece of information: the value \(v(s)\), a number between -1 and 1 which, roughly speaking, gives an estimate of how confident the model is in winning from the current position. ...
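A two-headed network of this kind can be sketched as follows; the layer sizes and the use of tanh to keep \(v(s)\) in \([-1, 1]\) are illustrative assumptions, not necessarily the post's exact architecture:

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(6 * 7, hidden), nn.ReLU()
        )
        self.policy_head = nn.Linear(hidden, 7)    # raw logits, one per column
        self.value_head = nn.Linear(hidden, 1)     # scalar value estimate

    def forward(self, board):                      # board: (batch, 6, 7) float tensor
        h = self.backbone(board)
        logits = self.policy_head(h)
        value = torch.tanh(self.value_head(h)).squeeze(-1)   # squashed into (-1, 1)
        return logits, value
```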

April 29, 2025 · cfh

On Entropy

The last time, we ran our first self-play training loop on a simple MLP model and observed catastrophic policy collapse. Let’s first understand some of the math behind what happened, and then how to combat it. What is entropy? Given a probability distribution \(p=(p_1,\ldots,p_C)\) over a number of categories \(i=1,\ldots,C\), such as the distribution over the columns our Connect 4 model outputs for a given board state, entropy measures the “amount of randomness” and is defined as ...
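The standard definition is \(H(p) = -\sum_{i=1}^{C} p_i \log p_i\) (commonly with the natural logarithm; the post's convention may differ). In PyTorch it can be computed directly from the raw logits, a minimal sketch:

```python
import torch
import torch.nn.functional as F

def policy_entropy(logits):
    # log-softmax is numerically safer than softmax followed by log
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)   # one entropy value per board state
```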

April 23, 2025 · cfh

A First Training Run and Policy Collapse

With the REINFORCE algorithm under our belt, we can finally attempt to start training some models for Connect 4. However, as we’ll see, there are still some hurdles in our way before we get anywhere. It’s good to set your expectations accordingly, because rarely if ever do things go smoothly the first time in RL. Runnable example: connect-zero/train/example1-collapse.py. A simple MLP model: as a fruit fly of Connect 4-playing models, let’s start with a simple multilayer perceptron (MLP) model that follows the model protocol we outlined earlier. That means it has an input layer taking a 6x7 int8 board state tensor, a few simple hidden layers consisting of just a linear layer and a ReLU activation function each, and an output layer of 7 neurons without any activation function: that’s exactly what we meant earlier when we said that the model should output raw logits. ...
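Such a model might look roughly like this; the hidden layer width is an illustrative assumption, and the actual runnable example (connect-zero/train/example1-collapse.py) may differ in detail:

```python
import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                         # 6x7 board -> 42 inputs
            nn.Linear(6 * 7, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 7),                 # raw logits, no activation
        )

    def forward(self, board):
        # the board arrives as an int8 tensor; linear layers need floats
        return self.net(board.float())
```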

April 21, 2025 · cfh

The REINFORCE Algorithm

Let’s say we have a Connect 4-playing model and we let it play a couple of games. (We haven’t really talked about model architecture until now, so for now just imagine a simple multilayer perceptron with a few hidden layers which outputs 7 raw logits, as discussed in the previous post.) As it goes in life, our model wins some and loses some. How do we make it actually learn from its experiences? How does the magic happen? ...
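The post's answer is the REINFORCE policy gradient: push up the log-probabilities of moves from won games and push down those from lost ones. A minimal sketch of the core update for one finished game, assuming a simple ±1 win/loss signal and illustrative variable names:

```python
import torch
import torch.nn.functional as F

def reinforce_loss(logits, actions, outcome, gamma=1.0):
    """logits: (T, 7) model outputs for the T moves of one game,
    actions: (T,) columns actually played, outcome: +1 for a win, -1 for a loss."""
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs[torch.arange(len(actions)), actions]
    # every move receives the (optionally discounted) final outcome as its return
    T = len(actions)
    returns = outcome * gamma ** torch.arange(T - 1, -1, -1, dtype=torch.float32)
    return -(chosen * returns).sum()
```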

April 20, 2025 · cfh

Preparing the Data

With the triangle self-intersection algorithm ready to go, we can start gathering the training data for our machine learning setup. But first we have to think about how exactly we want to represent it. Canonical triangles: the curved triangles we work with are specified by six 3D vectors, so that would mean 18 floating point numbers as our input data. But an important insight is that whether a triangle intersects itself doesn’t change when we rotate it, translate it, or uniformly scale it: it’s well known that affine transformations of spline control points result in affine transformations of the surface itself. ...
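One way to exploit this invariance is to normalize each triangle before feeding it to the network. The canonical form the post actually settles on isn't shown in this excerpt; the sketch below only illustrates the idea by removing translation and uniform scale from the control points:

```python
import numpy as np

def canonicalize(control_points):
    """control_points: (6, 3) array of the curved triangle's 3D control points.
    Removes translation and uniform scale; rotation could be fixed similarly."""
    pts = np.asarray(control_points, dtype=float)
    pts = pts - pts.mean(axis=0)                  # translate centroid to the origin
    scale = np.linalg.norm(pts, axis=1).max()
    if scale > 0:
        pts = pts / scale                         # uniform scale to unit size
    return pts
```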

April 9, 2025 · cfh

Getting Accurate Intersections with Gauss-Newton

In the last post, we found pairs of subtriangles of our curved triangle which intersect. The subtriangles were linear approximations, which means that the intersection points we found are also only approximate. This might be good enough for our purposes, but in the interest of getting training data that’s as accurate as possible, we will refine these intersections by projecting them onto the exact curved triangle. To be precise, we are looking for two distinct parameter pairs \((u_1, v_1)\) and \((u_2, v_2)\) within the triangle’s domain such that their mappings coincide, ...
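A Gauss-Newton iteration for this works on the residual \(F(u_1,v_1,u_2,v_2) = S(u_1,v_1) - S(u_2,v_2)\), which has three components but four unknowns, so each step is a minimum-norm least-squares solve. A sketch assuming callables S (the curved triangle's surface map) and J (its 3x2 Jacobian), both hypothetical names rather than the post's code:

```python
import numpy as np

def refine_intersection(S, J, x0, iters=10, tol=1e-12):
    """Refine an approximate self-intersection (u1, v1, u2, v2) by Gauss-Newton."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        u1, v1, u2, v2 = x
        r = S(u1, v1) - S(u2, v2)                 # residual in R^3
        if np.dot(r, r) < tol:
            break
        # 3x4 Jacobian of the residual with respect to (u1, v1, u2, v2)
        Jr = np.hstack([J(u1, v1), -J(u2, v2)])
        # minimum-norm least-squares step for the underdetermined system
        step, *_ = np.linalg.lstsq(Jr, -r, rcond=None)
        x = x + step
    return x
```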

April 8, 2025 · cfh

Computing Self-Intersections, the Geometric Way

Before we can apply ML to the triangle problem, we need to be able to compute self-intersections of a curved triangle in an accurate and efficient way so that we can generate enough training data. The basic approach is:

1. Subdivide the curved triangle into smaller subtriangles
2. Find potentially intersecting pairs of subtriangles
3. Check for actual intersections among these candidate pairs

Subdividing the triangle: we split the original triangle into a list of sufficiently flat subtriangles by a simple recursive procedure, starting with the full triangle {(0,0), (1,0), (0,1)}: ...
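A minimal sketch of such a recursive subdivision of the parameter domain, splitting each triangle at its edge midpoints into four children; the flatness test is left as an assumed callable, since the post's actual criterion isn't shown in this excerpt:

```python
def subdivide(tri, is_flat, depth=0, max_depth=8):
    """tri is a tuple of three (u, v) parameter-domain corners."""
    if depth >= max_depth or is_flat(tri):
        return [tri]
    a, b, c = tri
    # edge midpoints in the parameter domain
    ab = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    bc = ((b[0] + c[0]) / 2, (b[1] + c[1]) / 2)
    ca = ((c[0] + a[0]) / 2, (c[1] + a[1]) / 2)
    subs = []
    for child in [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]:
        subs.extend(subdivide(child, is_flat, depth + 1, max_depth))
    return subs

# start from the full domain triangle:
# flat_tris = subdivide(((0, 0), (1, 0), (0, 1)), is_flat)
```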

April 7, 2025 · cfh