Entropy Regularization

Based on our discussion of entropy, our plan is to implement entropy regularization via an entropy bonus in our loss function. Implementing the entropy bonus: the formula for entropy we have to implement, \[ H(p) = -\sum_{i=1}^{C} p_i \log p_i, \] is simple enough: multiply the probability of each of the seven possible moves by its log-probability, sum, and negate. There is one numerical problem to worry about, though: masking out an illegal move \(i\) leads to a zero probability \(p_i=0\) and a log-probability \(\log p_i = -\infty\). Under the rules of IEEE 754 floating-point arithmetic, multiplying zero by \(\pm\infty\) is undefined and therefore yields NaN (not a number). For the entropy formula, however, the contribution of such a term should be 0. ...
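A minimal PyTorch sketch of one way to handle this, assuming raw `logits` over the seven columns and a boolean `legal_mask` (both names are assumptions for this sketch, not necessarily the ones used in the post):

```python
import torch
import torch.nn.functional as F

def masked_entropy(logits, legal_mask):
    """Entropy of the move distribution with illegal moves masked out.

    `logits`: raw model outputs of shape (7,) or (B, 7).
    `legal_mask`: boolean tensor of the same shape marking legal columns.
    """
    masked_logits = logits.masked_fill(~legal_mask, float("-inf"))
    logp = F.log_softmax(masked_logits, dim=-1)  # -inf at illegal columns
    p = logp.exp()                               # exactly 0 at illegal columns
    # 0 * (-inf) would be NaN, so zero the log-probs of illegal moves first;
    # their contribution to the entropy should be 0 anyway.
    logp = torch.where(legal_mask, logp, torch.zeros_like(logp))
    return -(p * logp).sum(dim=-1)
```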

April 24, 2025 · cfh

On Entropy

The last time, we ran our first self-play training loop on a simple MLP model and observed catastrophic policy collapse. Let’s first understand some of the math behind what happened, and then how to combat it. What is entropy? Given a probability distribution \(p=(p_1,\ldots,p_C)\) over a number of categories \(i=1,\ldots,C\), such as the distribution over the columns our Connect 4 model outputs for a given board state, entropy measures the “amount of randomness” and is defined as ...
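As a quick illustration (not part of the excerpt above): a uniform policy over the seven columns attains the maximum entropy, while a fully collapsed policy, like the one observed after training, has zero entropy:

\[ H\!\left(\tfrac{1}{7},\ldots,\tfrac{1}{7}\right) = -\sum_{i=1}^{7}\tfrac{1}{7}\log\tfrac{1}{7} = \log 7 \approx 1.95, \qquad H(1,0,\ldots,0) = 0. \]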

April 23, 2025 · cfh

A First Training Run and Policy Collapse

With the REINFORCE algorithm under our belt, we can finally attempt to start training some models for Connect 4. However, as we’ll see, there are still some hurdles in our way before we get anywhere. It’s good to set your expectations accordingly, because things rarely if ever go smoothly the first time in RL. A simple MLP model: as the fruit fly of Connect 4-playing models, let’s start with a simple multilayer perceptron (MLP) that follows the model protocol we outlined earlier: it has an input layer taking a 6x7 int8 board state tensor, a few simple hidden layers consisting of just a linear layer and a ReLU activation each, and an output layer of 7 neurons without any activation function—that’s exactly what we meant earlier when we said the model should output raw logits. ...
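A minimal sketch of an MLP with this shape; the number and width of the hidden layers here are illustrative assumptions, not the values from the post:

```python
import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    """Simple MLP: 6x7 board in, 7 raw logits out (one per column)."""

    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),              # (B, 6, 7) board batch -> (B, 42)
            nn.Linear(6 * 7, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 7),      # 7 raw logits, no activation
        )

    def forward(self, board):
        # The board arrives as an int8 tensor; cast to float for the linear layers.
        return self.net(board.float())
```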

April 21, 2025 · cfh

The REINFORCE Algorithm

Let’s say we have a Connect 4-playing model and we let it play a couple of games. (We haven’t really talked about model architecture yet, so for now just imagine a simple multilayer perceptron with a few hidden layers that outputs 7 raw logits, as discussed in the previous post.) As it goes in life, our model wins some and loses some. How do we make it actually learn from its experiences? How does the magic happen? ...
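The heart of the answer, in a minimal sketch of the standard REINFORCE objective (the variable names below are assumptions for this sketch, not code from the post):

```python
import torch

def reinforce_loss(log_probs, returns):
    """Standard REINFORCE objective: weight each move's log-probability by the
    return credited to it, and minimize the negative sum.

    log_probs: log pi(a_t | s_t) for each move played, shape (T,)
    returns:   the (possibly discounted) game outcome credited to each move, shape (T,)
    """
    return -(log_probs * returns).sum()
```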

April 20, 2025 · cfh

Basic Setup and Play

Let’s get into a bit more technical detail on how our Connect 4-playing model will be set up, and how a basic game loop works. Throughout all code samples we’ll always assume the standard PyTorch imports: `import torch`, `import torch.nn as nn`, `import torch.nn.functional as F`. Board state: the current board state will be represented by a 6x7 PyTorch int8 tensor, initially filled with zeros: `board = torch.zeros((ROWS, COLS), dtype=torch.int8, device=DEVICE)`. The board is ordered such that `board[0, :]` is the top row. A non-empty cell is represented by +1 or -1. To simplify things, we always represent the player whose move it currently is by +1, and the opponent by -1. This way we don’t need any separate state to keep track of whose move it is. After a move has been made, we simply flip the board by doing ...
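The post’s own move-and-flip code is cut off in the excerpt above; here is one hypothetical way such a step could look under the +1/-1 convention just described (an illustration only, not the post’s code):

```python
import torch

def make_move(board, col):
    """Drop a piece for the player to move (+1) into column `col`, then flip
    the perspective so the opponent becomes +1. Assumes `col` is not full.
    """
    empty_rows = (board[:, col] == 0).nonzero()  # board[0, :] is the top row
    row = empty_rows.max()                       # lowest empty cell in the column
    board = board.clone()
    board[row, col] = 1
    return -board                                # flip: next player is +1 again
```

Used as `board = make_move(board, col=3)`, so the caller always sees the board from the perspective of the player about to move.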

April 20, 2025 · cfh

Connect-Zero: Reinforcement Learning from Scratch

For a long time I’ve wanted to get deeper into reinforcement learning (RL), and the project I finally settled on is teaching a neural network model how to play the classic game Connect 4 (pretty sneaky, sis!). Obviously, the name “Connect-Zero” is a cheeky nod to AlphaGo Zero and AlphaZero by DeepMind. I chose Connect 4 because it’s a simple game that everyone knows how to play, and one where we can hope to achieve good results without expensive hardware or high training costs. ...

April 20, 2025 · cfh

Connect 4

The computer opponent is a neural network trained using reinforcement learning. It was exported to ONNX and now runs right here in your browser. See Connect-Zero and the follow-up posts for details.

April 20, 2025 · cfh