Last time we saw how the entropy bonus enables self-play training without running into policy collapse. However, the model we trained was quite small and probably not capable of very strong play. Before we dive into the details of an improved model architecture, it would be very helpful to have a decent, fixed benchmark to gauge our progress.
A benchmark opponent
The only model with fixed performance we have right now is the RandomPlayer from the basic setup post. Obviously, that’s not a challenging bar to clear. But it turns out that with some small tweaks, we can turn the fully random player into a formidable opponent for our starter models.
The algorithm is very simple:
- If there is a winning move on the board, play it.
- If not, and if the opponent has a winning move, block it.
- Otherwise, play a random move.
We call this model the RandomPunisher since, although it doesn’t have any concept of strategy, it will ruthlessly punish tactical mistakes.
Here’s an implementation of that idea in PyTorch:
import torch
from torch import nn

# make_move_and_check comes from the earlier post and is assumed to be in scope.

@torch.jit.script
def find_best_move(board):
    """
    Finds a winning move for the current player (represented as +1) in the
    given board state of shape (R, C). If there is none, checks for potential
    winning moves of the opponent and blocks them. Otherwise, moves randomly.
    Returns logits of shape (C,).
    """
    cols = board.shape[-1]
    # First look for our own winning moves, then for the opponent's (on -board).
    for B in (board, -board):
        for c in range(cols):
            # Check if the move is valid (top cell of the column is empty)
            if B[0, c] == 0:
                _, win = make_move_and_check(B, c)
                if win:
                    choice = torch.tensor(c, device=board.device)
                    logits = nn.functional.one_hot(choice, num_classes=cols).float()
                    # Avoid log(0) by adding a small epsilon
                    return torch.log(logits + 1e-12)
    # No winning or blocking move found: uniform logits, i.e. a random move.
    return torch.zeros((cols,), dtype=torch.float32, device=board.device)
class RandomPunisher(nn.Module):
    """Plays a winning or blocking move if available, otherwise plays a random move."""

    def forward(self, x):
        # Store original shape and determine batch size
        original_shape = x.shape
        if x.ndim == 2:  # Single board state (R, C)
            x = x.unsqueeze(0)  # Add a batch dimension: (1, R, C)
        batch_size = x.size(0)
        logits = torch.stack([find_best_move(x[i]) for i in range(batch_size)])
        if len(original_shape) == 2:
            logits = logits.squeeze(0)  # -> (C,)
        return logits
The function find_best_move is the core of the strategy; it calls make_move_and_check, which we already used in our earlier post, to check whether each valid move would result in a win. It does this first for the actual input board and, if no winning move is found, repeats the procedure for -board, which is the board from the opponent’s point of view; any winning move found there is a move we have to block.
For any move it chooses, it takes the log of a one-hot encoding of that move (plus a small epsilon), so that we end up with a logit of 0 for the chosen move and sufficiently large negative logits for the others. After softmax, this effectively results in probability 1 for the chosen move.
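To make that concrete, here is a small standalone check of the logit trick (a sketch assuming a 7-column board and column 3 as the chosen move, neither of which is fixed by the code above):

import torch

# Suppose find_best_move picked column 3 on a 7-column board.
one_hot = torch.nn.functional.one_hot(torch.tensor(3), num_classes=7).float()
logits = torch.log(one_hot + 1e-12)   # ~0.0 for column 3, ~-27.6 for the rest
probs = torch.softmax(logits, dim=-1)
print(probs)                          # essentially [0, 0, 0, 1, 0, 0, 0]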
The actual RandomPunisher module then simply calls this function in a loop for each board state in the input batch. This is not the most efficient implementation, but the use of the @torch.jit.script decorator, which tells PyTorch to just-in-time compile the function, goes a long way towards speeding up the otherwise slow Python loop.
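As a quick usage sketch (assuming the board encoding from the earlier posts, a 6x7 tensor with +1 for the current player’s pieces, -1 for the opponent’s and 0 for empty cells, and with make_move_and_check in scope):

punisher = RandomPunisher()
boards = torch.zeros((32, 6, 7))   # a batch of 32 empty boards
logits = punisher(boards)          # shape (32, 7)
# Sample one column index per board from the returned logits.
moves = torch.distributions.Categorical(logits=logits).sample()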
Despite the simplicity of this algorithm, it’s challenging for a model which doesn’t yet have a strong grasp of basic tactics to achieve a consistently positive win rate against this guy.
For instance, I continued the self-play loop for the SimpleMLPModel from the last post, and despite some attempts at tuning the hyperparameters (starting with a learning rate of 1e-3, then reducing to 1e-4; starting with an entropy bonus of 0.05, then decreasing to 0.03), it was difficult to get consistently beyond a 35% win rate against the RandomPunisher.
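For reference, the win rates quoted here come from head-to-head evaluation games. A minimal sketch of such an evaluation loop is shown below; play_game is a hypothetical stand-in for whatever game-playing routine the training loop already uses, assumed to return +1, 0, or -1 from the first player’s perspective:

from typing import Callable

import torch

def win_rate(model: torch.nn.Module, opponent: torch.nn.Module,
             play_game: Callable[[torch.nn.Module, torch.nn.Module], int],
             num_games: int = 1000) -> float:
    """Fraction of games won by model against opponent.

    play_game is a hypothetical helper: it should play one full game
    (ideally alternating which side moves first) and return +1 if its
    first argument wins, 0 for a draw, and -1 otherwise.
    """
    wins = 0
    for _ in range(num_games):
        if play_game(model, opponent) > 0:
            wins += 1
    return wins / num_games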
Presumably, the simple MLP model lacks the capacity for strong play, and we need a better design, which is the topic of the next post.