Multi-Step Bootstrapping

Until now, we’ve done one-step lookahead for the TD bootstrapping in the A2C algorithm. We can significantly improve upon this by looking further ahead. Bootstrapping with one step: Looking back at the states-values-rewards diagram in Implementing A2C, we had state \(s_i\) transitioning into state \(s_{i+1}\) with an immediate reward \(R_i\). How we actually implemented bootstrapping was subtly different and better described by this diagram:

[Figure: States and rewards diagram for own states \(s_i\) and opponent states \(s'_i\).] ...
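To make the idea concrete before diving into the post: below is a minimal sketch of how an n-step bootstrapped target could be computed for a sequence of states. The function name `n_step_targets` and the arguments `rewards`, `values`, `gamma` and `n` are illustrative assumptions, not taken from the post itself.

```python
import torch

def n_step_targets(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.9, n: int = 3) -> torch.Tensor:
    # rewards[i]: immediate reward R_i received after the move in state s_i
    # values[i]:  the critic's estimate v(s_i)
    T = rewards.shape[0]
    targets = torch.zeros_like(rewards)
    for i in range(T):
        target, discount = 0.0, 1.0
        # Sum up to n discounted real rewards...
        for k in range(i, min(i + n, T)):
            target = target + discount * rewards[k]
            discount *= gamma
        # ...then bootstrap from the value estimate n steps ahead,
        # if the episode hasn't ended before that point.
        if i + n < T:
            target = target + discount * values[i + n]
        targets[i] = target
    return targets
```

With `n = 1` this reduces to the one-step target \(R_i + \gamma\, v(s_{i+1})\); larger `n` trades bootstrap bias for more Monte Carlo variance.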

May 11, 2025 · cfh

Evaluating A2C versus REINFORCE with baseline

With our implementation of A2C ready to go, let’s see it in action. Runnable example: connect-zero/train/example4-a2c.py. The setup: let’s set the ground rules: ...

May 11, 2025 · cfh

Implementing A2C

In the previous post, we outlined the general concept of Actor-Critic algorithms and A2C in particular; it’s time to implement a simple version of A2C in PyTorch. Changing the reward function: As we noted, our model class doesn’t need to change at all: it already has the requisite value head we introduced when we implemented REINFORCE with baseline. First off, we need to change the way the rewards are computed. We introduce a flag BOOTSTRAP_VALUE which is on when we use A2C. Based on this, we compute the rewards vector for a game with an outcome of +1 for a win, 0 for a draw, and -1 for a loss like this: ...
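The excerpt cuts off before the code itself; here is a rough sketch of how such a switch might steer the reward computation. The function `game_rewards`, the `GAMMA` constant, and the zeros-except-final-move convention for the bootstrapped case are assumptions on my part, not the post’s actual implementation.

```python
import torch

BOOTSTRAP_VALUE = True   # flag name taken from the post's description
GAMMA = 0.9              # assumed discount factor

def game_rewards(num_moves: int, outcome: float) -> torch.Tensor:
    """Per-move reward vector for one game; outcome is +1 (win), 0 (draw), -1 (loss)."""
    if BOOTSTRAP_VALUE:
        # With bootstrapping, only the final move carries the game outcome;
        # earlier moves get their learning signal from the critic's value
        # of the successor state instead of a discounted final reward.
        rewards = torch.zeros(num_moves)
        rewards[-1] = outcome
    else:
        # Monte Carlo style (REINFORCE): every move receives the outcome,
        # discounted by how far it is from the end of the game.
        rewards = outcome * GAMMA ** torch.arange(num_moves - 1, -1, -1,
                                                  dtype=torch.float32)
    return rewards
```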

May 10, 2025 · cfh

Actor-Critic Algorithms

After implementing and evaluating REINFORCE with baseline, we found that it can produce strong models, but takes a long time to learn an accurate value function due to the high variance of the Monte Carlo samples. In this post, we’ll look at Actor-Critic methods, and in particular the Advantage Actor-Critic (A2C) algorithm¹, a synchronous version of the earlier Asynchronous Advantage Actor-Critic (A3C) method, as a way to remedy this. Before we start, recall that we introduced a value network as a component of our model; this remains the same for A2C, and in fact we don’t need to modify the network architecture at all to use this newer algorithm. Our model still consists of a residual CNN backbone, a policy head and a value head. The value head serves as the “critic”, whereas the policy head is the “actor”. ...
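For readers who want to picture that layout, here is a minimal, self-contained sketch of a network with a shared backbone, a policy head (actor) and a value head (critic). The class name, layer sizes, and the plain-conv stand-in for the residual backbone are illustrative assumptions, not the post’s actual model.

```python
import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    """Shared backbone feeding a policy head (actor) and a value head (critic)."""

    def __init__(self, channels: int = 64, board_h: int = 6, board_w: int = 7):
        super().__init__()
        # Stand-in for the residual CNN backbone: two plain conv layers.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        flat = channels * board_h * board_w
        # Actor: logits over the board_w columns a piece can be dropped into.
        self.policy_head = nn.Linear(flat, board_w)
        # Critic: scalar value estimate of the current position.
        self.value_head = nn.Linear(flat, 1)

    def forward(self, board: torch.Tensor):
        x = self.backbone(board).flatten(start_dim=1)
        return self.policy_head(x), self.value_head(x).squeeze(-1)
```

Calling `ActorCriticNet()(torch.zeros(1, 1, 6, 7))` returns a 7-way vector of move logits (a distribution after softmax) and a scalar value for the position.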

May 8, 2025 · cfh