PPO | cfh::blog

The next step after implementing A2C is to move on to Proximal Policy Optimization (PPO). Introduced in a 2017 paper by OpenAI researchers, it has since become a very popular RL algorithm. It can be understood as a simplified variant of Trust Region Policy Optimization (TRPO), and one of its main advantages is improved sample efficiency: although it is an on-policy algorithm, it can robustly perform several updates on the same batch of generated samples, whereas A2C and REINFORCE use each batch for a single update. ...
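To make the "learn multiple times from a batch" claim a bit more concrete, here is a minimal sketch of the clipped surrogate objective from the PPO paper, written in plain Python. The function name `clipped_surrogate`, its argument names, and the default `clip_eps=0.2` are my own choices for illustration, not taken from any particular library. The idea is that once the new policy's probability ratio drifts outside `[1 - clip_eps, 1 + clip_eps]` in the direction that would improve the objective, the sample stops contributing further gain, which is what makes repeated updates on the same batch reasonably safe.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    logp_new:   log-probabilities of the taken actions under the current policy
    logp_old:   log-probabilities under the policy that generated the batch
    advantages: advantage estimates for the same samples
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(lp_new - lp_old)
        # Ratio restricted to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic choice: take the smaller of the two candidate objectives.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

If the policy has not changed since the batch was collected, every ratio is 1 and the objective is simply the mean advantage. If a sample with positive advantage has already become twice as likely, its contribution is capped at `(1 + clip_eps) * adv`, so further epochs on the same batch gain nothing by pushing that ratio even higher.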