In this article, we will look at OpenAI's Proximal Policy Optimization (PPO) algorithm for reinforcement learning. After covering some basic theory, we will implement PPO with TensorFlow 2.x. Before reading further, I recommend taking a look at the Actor-Critic method from here, as we will be modifying the code from that article for PPO.

Why PPO?

  1. Unstable Policy Updates: In many policy gradient methods, policy updates are unstable. A large step size can produce a bad policy, and when that bad policy is then used to collect experience, learning degrades even further. A small step size, on the other hand, makes learning slow.
  2. Data Inefficiency: Many policy gradient methods learn only from the current batch of experience and discard it after a single gradient update. This makes learning slow, since neural networks need a lot of data to train.

PPO comes in handy for overcoming both of these issues.
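Both issues are addressed by PPO's clipped surrogate objective: the probability ratio between the new and old policies lets us reuse the same batch of experience for several gradient updates, while clipping that ratio keeps each update from moving the policy too far. As a minimal sketch (the function name, signature, and `clip_eps` default are illustrative, not from the article's code):

```python
import tensorflow as tf

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = tf.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] bounds how far one update can move the policy.
    clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the pessimistic (minimum) surrogate; we return its negative to minimize.
    return -tf.reduce_mean(tf.minimum(unclipped, clipped))
```

Because the ratio corrects for the mismatch between the old policy that collected the data and the new policy being optimized, this loss can safely be minimized for multiple epochs over the same rollout, which is where the data-efficiency gain comes from.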

