At the root of all the sophisticated actor-critic algorithms designed and applied these days lies the vanilla policy gradient algorithm, which is essentially an actor-only algorithm. Nowadays, the actor that learns the decision-making policy is typically represented by a neural network. In continuous control problems, this network outputs the parameters of a probability distribution from which appropriate actions are sampled.

With so many deep reinforcement learning algorithms in circulation, you’d expect it to be easy to find abundant plug-and-play TensorFlow implementations of a basic actor network for continuous control, but this is hardly the case. There are several possible reasons. First, TensorFlow 2.0 was released only in September 2019 and differs quite substantially from its predecessor. Second, most implementations focus on discrete action spaces rather than continuous ones. Third, many of the implementations in circulation are tailored to specific problem settings and do not transfer easily. It can be a tad frustrating to plow through several hundred lines of code riddled with placeholders and class members, only to find out that the approach is not suitable for your problem after all. This article, based on our ResearchGate note [1], provides a minimal working example that functions in TensorFlow 2.0. We will show that the real magic happens in only three lines of code!

Some mathematical background

In this article, we present a simple and generic implementation of an actor network in the context of the vanilla policy gradient algorithm REINFORCE [2]. In the continuous variant, we usually draw actions from a Gaussian distribution; the goal is to learn an appropriate mean μ and standard deviation σ. The actor network learns and outputs these parameters.
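For reference, the Gaussian policy density and the REINFORCE policy gradient it enters into can be written as follows (this is standard material going back to [2]; θ denotes the network weights and G_t the observed return):

```latex
% Gaussian policy: density of action a in state s, parameterized by the network weights \theta
\pi_\theta(a \mid s) = \frac{1}{\sigma_\theta(s)\sqrt{2\pi}}
  \exp\!\left( -\frac{\bigl(a - \mu_\theta(s)\bigr)^2}{2\,\sigma_\theta(s)^2} \right)

% REINFORCE: ascend the gradient of the expected return, weighting the
% log-density of each sampled action by the observed return G_t
\nabla_\theta J(\theta) =
  \mathbb{E}_{\pi_\theta}\!\left[ G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```

In words: the update increases the log-likelihood of actions that yielded high returns, which is precisely what the custom loss function hinted at above has to encode.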

Let’s formalize this actor network a bit more. The input is the state s or a feature array ϕ(s), followed by one or more hidden layers that transform the input; the output is μ and σ. Given this output, an action a is drawn at random from the corresponding Gaussian distribution. Thus, we have a = μ(s) + σ(s)·ξ, where ξ ∼ 𝒩(0,1).
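Below is a minimal sketch of what such an actor network could look like in TensorFlow 2.x, assuming a single hidden layer and a one-dimensional action; the helper name build_actor, the layer sizes and the softplus activation for σ are illustrative choices rather than part of the original note [1]:

```python
import tensorflow as tf

def build_actor(state_dim: int, n_hidden: int = 64) -> tf.keras.Model:
    """Map a state (or feature array) to the Gaussian parameters mu and sigma."""
    state_input = tf.keras.layers.Input(shape=(state_dim,))
    hidden = tf.keras.layers.Dense(n_hidden, activation="relu")(state_input)
    mu = tf.keras.layers.Dense(1, activation="linear")(hidden)
    # Softplus keeps the standard deviation strictly positive
    sigma = tf.keras.layers.Dense(1, activation="softplus")(hidden)
    return tf.keras.Model(inputs=state_input, outputs=[mu, sigma])

# Sample an action: a = mu(s) + sigma(s) * xi, with xi ~ N(0, 1)
actor = build_actor(state_dim=3)
state = tf.constant([[0.1, -0.4, 0.7]])   # dummy state, batch size 1
mu, sigma = actor(state)
xi = tf.random.normal(shape=mu.shape)
action = mu + sigma * xi
```

The softplus output layer guarantees σ > 0; learning log σ instead and exponentiating it is an equally common choice.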

