Recently, I needed to search for trivial policies in an environment. These policies were simple things, like always taking the same action, or taking the same action every _K_ steps, etc. In order to optimize these trivial policies as much as possible, I turned to the Optuna framework. But as I dove into their documentation, I found that they’ve got a nice set of optimizers, and I began to wonder: What if I searched for not-so-trivial policies?

Could I search for all of the parameters for one small neural network layer and get good performance? The answer, we’ll see, is yes… in some environments.

The Setup

I decided to keep it simple and experiment with two policies: one I’m calling a “dense Gaussian policy” and the other I’m calling a “dense policy”. I think these names are technically correct, but they can also sound pretty opaque, so I’ll explain what I mean. Check out my code at this link if you’re interested.

The Policies

Both policies are similar to a 1-layer neural network: we compute the output by taking the dot-product of the input with an array of weights, W. In this case, the input is the observations (S) from the environment, and the output is the action we want to take, so the weight array (W) has the shape (observation_dimension, action_dimension).

In the “dense policy”, the output of the dot product is the action, so we’re done! In the “dense Gaussian policy”, the dot product between S and _W_ produces a vector of shape action_dimension, which we use as the µ (mu, or mean) vector of a Gaussian distribution. We also store a standard deviation (σ) vector in the layer and then sample our actions from the resulting Gaussian distribution. Like this:

µ = np.dot(observations, weight_array)
action = µ + np.random.randn(env.action_space.shape[0]) * σ

I’ll break down the two lines of code above: In line 1, we take the dot product of the observations and W to get µ (the mean of our distribution). In line 2, we use NumPy’s randn function to sample from a unit Gaussian distribution (mean 0 and standard deviation 1). Multiplying that sample by σ scales its standard deviation to σ, and adding µ shifts its mean from 0 to µ.
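
If that shift-and-scale trick feels hand-wavy, here’s a quick sanity check: draw a pile of samples and look at their empirical mean and standard deviation. The µ and σ values below are made up just for the demo.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])    # pretend this came from np.dot(observations, weight_array)
sigma = np.array([0.1, 0.3])  # per-dimension standard deviations stored in the layer

samples = mu + rng.standard_normal((100_000, 2)) * sigma
print(samples.mean(axis=0))   # ≈ [ 0.5, -1.0 ]  -> mean shifted to µ
print(samples.std(axis=0))    # ≈ [ 0.1,  0.3 ]  -> spread scaled to σ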

So now, what am I optimizing? I used Optuna to directly optimize each parameter in the weight array. This approach scales really poorly, since Optuna is designed to optimize hyperparameters and it suggests one number at a time. That means you have to fill your weight array one entry at a time inside a for loop. But in environments with small vector observations, it’s fast enough. Just know that if you’re here looking for performant, well-written, scalable code, you’re in the wrong place! I cobbled all this together for fun and gave little thought to how good the code actually is. I decided to use CMA-ES as my optimization method.
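
Concretely, the whole thing boils down to something like the sketch below: fill W one suggest_float call at a time, roll out the dense policy for one episode, and return the total reward. The environment name, weight bounds, and trial count are placeholders rather than the exact values I used (my experiments ran on PyBullet environments), and I’m assuming the Gymnasium env API here.

import numpy as np
import optuna
import gymnasium as gym  # assumption: the maintained Gymnasium fork of OpenAI Gym

env = gym.make("Pendulum-v1")  # placeholder: any small continuous-control env works
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

def objective(trial):
    # Optuna suggests one scalar at a time, so W gets filled entry by entry.
    W = np.array([[trial.suggest_float(f"w_{i}_{j}", -1.0, 1.0) for j in range(act_dim)]
                  for i in range(obs_dim)])
    obs, _ = env.reset(seed=0)
    total_reward, done = 0.0, False
    while not done:
        action = np.dot(obs, W)  # the "dense policy": the dot product is the action
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    return total_reward

# The CMA-ES sampler needs the `cmaes` package installed (pip install cmaes).
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.CmaEsSampler())
study.optimize(objective, n_trials=300)
print(study.best_params)

Each trial is one full episode, so CMA-ES effectively treats the episode return as a noisy black-box objective over the individual weight entries.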

The Algorithm: CMA-ES

CMA-ES stands for “Covariance Matrix Adaptation Evolution Strategy”. It’s a black-box numerical optimization method. In fact, all evolution strategies are black-box optimization methods. The phrase “black-box” refers to the fact that these methods don’t require derivatives of an objective function in order to optimize it. They’re called “evolution strategies” because they loosely model biological evolution. They stochastically generate a population, evaluate the population on the objective function, and move the population-generating distribution towards the higher-performing members of the population.
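
That generate-evaluate-update loop is short enough to sketch in a few lines of NumPy. To be clear, this is a deliberately simplified evolution strategy, not real CMA-ES (which adapts a full covariance matrix and step size), but it shows the basic shape of the algorithm:

import numpy as np

def simple_es(objective, dim, pop_size=32, elite_frac=0.25, sigma=0.5, iters=200):
    """A bare-bones Gaussian evolution strategy (not CMA-ES: the covariance
    here is isotropic and only shrinks on a fixed schedule)."""
    rng = np.random.default_rng(0)
    mean = np.zeros(dim)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(iters):
        # 1. Stochastically generate a population around the current mean.
        population = mean + sigma * rng.standard_normal((pop_size, dim))
        # 2. Evaluate every member on the objective function.
        scores = np.array([objective(p) for p in population])
        # 3. Move the sampling distribution towards the best-performing members.
        elites = population[np.argsort(scores)[-n_elite:]]
        mean = elites.mean(axis=0)
        sigma *= 0.99  # crude stand-in for the adaptation CMA-ES does properly
    return mean

# Toy check: maximize -||x - 3||^2, whose optimum is x = [3, 3, 3].
print(simple_es(lambda x: -np.sum((x - 3.0) ** 2), dim=3))  # ≈ [3. 3. 3.]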

If you squint, this kind of mimics natural selection, where the fittest members of a population propagate their traits onwards to the next generation. CMA-ES can also change the variance of the population it generates to find a solution faster, and that’s what the gif at the top is showing. If you want to read more about this topic, David Ha has two great posts talking about evolution strategies and using them in RL environments. I also recommend reading the Wikipedia page. If you google around, you’ll find a bunch of resources about the CMA-ES algorithm and evolution strategies in general.

The Results

This is a mixed bag of results; some great, some not so great. Working with one layer with shape (observation_dimension, action_dimension) really forces us to work with limited representational capacity. Since the policy has so few parameters, it can’t represent very complex relationships. Surprisingly, though, it can still succeed in chaotic and nonlinear environments.

Successes

Check out the inverted double pendulum. All of the 3D robotics environments here are courtesy of PyBullet.

Even though this is a highly unstable and nonlinear system, the optimizer was still able to find a set of weights that can stably balance the pendulum. The policy here gets about 9,300 reward; I believe the maximum possible is 10,000. So it’s not perfect performance, but it’s still really good. I think this is super cool because it’s literally controlled by a 9-parameter fully-connected layer, _and_ that layer is optimized using a black-box (i.e. derivative-free) optimizer. Compared to the massive networks we’re used to seeing, with thousands, millions, and sometimes billions of parameters, it feels almost shocking that so much can be done with 9 parameters.

Just above here is the inverted pendulum swing-up problem. Again, this is an unstable, nonlinear problem. The goal here is to start with the pole below the cart and then to swing it up and balance it on top of the cart. This policy only has 5 parameters! It gets about 700 reward and the maximum possible is 1,000. So it’s definitely not perfect, but it seems clear that the policy learned something useful since in the video it makes multiple attempts to swing the pole up and keep it balanced.

This is a 16-parameter policy trained on Lunar Lander. The environment is modeled after the classic lunar lander game. The goal is simple: land the spaceship between the flags and don’t crash! It has 16 parameters because the action space is 2-dimensional and the observation space is 8-dimensional. The two actions are real-valued numbers, one representing left-right thrust and the other representing bottom rocket thrust. This policy actually solves the environment; the reward here was about 275, and the environment is considered solved once the reward exceeds 200.

This is our last pendulum, but it’s also our most boring one. This is just the standard single inverted pendulum, where the pole starts on top of the cart and the goal is to keep it there. This system is near-linear and, while still unstable, it is certainly more stable than the other pendulums we looked at. This policy has 4 parameters and it solves the environment, earning the maximum reward of 1,000.

Failures

This is a 96-parameter policy that totally failed to learn how to achieve its goal. The environment is called “BipedalWalker” and the goal is for the robot to walk to the end of the level. Instead, our policy just learned to fall forwards. If I gave it more update steps or ran more experiments where I carefully control and anneal the variance of the policy and/or CMA-ES over time, it could _maybe_ solve this environment. Giuseppe Cuccu solved this environment using a tiny, one-layer RNN with 116 parameters. I’m not confident that this environment can be solved with fewer parameters than that, but it might be a fun area to experiment in.

To give you an idea of what a successful BipedalWalker agent looks like, see the video below (I didn’t train this one):

The video below here is arguably a more interesting case of policy failure. This is the hopper robot environment. The goal is to move forward as far as possible within the time limit. This policy has 45 parameters, since the observation space has 15 dimensions and the action space has 3 dimensions. Obviously, the policy fails at moving forward as far as possible, but it still gets about 1,000 reward (the environment is considered solved above 2,500 reward). It earns this reward simply by _leaning_ forward!
