The cartpole problem is a prominent benchmark in reinforcement learning, and OpenAI Gym's implementation is the best known. It is also the most popular environment on Gym's leaderboards, which makes sense: it is one of the easier problems in classic control. But just how easy is it?

As it stands right now, a solution needs to balance the pole for an average of at least 195 steps over 100 consecutive episodes. Keep in mind that each episode is capped at 200 steps, so a solution does need to be somewhat stable.

For a while now, I have thought that balancing the pole for only 200 steps is too easy. So, in an attempt to confirm my suspicions, I have brute-forced the problem.

If you want to follow along with all the code that I used, please feel free to view it here: https://github.com/adamprice97/Control/blob/master/BruteForceCartepole.ipynb


In cryptography, a brute-force attack enumerates all possible password strings and then tries them one by one until it finds the correct one. To take a similar approach in control, we need to enumerate a range of policies for the cartpole. That would be impossible over a continuous state space, so we will have to discretise the state space.

The cartpole problem has 4 observations: cart position, cart velocity, pole angle, and pole angular velocity. We will disregard cart position and cart velocity entirely: 200 timesteps is not long enough for the cart to travel far enough to end the episode. That leaves pole angle and pole angular velocity, each of which we discretise into 3 bins, creating 9 possible states of the environment.
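The discretisation can be sketched with `numpy.digitize`. The bin edges below are illustrative choices, not necessarily the thresholds used in the original notebook:

```python
import numpy as np

def discretise(pole_angle, pole_velocity,
               angle_bins=(-0.05, 0.05), velocity_bins=(-0.5, 0.5)):
    """Map the two continuous observations to one of 9 discrete states.

    Each observation falls into one of 3 bins (below, between, or above
    the two edges), so the pair (a, v) indexes a 3x3 grid of states.
    The edge values here are illustrative.
    """
    a = int(np.digitize(pole_angle, angle_bins))      # 0, 1 or 2
    v = int(np.digitize(pole_velocity, velocity_bins))  # 0, 1 or 2
    return a, v
```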

The cartpole problem only has 2 actions, so with 9 states there are exactly 512 (2⁹) deterministic greedy policies. All we need now is a way to generate these policies, and then we can attempt to solve the problem by brute force:

In each state, the policy pushes the cart either left or right (0 or 1). So, to create the policies, we simply generate the 9-bit binary representation of every number from 0 to 511. Each bit string then gets reshaped into a 3 by 3 matrix so it can be indexed directly by state.
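The enumeration described above can be sketched in a few lines (the function name is mine, not the notebook's):

```python
import numpy as np

def generate_policies():
    """Enumerate all 512 deterministic policies over the 9 states.

    Policy i is the 9-bit binary expansion of i, reshaped into a 3x3
    matrix indexed by (angle bin, velocity bin); each entry is the
    action to take (0 = push left, 1 = push right).
    """
    policies = []
    for i in range(512):
        bits = [int(b) for b in format(i, '09b')]
        policies.append(np.array(bits).reshape(3, 3))
    return policies
```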

It would be computationally expensive to evaluate every policy fully, so as a first pass we ran each policy on the problem once and kept track of those that scored at least 195.
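This first pass might look like the sketch below. It assumes a Gym-style environment whose observation is (cart position, cart velocity, pole angle, pole angular velocity); the bin edges and function names are illustrative. A real run would use `gym.make('CartPole-v0')` in place of the stand-in environment:

```python
import numpy as np

def run_episode(env, policy, angle_bins=(-0.05, 0.05),
                velocity_bins=(-0.5, 0.5)):
    """Play one episode under a 3x3 deterministic policy; return the score."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        a = int(np.digitize(obs[2], angle_bins))    # pole angle bin
        v = int(np.digitize(obs[3], velocity_bins))  # pole angular velocity bin
        obs, reward, done, _ = env.step(int(policy[a, v]))
        total += reward
    return total

def first_pass(env, policies, threshold=195):
    """Keep every policy that clears the threshold on a single episode."""
    return [p for p in policies if run_episode(env, p) >= threshold]

class DummyEnv:
    """Stand-in env for a quick check; a real run would use CartPole-v0.

    The pole never moves and every step is worth 1 reward."""
    def __init__(self, steps):
        self.steps, self.t = steps, 0
    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]
    def step(self, action):
        self.t += 1
        return [0.0, 0.0, 0.0, 0.0], 1.0, self.t >= self.steps, {}
```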

When we ran this script, we found that 34 of the initial 512 policies were potential solutions. To examine them further, we need to run each candidate for 100 episodes and average its performance.
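The second pass is just an average over repeated episodes. In the sketch below, `run_episode(policy)` is assumed to be any callable that plays one episode and returns its score (for instance, the single-episode runner with the environment bound in); the function names are mine:

```python
import numpy as np

def average_score(run_episode, policy, episodes=100):
    """Average a policy's score over the given number of episodes."""
    return float(np.mean([run_episode(policy) for _ in range(episodes)]))

def solutions(run_episode, candidates, threshold=195):
    """Keep every candidate whose 100-episode average clears the threshold."""
    return [p for p in candidates
            if average_score(run_episode, p) >= threshold]
```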

When we ran this, we found 8 solutions in 39.1 seconds. So there we have it: we successfully brute-forced the cartpole problem!

#cartpole #openai-gym #control #brute-force #reinforcement-learning #deep-learning

Brute Forcing The Cartpole Problem