# Solving Open AI’s CartPole using Reinforcement Learning Part-2

In the first tutorial, I introduced the most basic Reinforcement Learning method, Q-learning, to solve the CartPole problem. Because of its computational limitations, it works only in simple environments, where the number of states and possible actions is relatively small. Calculating, storing, and updating Q-values for every action in a more complex environment is either impossible or highly inefficient. This is where the Deep Q-Network comes into play.

## Background Information

Deep Q-Learning was introduced in 2013 in the Playing Atari with Deep Reinforcement Learning paper by the DeepMind team. An earlier, similar approach was TD-Gammon in 1992. That algorithm achieved a superhuman level of backgammon play, but the method did not carry over to games like chess, Go, or checkers. DeepMind was able to surpass human performance in 3 out of 7 Atari games, using raw images as input and the same hyperparameters for all games. This was a breakthrough in the area of more general learning.

The basic idea of DQN is that it combines Q-learning with deep learning. We get rid of the Q-table and use a neural network instead to approximate the action-value function Q(s, a). The state is passed to the network, and as output we receive the estimated Q-value for each action.
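For CartPole this mapping is small: the observation has 4 components and there are 2 actions. A minimal sketch of such a network in PyTorch might look like the following (the layer sizes and hidden width are illustrative assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a CartPole state (4 values) to one Q-value per action (2 values)."""

    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action: Q(s, a)
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.rand(1, 4)            # a dummy CartPole observation
q_values = q_net(state)             # shape (1, 2): Q(s, left), Q(s, right)
action = q_values.argmax(dim=1)     # greedy action: pick the larger Q-value
```

Acting greedily with respect to the output is then just an `argmax` over the action dimension.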

In order to train the network, we need a target value, also known as a ground truth. The question is: how do we evaluate the loss function without actually having a labeled dataset?

Well, we create target values on the fly using the Bellman equation: y = r + γ · max Q(s′, a′), where r is the reward and γ is the discount factor.
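In practice the targets are computed for a whole batch of transitions at once. A sketch of that computation, with placeholder values for the batch and a γ of 0.99 (my assumption, a common choice), could be:

```python
import torch

gamma = 0.99  # discount factor (illustrative value)

# Placeholder batch of 3 transitions.
rewards = torch.tensor([1.0, 1.0, 1.0])
dones = torch.tensor([0.0, 0.0, 1.0])      # 1.0 marks a terminal transition
next_q = torch.tensor([[0.5, 0.7],         # Q(s', a) for each next state,
                       [0.2, 0.1],         # as predicted by the network
                       [0.0, 0.0]])

# y = r + gamma * max_a' Q(s', a'); no bootstrapping on terminal states.
targets = rewards + gamma * (1.0 - dones) * next_q.max(dim=1).values
```

The `(1.0 - dones)` factor zeroes out the bootstrap term for terminal states, where the target is just the reward.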

This method is called bootstrapping: we estimate something based on another estimate. Essentially, we estimate the current action value Q(s, a) by using an estimate of the future value Q(s′, a′). A problem arises when one network is used to predict both values. It is like a dog chasing its own tail: the weights are updated to move predictions closer to the target Q-values, but the targets keep moving as well, because the same network produces them.

The solution was presented in the DeepMind paper Human-level control through deep reinforcement learning. The idea is to use a separate network to predict the target values. Every C time steps, the weights of the policy network are copied to the target network. This makes the algorithm more stable, since the network is no longer chasing a nonstationary target.
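The periodic copy can be sketched in a few lines. A `deepcopy` creates the initial target network, and `load_state_dict` performs the hard update every C steps (the value of C below is illustrative, and the single linear layer stands in for the full Q-network):

```python
import copy
import torch
import torch.nn as nn

policy_net = nn.Linear(4, 2)            # stand-in for the full Q-network
target_net = copy.deepcopy(policy_net)  # frozen copy used for target values

SYNC_EVERY = 1000  # "C" from the paper; the value here is an assumption

def maybe_sync(step):
    """Every C steps, copy policy weights into the target network."""
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())
```

Between syncs, the target network's parameters stay fixed, so the Bellman targets do not shift with every gradient update.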

In order to make the neural network work, we need four values: state (S), action (A), reward (R), and future state (S′). These transitions are stored in a replay memory and then randomly sampled for training. This process is called experience replay and was also introduced by DeepMind.
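A replay memory is commonly implemented as a fixed-size buffer with uniform random sampling. The sketch below also stores a done flag alongside the four values, which is a common addition (it is needed for the terminal-state handling in the target computation); the capacity is an illustrative choice:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (S, A, R, S', done) transitions.

    Random sampling breaks the correlation between consecutive
    transitions, which stabilizes training.
    """

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=100)
for i in range(5):
    memory.push([0.0] * 4, 0, 1.0, [0.0] * 4, False)
batch = memory.sample(3)
```

Because the `deque` has a `maxlen`, old transitions are discarded automatically once the buffer is full.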
