Reinforcement learning has gained tremendous popularity in the last decade with a series of successful real-world applications in robotics, games and many other fields.

In this article, I will provide a high-level structural overview of classic reinforcement learning algorithms. The discussion is organized around their similarities and differences in how the algorithms work.

RL Basics

Let’s start with a quick refresher on some basic concepts. If you are already familiar with all the terms of RL, feel free to skip this section.

Reinforcement learning models are a class of state-based models built on the Markov decision process (MDP). The basic elements of RL include:

Episode (rollout): playing out the whole sequence of states and actions until reaching the terminal state;

Current state s (or s_t): the state the agent is currently in;

Next state s’ (or s_t+1): the state reached next from the current state;

Action a: the action to take at state s;

Transition probability P(s’|s, a): the probability of reaching s’ when taking action a at state s;

Policy π(s, a): a mapping from each state to an action that determines how the agent acts at each state; it can be either deterministic or stochastic;

Reward r (or R(s, a)): a reward function that generates rewards for taking action a at state s;

Return G_t: the total future reward collected starting from state s_t;

Value V(s): expected return for starting from state s;


Q value Q(s, a): expected return for starting from state s and taking action a;
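For reference, these last three quantities are usually written as follows, with a discount factor γ ∈ [0, 1]:

```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\, a_t = a \right]
```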

Bellman equation

According to the Bellman equation, the value of the current state equals the current reward plus the discounted (γ) value of the next state, following the policy π. The same relationship can also be expressed using the Q value.
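Written out in standard notation, the two forms are:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s') \right]

Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a')
```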

This is the theoretical core of most reinforcement learning algorithms.

Prediction vs. Control Tasks

There are two fundamental tasks of reinforcement learning: prediction and control.

In prediction tasks, we are given a policy and our goal is to evaluate it by estimating the value or Q value of taking actions following this policy.

In control tasks, we are not given a policy; the goal is to find the optimal policy, the one that allows us to collect the most reward. In this article, we will focus only on control problems.

RL Algorithm Structure

Below is a graph I made to visualize the high-level structure of different types of algorithms. In the next few sections, we will delve into the intricacies of each type.

MDP World

In the MDP world, we have a mental model of how the world works, meaning that we know the MDP dynamics (transition P(s’|s,a) and reward function R(s, a)), so we can directly build a model using the Bellman equation.
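As a toy illustration, here is one way such known dynamics could be stored as plain arrays; the MDP below (3 states, 2 actions) and all of its numbers are made up purely for this example, with P[s, a, s'] holding transition probabilities and R[s, a] the expected rewards.

```python
import numpy as np

# A tiny, made-up MDP with 3 states and 2 actions.
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
# R[s, a]     = expected immediate reward for taking action a in state s.
n_states, n_actions = 3, 2
gamma = 0.9  # discount factor

P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]
P[0, 1] = [0.0, 0.9, 0.1]
P[1, 0] = [0.1, 0.8, 0.1]
P[1, 1] = [0.0, 0.0, 1.0]
P[2, 0] = [0.0, 0.0, 1.0]  # state 2 is absorbing
P[2, 1] = [0.0, 0.0, 1.0]

R = np.array([[0.0, 0.0],
              [0.0, 1.0],   # taking action 1 in state 1 pays a reward of 1
              [0.0, 0.0]])
```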

Again, in control tasks our goal is to find the policy that gives us the maximum reward. To achieve this, we use dynamic programming.

Dynamic Programming (Iterative Methods)

1. Policy Iteration

Policy iteration essentially performs two steps repeatedly until convergence: policy evaluation and policy improvement.

In the policy evaluation step, we evaluate the policy π by calculating the Q value at each state s using the Bellman equation:
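For a deterministic policy, this evaluation step amounts to repeatedly applying:

```latex
Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^{\pi}\bigl(s', \pi(s')\bigr)
```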

In the policy improvement step, we update the policy by greedily searching for the action that maximizes the Q value at each state.
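That is, the improved policy simply picks the highest-valued action in every state:

```latex
\pi'(s) = \arg\max_{a} Q^{\pi}(s, a)
```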

Let’s see how policy iteration works.
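Below is a minimal sketch of tabular policy iteration in Python, reusing the toy P, R, and gamma arrays from the sketch above; it is meant only to illustrate how the two steps alternate.

```python
import numpy as np

def policy_evaluation(policy, P, R, gamma, tol=1e-8):
    """Compute Q^pi for a deterministic policy (an array mapping state -> action)."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V_next = Q[np.arange(n_states), policy]    # value of each state under pi
        Q_new = R + gamma * P @ V_next             # Bellman expectation backup
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

def policy_iteration(P, R, gamma):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)         # start from an arbitrary policy
    while True:
        Q = policy_evaluation(policy, P, R, gamma) # 1. policy evaluation
        new_policy = Q.argmax(axis=1)              # 2. greedy policy improvement
        if np.array_equal(new_policy, policy):     # stop when the policy is stable
            return policy, Q
        policy = new_policy

policy, Q = policy_iteration(P, R, gamma)
```

The loop stops once the greedy improvement step no longer changes the policy, which is the usual convergence criterion for policy iteration.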

2. Value Iteration

Value iteration combines the two steps in policy iteration so we only need to update the Q value. We can interpret value iteration as always following a greedy policy because at each step it always tries to find and take the action that maximizes the value. Once the values converge, the optimal policy can be extracted from the value function.
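A matching sketch of value iteration over the same toy arrays, again for illustration only:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Apply the Bellman optimality backup until the Q values converge."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)              # greedy value of each state
        Q_new = R + gamma * P @ V      # Bellman optimality backup
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    policy = Q.argmax(axis=1)          # extract the greedy (optimal) policy
    return policy, Q

policy, Q = value_iteration(P, R, gamma)
```

Once the Q values stop changing, the optimal policy is simply the greedy policy with respect to the converged Q table.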

In most real-world scenarios, we don’t know the MDP dynamics so the applications of iterative methods are limited. In the next section, we will switch gears and discuss reinforcement learning methods that can deal with the unknown world.
