Paper Summary: Playing Atari with Deep Reinforcement Learning

This paper presents a deep reinforcement learning model that learns control policies directly from high-dimensional sensory input (raw pixels / video data). The model, developed by DeepMind, is a convolutional neural network (CNN) trained with a variant of Q-learning. It was applied to seven Atari 2600 games, outperforming all previous approaches on six of them and surpassing a human expert on three.

Introduction

The paper lists some of the challenges reinforcement learning (RL) faces compared to other deep learning (DL) settings. Unlike most DL algorithms, which learn from large amounts of labeled training data, an RL agent must learn from a scalar reward signal that is often sparse, noisy and delayed; the reward for a particular action may only arrive thousands of time-steps later. This delay between input and reward is in stark contrast to the direct association between input and target in supervised learning. Most DL algorithms also assume the data samples are independent, whereas in RL the input is a sequence of highly correlated states. Moreover, in RL the data distribution changes as the algorithm learns new behaviours.

This paper trains a CNN with a variant of the Q-learning algorithm to overcome the drawbacks mentioned above, using stochastic gradient descent (SGD) to update the weights. The problem of correlated data and non-stationary distributions is handled with an experience replay mechanism that randomly samples previous transitions. The approach is tested on the Atari game environment, which provides a training ground with high-dimensional visual input. The goal is to create a neural network agent whose only inputs are the raw video, the reward and terminal signals, and the set of valid actions. Notably, the same network architecture and hyperparameters are used for all the games.

1. Background

In the proposed task, an agent interacts with the environment Ɛ (the Atari emulator) by performing an action a_t from a set of legal actions A = {1, …, K}. Ɛ may be stochastic, as the game environment is unpredictable. At each time-step the agent observes an image x_t ∈ R^d, a vector of raw pixel values from the emulator, and receives a reward r_t that depends on the action taken. The agent only sees the current screen x_t, which does not fully describe the current state. Hence, the paper treats the sequence of observations and actions up to time t, s_t = x_1, a_1, x_2, a_2, …, a_{t−1}, x_t, as the state of the system. This makes the agent learn in a more human-like manner, from the history of what has appeared on screen. The goal is to maximize the sum of future rewards discounted by a factor γ, i.e. R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … up to the final time-step of the game.
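
To make the discounted return concrete, here is a tiny Python sketch (the helper name and the example rewards are mine, not the paper's) that computes R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … for a short episode:

    # Discounted return, accumulated backwards: R_t = r_t + gamma * R_{t+1}
    # (illustrative helper, not code from the paper)
    def discounted_return(rewards, gamma=0.99):
        ret = 0.0
        for r in reversed(rewards):
            ret = r + gamma * ret
        return ret

    print(discounted_return([0.0, 0.0, 1.0]))  # 0.99**2 = 0.9801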

The optimal action-value function is defined as Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π], and it obeys the Bellman equation. In principle Q* could be found by value iteration, but this is impractical for problems of this size, so the authors use a non-linear function approximator, a neural network with weights θ, to estimate the optimal Q-value. The Q-network is trained by minimizing a sequence of loss functions L_i(θ_i) that changes at each iteration i.
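
Concretely, the target for iteration i is y_i = r + γ max_a' Q(s', a'; θ_{i−1}) and the loss is L_i(θ_i) = E[(y_i − Q(s, a; θ_i))²]. Below is a minimal PyTorch sketch of one such update; the function and tensor names are my own, and q_net_prev stands for a copy of the network holding the previous iteration's weights θ_{i−1}:

    import torch
    import torch.nn.functional as F

    def q_learning_loss(q_net, q_net_prev, batch, gamma=0.99):
        # batch: tensors of states, int64 actions, rewards, next states,
        # and a 0/1 'done' flag marking terminal transitions
        states, actions, rewards, next_states, dones = batch
        # Q(s, a; theta_i) for the actions that were actually taken
        q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # y_i = r for terminal s', else r + gamma * max_a' Q(s', a'; theta_{i-1})
            max_next_q = q_net_prev(next_states).max(dim=1).values
            targets = rewards + gamma * (1.0 - dones) * max_next_q
        # L_i(theta_i) = E[(y_i - Q(s, a; theta_i))^2]
        return F.mse_loss(q_values, targets)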

The Q-learning algorithm is model-free: it does not need to learn the rules of the game, as opposed to model-based approaches where the dynamics of the environment are captured in a transition model. Q-learning is also off-policy: it learns about the greedy policy (always taking the action with the highest Q value) while following a behaviour policy that keeps exploring. In the game environment, the agent can take one of several possible actions, each associated with a Q value. Instead of always selecting the action with the highest Q value, the agent sometimes chooses other actions to see where they lead and 'build experience'. Concretely, the paper uses an ε-greedy strategy: the agent picks the greedy action with probability 1 − ε and a random action otherwise, where ε controls the amount of exploration. At first the agent mostly 'explores' the environment with ε close to 1, and ε is then annealed to a small value (0.1 in the paper) as the agent shifts towards 'exploiting' what it has learned.
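
As an illustration, here is a small Python sketch of ε-greedy action selection with linear annealing; the function names and the schedule endpoints are assumptions on my part (the paper anneals ε from 1.0 to 0.1 over the first million frames and keeps it fixed afterwards):

    import random

    def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
        # linearly anneal epsilon from eps_start down to eps_end
        frac = min(step / anneal_steps, 1.0)
        return eps_start + frac * (eps_end - eps_start)

    def select_action(q_values, step, num_actions):
        eps = epsilon_by_step(step)
        if random.random() < eps:
            return random.randrange(num_actions)                   # explore
        return max(range(num_actions), key=lambda a: q_values[a])  # exploit (greedy)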

The Atari game environment suffers from a 'perceptual aliasing' problem: a single frame at any given instant does not fully determine the state of the environment (for example, one frame alone does not reveal in which direction the ball is moving). The paper addresses this by stacking the last 4 frames of history to produce the input to the Q-function.

2. Related Work

The most closely related work is neural fitted Q-learning (NFQ), which optimizes a similar sequence of loss functions using the RPROP algorithm. However, NFQ uses a batch update whose computational cost per iteration grows with the size of the dataset, whereas this paper uses stochastic gradient descent, which has a low constant cost per update and scales to large datasets. NFQ has also been applied to tasks with visual input by first learning a low-dimensional representation with deep autoencoders. This paper's novelty lies in applying reinforcement learning end-to-end, directly from raw visual inputs, so that the network learns features that are directly relevant to discriminating action values.

3. Deep Reinforcement Learning architecture

The paper aims to connect a reinforcement learning algorithm to a deep neural network that operates directly on RGB images and is trained efficiently with SGD. To do so, it uses a technique called experience replay. The agent's experience at each step is stored as e_t = (s_t, a_t, r_t, s_{t+1}) in a dataset pooled over many episodes, called the replay memory. Q-learning updates are then applied to random minibatches of samples drawn from this pool. Sampling at random breaks the correlations between consecutive training samples and makes the data distribution appear more 'stationary' to the neural network, because each minibatch mixes experiences collected under many earlier behaviour policies.
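
A minimal Python sketch of such a replay memory, assuming a fixed capacity and uniform random sampling as described in the paper (the class and method names, as well as the default sizes, are mine):

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity=1_000_000):
            # only the most recent `capacity` transitions are kept
            self.buffer = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state, done):
            # store one experience tuple e_t = (s_t, a_t, r_t, s_{t+1}, done)
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # uniform random minibatch for a Q-learning update
            return random.sample(self.buffer, batch_size)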

The deep Q-learning algorithm with experience replay has the following advantages:

· Each step of the experience is potentially used in many weight updates, which allows for greater data efficiency.

· By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.

The proposed algorithm has the limitation that it only stores the last N experience tuples in memory and samples them uniformly, giving every transition the same importance. A better approach might emphasize the transitions from which the agent can learn the most, similar to prioritized sweeping.

The raw input image is 210x160 pixels, which is large and computationally expensive to process directly. To simplify, each raw frame is pre-processed by converting it to grey-scale and down-sampling it to 110x84, then cropping an 84x84 region that roughly covers the playing area, since the GPU implementation of 2D convolutions used by the authors expects square inputs. The last four frames of history are pre-processed in this way and stacked to produce the input to the Q-function.
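
A rough Python sketch of this pre-processing pipeline using OpenCV is shown below; the exact crop offset and the scaling to [0, 1] are assumptions on my part, since the paper only states that an 84x84 region covering the playing area is cropped:

    import cv2
    import numpy as np

    def preprocess(frame_rgb):
        gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)  # 210x160 RGB -> grey-scale
        small = cv2.resize(gray, (84, 110))                 # down-sample to 110x84
        cropped = small[18:102, :]                          # crop an 84x84 playing area (assumed offset)
        return cropped.astype(np.float32) / 255.0           # scale to [0, 1] (a common convenience)

    def stack_frames(last_four_frames):
        # stack the 4 most recent pre-processed frames into an 84x84x4 input
        return np.stack([preprocess(f) for f in last_four_frames], axis=-1)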

Instead of feeding a (state, action) pair to the neural network, this architecture takes only the state representation as input and has a separate output unit for each possible action. The main advantage of this design is that the Q-values of all possible actions in a given state can be computed with a single forward pass through the network.

The input to the neural network is the 84x84x4 stack of pre-processed frames, fed through the following DQN structure (a minimal sketch of this network follows the list):

· a convolutional layer with 16 8x8 filters, stride 4, and ReLU activation

· a convolutional layer with 32 4x4 filters, stride 2, and ReLU activation

· a fully connected layer with 256 rectified (ReLU) units, followed by a fully connected linear output layer with one output per valid action
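
Below is a minimal PyTorch sketch of this network; the paper's original implementation was not written in PyTorch, so this is only an approximation of the described architecture, with the 84x84x4 stack passed in PyTorch's channels-first layout as 4x84x84:

    import torch
    import torch.nn as nn

    class DQN(nn.Module):
        def __init__(self, num_actions):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 8x8 filters, stride 4 -> 16x20x20
                nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 filters, stride 2 -> 32x9x9
                nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 9 * 9, 256),                  # fully connected, 256 rectified units
                nn.ReLU(),
                nn.Linear(256, num_actions),                 # one linear output per valid action
            )

        def forward(self, x):                                # x: (batch, 4, 84, 84)
            return self.head(self.features(x))

Given a batch of stacked frames, DQN(num_actions)(x) returns one Q-value per action in a single forward pass, which is exactly what makes the greedy action cheap to compute.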

#q-learning #deep-reinforcement #deep-learning #atari #dqn
