Multi-objective optimization with Deep Q-learning

A Reinforcement Learning Implementation in Pytorch.

Introduction

Online learning methods are a dynamic family of algorithms powering many of the latest achievements in reinforcement learning over the past decade. Belonging to the sample-based learning class of reinforcement learning approaches, online learning methods allow for the determination of state values simply through repeated observations, eliminating the need for explicit transition dynamics. Unlike their offline counterparts, online learning approaches such as Temporal Difference learning (TD) update the values of states and actions incrementally during episodes of agent-environment interaction, allowing for constant, incremental performance improvements to be observed.

Beyond TD, we’ve discussed the theory and practical implementations of Q-learning, an evolution of TD designed to allow for incrementally more precise estimation of state-action values in an environment. Q-learning has become famous as the backbone of reinforcement learning approaches to simulated game environments, such as those observed in OpenAI’s gyms. As we’ve already covered the theoretical aspects of Q-learning in past articles, they will not be repeated here.
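For readers who would like a brief refresher, a minimal sketch of the underlying tabular Q-learning update is shown below. This is illustrative only: the Q-table, state indices, and hyperparameters are placeholders, not part of the implementation discussed later in this article.

import numpy as np

#Tabular Q-learning update (illustrative sketch only)
#Q is an (n_states x n_actions) table of state-action value estimates
def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
  #TD target: observed reward plus the discounted value of the best next action
  td_target = reward + gamma * np.max(Q[next_state])
  #Nudge the current estimate towards the target by the learning rate
  Q[state, action] += alpha * (td_target - Q[state, action])
  return Q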

An agent playing the basic scenario, from our previous Tensorflow implementation

In our previous article, we explored how Q-learning can be applied to training an agent to play a basic scenario in the classic FPS game Doom, through the use of the open-source OpenAI gym wrapper library Vizdoomgym. We’ll build upon that article by introducing a more complex Vizdoomgym scenario, and build our solution in Pytorch. This is the first in a series of articles investigating various RL algorithms for Doom, serving as our baseline.

Implementation

The environment we’ll be exploring is the Defend The Line scenario of Vizdoomgym. This scenario places the agent at one end of a hallway, with demons spawning at the other end. Some characteristics of the environment include:

  • An action space of 3: fire, turn left, and turn right. Strafing is not allowed.
  • Brown monsters that shoot fireballs at the player with a 100% hit rate.
  • Pink monsters that attempt to move close in a zig-zagged pattern to bite the player.
  • +15 points for killing 16 monsters.
  • +1 point for killing a monster.
  • -1 point for dying.

Initial state of the Defend The Line scenario.

Implicitly, success in this environment requires balancing multiple objectives: the ideal player must learn to prioritize the brown monsters, which are able to damage the player upon spawning, while the pink monsters can be safely ignored for a period of time due to their travel time. This setup is in contrast to our previous Doom article, where a single objective was presented.

Our Google Colaboratory implementation is written in Python utilizing Pytorch, and can be found on the GradientCrescent Github. Our approach is based on the one detailed in Tabor’s excellent Reinforcement Learning course. As the implementation is quite involved, let’s summarize the order of actions required:

  1. We define the preprocessing functions needed to maximize performance, and introduce them as wrappers for our gym environment for automation. These focus on capturing the motion of the environment through the use of element-wise maxima and frame stacking.
  2. We define our Deep Q-learning neural network. This is a CNN that takes in-game screen images and outputs the probabilities of each of the actions, or Q-values, in the Doom gamespace. To acquire a tensor of probabilities, we do not include any activation function in our final layer.
  3. As Q-learning requires us to have knowledge of both the current and next states, we need to start with data generation. We feed preprocessed input images of the game space, representing initial states s, into the network, and acquire the initial probability distribution of actions, or Q-values. Before training, these values will appear random and sub-optimal.
  4. With our tensor of probabilities, we then select the action with the current highest probability using the argmax() function, and use it to build an epsilon-greedy policy.
  5. Using our policy, we’ll then select the action a, and evaluate our decision in the gym environment to receive information on the new state s’, the reward r, and whether the episode has been finished.
  6. We store this combination of information in a buffer in the list form <s,a,r,s’,d>, and repeat steps 3–5 for a preset number of times to build up a large enough buffer dataset.
  7. Generate our target y-values, R’ and A’, that are required for the loss calculation. While the former is simply discounted from R, we obtain A’ by feeding s’ into our network (a minimal sketch of this target and loss calculation follows the list below).
  8. With all of our components in place, we can then calculate the loss to train our network.
  9. Once training has finished, we’ll evaluate the performance of our agent in a new game episode, and record the performance.
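To make steps 7 and 8 more concrete before diving into the code, here is a minimal sketch of how the targets and loss could be computed for a sampled batch in Pytorch. The function and tensor names are illustrative assumptions and do not correspond one-to-one with the code in our notebook.

import torch
import torch.nn.functional as F

#Illustrative sketch of the Q-learning target and loss for a sampled batch.
#Assumes q_net maps a batch of stacked frames to one Q-value per action, and
#that states, actions, rewards, next_states and dones are batched tensors.
def q_learning_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
  #Q(s, a) for the actions that were actually taken
  q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
  with torch.no_grad():
    #A': greedy Q-values of the next states, without gradient flow
    q_next = q_net(next_states).max(dim=1).values
    #R': discounted target, with the bootstrap term zeroed out on episode end
    q_target = rewards + gamma * q_next * (1 - dones.float())
  #Mean squared error between prediction and target is used to train the network
  return F.mse_loss(q_pred, q_target)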

Let’s start by importing all of the necessary packages, including the OpenAI and Vizdoomgym environments. We’ll also install the AV package necessary for Torchvision, which we’ll use for visualization. Note that the runtime must be restarted after installation is complete.

!sudo apt-get update

!sudo apt-get install build-essential zlib1g-dev libsdl2-dev libjpeg-dev nasm tar libbz2-dev libgtk2.0-dev cmake git libfluidsynth-dev libgme-dev libopenal-dev timidity libwildmidi-dev unzip
# Boost libraries
!sudo apt-get install libboost-all-dev
# Lua binding dependencies
!apt-get install liblua5.1-dev
!sudo apt-get install cmake libboost-all-dev libgtk2.0-dev libsdl2-dev python-numpy git
!git clone https://github.com/shakenes/vizdoomgym.git
!python3 -m pip install -e vizdoomgym/
!pip install av

Next, we initialize our environment scenario, inspect the observation space and action space, and visualize our environment…

import gym
import vizdoomgym
import matplotlib.pyplot as plt

env = gym.make('VizdoomDefendLine-v0')
n_outputs = env.action_space.n
print(n_outputs)

observation = env.reset()
for i in range(22):
  if i > 20:
    print(observation.shape)
    plt.imshow(observation)
    plt.show()
  observation, _, _, _ = env.step(1)

Next, we’ll define our preprocessing wrappers. These are classes that inherit from the OpenAI gym base classes, overriding their methods and variables in order to implicitly provide all of our necessary preprocessing. We’ll start by defining a wrapper that repeats every action for a number of frames and performs an element-wise maximum over the buffered frames in order to better capture motion. You’ll notice a few auxiliary arguments such as fire_first and no_ops; these are environment-specific, and of no consequence to us in Vizdoomgym.

import numpy as np

class RepeatActionAndMaxFrame(gym.Wrapper):
    #input: environment, repeat
    #init frame buffer as an array of zeros in shape 2 x the obs space
    def __init__(self, env=None, repeat=4, clip_reward=False, no_ops=0,
                 fire_first=False):
        super(RepeatActionAndMaxFrame, self).__init__(env)
        self.repeat = repeat
        self.shape = env.observation_space.low.shape
        self.frame_buffer = np.zeros((2, *self.shape))
        self.clip_reward = clip_reward
        self.no_ops = no_ops
        self.fire_first = fire_first

    def step(self, action):
        t_reward = 0.0
        done = False
        for i in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            if self.clip_reward:
                reward = np.clip(np.array([reward]), -1, 1)[0]
            t_reward += reward
            idx = i % 2
            self.frame_buffer[idx] = obs
            if done:
                break
        #element-wise maximum of the two most recent frames
        max_frame = np.maximum(self.frame_buffer[0], self.frame_buffer[1])
        return max_frame, t_reward, done, info

    def reset(self):
        obs = self.env.reset()
        no_ops = np.random.randint(self.no_ops) + 1 if self.no_ops > 0 else 0
        for _ in range(no_ops):
            _, _, done, _ = self.env.step(0)
            if done:
                self.env.reset()
        #fire_first seems quite useless here, probably meant for something like Space Invaders
        if self.fire_first:
            assert self.env.unwrapped.get_action_meanings()[1] == 'FIRE'
            obs, _, _, _ = self.env.step(1)
        self.frame_buffer = np.zeros((2, *self.shape))
        self.frame_buffer[0] = obs
        return obs
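
As a quick, illustrative sanity check (not part of the original notebook), the wrapper can be applied directly to the environment we created earlier:

#Illustrative usage of the wrapper defined above
env = RepeatActionAndMaxFrame(gym.make('VizdoomDefendLine-v0'), repeat=4)
obs = env.reset()
max_frame, reward, done, info = env.step(env.action_space.sample())
print(max_frame.shape, reward, done)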

Next, we define the preprocessing function for our observations. We’ll make our environment symmetrical by converting it into the Box space, swapping the channel integer to the front of our tensor, and resizing it to an area of (84,84) from its original (320,480) resolution. We’ll also greyscale our environment, and normalize the entire image by dividing by a constant.
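
A minimal sketch of what such an observation wrapper might look like is shown below; it assumes OpenCV (cv2) for grayscaling and resizing, and the exact class name and normalization constant in our notebook may differ.

import cv2
import numpy as np
import gym

#Illustrative observation wrapper: grayscale, resize to (84, 84),
#move the channel dimension to the front and normalize to [0, 1]
class PreprocessFrame(gym.ObservationWrapper):
    def __init__(self, env, shape=(84, 84, 1)):
        super(PreprocessFrame, self).__init__(env)
        self.shape = (shape[2], shape[0], shape[1])
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0,
                                                shape=self.shape,
                                                dtype=np.float32)

    def observation(self, obs):
        gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
        resized = cv2.resize(gray, self.shape[1:],
                             interpolation=cv2.INTER_AREA)
        channel_first = np.array(resized, dtype=np.float32).reshape(self.shape)
        return channel_first / 255.0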
