Online learning methods are a dynamic family of algorithms behind many of the achievements in reinforcement learning over the past decade. Belonging to the sample-based class of reinforcement learning approaches, they estimate state values purely from repeated observations, eliminating the need for an explicit model of the transition dynamics. Unlike their offline counterparts, online approaches such as Temporal Difference (TD) learning update the values of states and actions incrementally during an episode of agent-environment interaction, so that performance improvements can be observed continuously.
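To make the idea of an incremental update concrete, here is a minimal sketch of a tabular TD(0) update. The five-state corridor, the `td0_update` helper, and the values chosen for `alpha` and `gamma` are illustrative assumptions, not part of the implementation discussed in this article.

```python
import random

def td0_update(values, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the value table in place and return the TD error."""
    # Bootstrap from the current estimate of the next state (zero if terminal).
    target = reward + (0.0 if done else gamma * values[next_state])
    td_error = target - values[state]
    values[state] += alpha * td_error
    return td_error

# Hypothetical 5-state corridor: start in the middle, step left or right at random.
# Walking off the right end gives reward 1, walking off the left end gives reward 0.
random.seed(0)
n_states = 5
values = [0.0] * n_states
for _ in range(2000):
    state = n_states // 2
    while True:
        next_state = state + random.choice((-1, 1))
        done = next_state < 0 or next_state >= n_states
        reward = 1.0 if next_state >= n_states else 0.0
        # Clamp the index so the table lookup stays valid; it is ignored when done.
        clamped = min(max(next_state, 0), n_states - 1)
        td0_update(values, state, reward, clamped, done, gamma=1.0)
        if done:
            break
        state = next_state

print([round(v, 2) for v in values])
```

Note that the value table is adjusted after every single step rather than at the end of the episode, which is what distinguishes this family of methods from approaches that wait for a complete return.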
Beyond TD, we’ve discussed the theory and practical implementation of Q-learning, an evolution of TD designed to produce incrementally more precise estimates of state-action values in an environment. Q-learning has become famous as the backbone of reinforcement learning approaches to simulated game environments, such as those available in OpenAI’s Gym. As we’ve already covered the theoretical aspects of Q-learning in past articles, they will not be repeated here.
An agent playing the basic scenario, from our previous Tensorflow implementation
In our previous article, we explored how Q-learning can be applied to training an agent to play a basic scenario in the classic FPS game Doom, through the use of the open-source OpenAI gym wrapper library Vizdoomgym. We’ll build upon that article by introducing a more complex Vizdoomgym scenario, and build our solution in Pytorch. This is the first in a series of articles investigating various RL algorithms for Doom, serving as our baseline.
The environment we’ll be exploring is the Defend The Line scenario of Vizdoomgym. The environment places the agent at one end of a hallway, with demons spawning at the other end. Some characteristics of the environment include:
Initial state of the Defend The Line scenario.
Implicitly, success in this environment requires balancing multiple objectives: the ideal player must learn to prioritize the brown monsters, which are able to damage the player upon spawning, while the pink monsters can be safely ignored for a period of time due to their travel time. This setup is in contrast to our previous Doom article, where only a single objective was presented.
Our Google Colaboratory implementation is written in Python using Pytorch, and can be found on the GradientCrescent Github. It is based on the approach detailed in Tabor’s excellent Reinforcement Learning course. As the implementation is quite convoluted, let’s summarize the order of actions required:
Let’s start by importing all of the necessary packages, including the OpenAI and Vizdoomgym environments. We’ll also install the AV package necessary for Torchvision, which we’ll use for visualization. Note that the runtime must be restarted after installation is complete.
!sudo apt-get update
!sudo apt-get install build-essential zlib1g-dev libsdl2-dev libjpeg-dev nasm tar libbz2-dev libgtk2.0-dev cmake git libfluidsynth-dev libgme-dev libopenal-dev timidity libwildmidi-dev unzip
# Boost libraries
!sudo apt-get install libboost-all-dev
# Lua binding dependencies
!apt-get install liblua5.1-dev
!sudo apt-get install cmake libboost-all-dev libgtk2.0-dev libsdl2-dev python-numpy git
!git clone https://github.com/shakenes/vizdoomgym.git
!python3 -m pip install -e vizdoomgym/
!pip install av
Next, we initialize our environment scenario, inspect the observation space and action space, and visualize our environment…
import gym
import vizdoomgym
env = gym.make('VizdoomDefendLine-v0')
n_outputs = env.action_space.n
print(n_outputs)
observation = env.reset()
import matplotlib.pyplot as plt
# Step through a few frames with a fixed action, then display the last one
for i in range(22):
    if i > 20:
        print(observation.shape)
        plt.imshow(observation)
        plt.show()
    observation, _, _, _ = env.step(1)
Next, we’ll define our preprocessing wrappers. These are classes that inherit from the OpenAI gym base wrapper class and override its methods, so that all of our necessary preprocessing is applied implicitly whenever we step or reset the environment. We’ll start by defining a wrapper that repeats every action for a number of frames and takes the element-wise maximum of the two most recent frames, a trick borrowed from Atari preprocessing, where it keeps objects that flicker between frames visible. You’ll notice a few additional arguments such as fire_first and no_ops; these are environment-specific, and of no consequence to us in Vizdoomgym.
import numpy as np

class RepeatActionAndMaxFrame(gym.Wrapper):
    # input: environment, repeat
    # init frame buffer as an array of zeros in shape 2 x the obs space
    def __init__(self, env=None, repeat=4, clip_reward=False, no_ops=0,
                 fire_first=False):
        super(RepeatActionAndMaxFrame, self).__init__(env)
        self.repeat = repeat
        self.shape = env.observation_space.low.shape
        # np.zeros_like((2, self.shape)) does not allocate a (2, H, W, C) buffer,
        # so we build the shape explicitly and match the observation dtype
        self.frame_buffer = np.zeros((2, *self.shape),
                                     dtype=env.observation_space.dtype)
        self.clip_reward = clip_reward
        self.no_ops = no_ops
        self.fire_first = fire_first

    def step(self, action):
        t_reward = 0.0
        done = False
        # Repeat the same action, accumulating the reward over the skipped frames
        for i in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            if self.clip_reward:
                reward = np.clip(np.array([reward]), -1, 1)[0]
            t_reward += reward
            # Alternate between the two buffer slots
            idx = i % 2
            self.frame_buffer[idx] = obs
            if done:
                break
        max_frame = np.maximum(self.frame_buffer[0], self.frame_buffer[1])
        return max_frame, t_reward, done, info

    def reset(self):
        obs = self.env.reset()
        no_ops = np.random.randint(self.no_ops) + 1 if self.no_ops > 0 else 0
        for _ in range(no_ops):
            # Keep the latest observation so the returned frame is not stale
            obs, _, done, _ = self.env.step(0)
            if done:
                obs = self.env.reset()
        # Fire first seems quite useless here, probably meant for something like Space Invaders
        if self.fire_first:
            assert self.env.unwrapped.get_action_meanings()[1] == 'FIRE'
            obs, _, _, _ = self.env.step(1)
        self.frame_buffer = np.zeros_like(self.frame_buffer)
        self.frame_buffer[0] = obs
        return obs
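To see what the frame buffer and the element-wise maximum actually do, here is a small standalone sketch using NumPy only. The two 2x2 frames are made-up toy data; real observations would be full RGB frames from the environment.

```python
import numpy as np

# Hypothetical 2x2 greyscale frames: each contains a bright pixel the other lacks.
frame_a = np.array([[0, 255], [0, 0]], dtype=np.uint8)
frame_b = np.array([[0, 0], [128, 0]], dtype=np.uint8)

# Preallocate a buffer holding two frames with the observation's shape and dtype.
frame_buffer = np.zeros((2, *frame_a.shape), dtype=frame_a.dtype)
frame_buffer[0] = frame_a
frame_buffer[1] = frame_b

# The element-wise maximum keeps the brightest value seen at each pixel.
max_frame = np.maximum(frame_buffer[0], frame_buffer[1])
print(max_frame)
```

Both bright pixels survive in the combined frame, which is why the wrapper returns the maximum rather than simply the last frame it received.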
Next, we define the preprocessing function for our observations. We’ll give every observation a consistent shape by redefining the observation space as a Box, moving the channel dimension to the front of our tensor, and resizing each frame to (84,84) from its original (320,480) resolution. We’ll also greyscale each frame, and normalize the entire image by dividing by a constant.
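The individual steps can be sketched in plain NumPy as follows. This is an illustration under stated assumptions rather than the wrapper used in the notebook: `preprocess_frame` is a hypothetical helper, the nearest-neighbour downsampling stands in for whatever image-resizing routine the full implementation uses, and 255 is assumed as the normalization constant.

```python
import numpy as np

def preprocess_frame(frame, out_shape=(84, 84)):
    """Greyscale, downsample, add a leading channel axis, and scale to [0, 1]."""
    # Luminance-weighted greyscale conversion of an (H, W, 3) RGB frame.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    grey = frame.astype(np.float32) @ weights
    # Nearest-neighbour downsampling: pick evenly spaced rows and columns.
    rows = np.linspace(0, grey.shape[0] - 1, out_shape[0]).astype(int)
    cols = np.linspace(0, grey.shape[1] - 1, out_shape[1]).astype(int)
    resized = grey[np.ix_(rows, cols)]
    # Channel-first layout (1, H, W), normalized by the assumed maximum pixel value.
    return resized[np.newaxis, :, :] / 255.0

# Dummy RGB frame with the dimensions quoted above, filled with random pixels.
dummy = np.random.randint(0, 256, size=(320, 480, 3), dtype=np.uint8)
processed = preprocess_frame(dummy)
print(processed.shape, processed.dtype)
```

Shrinking and greyscaling the frames greatly reduces the number of input values the network has to process, while the channel-first layout matches what Pytorch convolutional layers expect.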