Over the last few articles, we've discussed and implemented deep Q-learning (DQN) in the VizDoom game environment and examined its performance. Deep Q-learning is a highly flexible and responsive online learning approach that uses rapid intra-episode updates to its estimates of state-action (Q) values in an environment in order to maximize reward. Q-learning can be thought of as an off-policy approach to TD learning, in which the algorithm selects the state-action pair of highest value independently of the current policy being followed; it has been associated with many of the original breakthroughs in the OpenAI Atari Gym environments.

Gameplay of our vanilla DQN agent, trained over 500 episodes.

However, DQNs have a tendency toward optimistic overestimation of Q-values, particularly in the initial stages of training, creating a risk of suboptimal action selection and hence slower convergence. To understand this problem, recall the Q-learning update equation, which uses the current reward together with the highest-valued state-action pair of the next state to estimate the Q-value of the current state-action pair; this estimate serves as the training target for the DQN.

Q-learning update: Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]

Note the presence of a TD target within the error term, consisting of the sum of the current reward and the Q-value of the highest-valued state-action pair in the next state, irrespective of the agent's current policy. This is why Q-learning is often termed off-policy TD learning.
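The update above can be sketched in tabular form (an illustrative Python sketch; the state/action counts and hyperparameters here are arbitrary assumptions, and the DQN discussed in this series replaces the table with a neural network):

```python
import numpy as np

# Minimal tabular Q-learning sketch. Hypothetical sizes: 5 states, 3 actions.
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate, discount factor (assumed values)

def q_update(s, a, r, s_next, done):
    # Off-policy TD target: reward plus the discounted value of the
    # *greedy* action in the next state, regardless of the policy followed.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=2, done=False)
```

Because the target always takes the max over next-state Q-values, the update never consults the behavior policy, which is exactly the off-policy property described above.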

Hence, Q-learning relies on the "foresight" of selecting the action with the highest value for the next state. But how can we be sure that **the best action for the next state is the action with the highest Q-value?** By definition, the accuracy of Q-values depends on the state-action pairs we have previously explored.
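The overestimation can be made concrete with a small numerical sketch (the action count and noise scale are assumptions for illustration): if every action in the next state has a true value of zero but our estimates carry zero-mean noise, taking the max over the noisy estimates is biased upward, since E[max_a Q̂(s′, a)] ≥ max_a E[Q̂(s′, a)] = 0.

```python
import numpy as np

# True value of each action is 0; estimates are corrupted by zero-mean noise.
rng = np.random.default_rng(0)
n_actions, n_trials = 6, 10_000
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

# Average of max over noisy estimates: systematically positive, not 0.
bias = noisy_q.max(axis=1).mean()
print(f"average max over noisy estimates: {bias:.3f}")
```

Early in training, when Q-estimates are dominated by exactly this kind of noise, the max operator in the TD target therefore inflates the targets the network learns from.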


Discovering Unconventional Strategies for Doom using Double Deep Q-learning