Overview

In this, the fourth part of our series on Multi-Armed Bandits, we’re going to take a look at the Upper Confidence Bound (UCB) algorithm that can be used to solve the bandit problem.
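Before getting into the details, it helps to have the shape of the algorithm in mind. In its standard UCB1 form (this is the textbook rule, as given for example by Sutton and Barto, rather than anything specific to this series' code), the algorithm selects, at each time step t, the action that maximises the current value estimate plus an uncertainty bonus:

$$a_t = \underset{a}{\operatorname{argmax}} \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}\, \right]$$

where Q_t(a) is the estimated value of action a, N_t(a) is the number of times action a has been selected so far, and the constant c controls the amount of exploration. Actions that have rarely been tried receive a large bonus and so get explored; as N_t(a) grows, the bonus shrinks and the choice is driven by the value estimate alone.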

If you’re not already familiar with the bandit problem and its terminology you may want to first take a look at the earlier parts of this series, which are as follows:

  • [Part 1: Mathematical Framework and Terminology]
    • all the basic information needed to get started
  • [Part 2: The Bandit Framework]
    • a description of the code and test framework
  • [Part 3: Bandit Algorithms]
    • [The Greedy Algorithm]
    • [The Optimistic-Greedy Algorithm]
    • [The Epsilon-Greedy Algorithm (ε-Greedy)]
    • [Regret]

All code for the bandit algorithms and testing framework can be found on GitHub: Multi_Armed_Bandits

Recap

Baby Robot is lost in the mall. Using Reinforcement Learning we want to help him find his way back to his mum. However, before he can even begin looking for her, he needs to recharge from a set of power sockets, each of which gives a slightly different amount of charge.

Using the strategies from the _multi-armed bandit_ problem, we need to find the best socket, in the shortest amount of time, to allow Baby Robot to get charged up and on his way.


Baby Robot has entered a charging room containing 5 different power sockets. Each of these sockets returns a slightly different amount of charge. We want to get Baby Robot charged up in the minimum amount of time, so we need to locate the best socket and then use it until charging is complete.

This is identical to the Multi-Armed Bandit problem except that, instead of looking for a slot machine that gives the best payout, we’re looking for a power socket that gives the most charge.
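To make the analogy concrete, a socket can be simulated as a bandit arm that returns a noisy charge sample around its own fixed mean. The sketch below is a minimal illustration only; the class and variable names are my own and may differ from those in the linked repository:

```python
import numpy as np

class PowerSocket:
    """One power socket, playing the role of a single bandit arm."""

    def __init__(self, q):
        self.q = q  # the socket's true mean charge (unknown to the agent)

    def charge(self):
        # a noisy charge sample: true mean plus unit-variance Gaussian noise
        return np.random.randn() + self.q


# a charging room with 5 sockets, each giving a slightly different amount of charge
sockets = [PowerSocket(q) for q in [2, 1, 3, 5, 4]]

# repeatedly sampling one socket reveals its approximate mean output
samples = [sockets[3].charge() for _ in range(1000)]
print(f"estimated output of socket 4: {np.mean(samples):.2f}")
```

With an environment like this in place, the bandit algorithms from the earlier parts of the series, and UCB below, differ only in how they decide which socket to sample next.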

