Reinforcement learning has been at the center of many AI breakthroughs in recent years. The ability of algorithms to learn without the onerous constraints of data collection opens the door to key advancements. Google’s DeepMind has been at the forefront of reinforcement learning research, with widely publicized breakthroughs such as AlphaZero, a self-trained agent that reached world-class Go play in a span of four days.¹

Traditional reinforcement learning algorithms such as Q-learning and SARSA work well in contained single-agent environments, where they are able to explore continually until they find an optimal strategy. However, a key assumption of these algorithms is a stationary environment, meaning the transition probabilities and other dynamics remain unchanged from episode to episode. When agents are trained against each other, as in poker, this assumption breaks down: both agents’ strategies are continually evolving, producing a dynamic environment. Furthermore, the algorithms above are deterministic in nature, meaning that for a given state one action will always be considered optimal over any other.
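To make the determinism concrete, here is a minimal tabular Q-learning update (the state and action spaces are hypothetical toy values, not from any poker environment). The final `argmax` is the key point: given a state, the learned policy always names a single best action.

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate, discount factor

def q_update(s, a, r, s_next):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(0, 1, 1.0, 2)          # observe reward 1.0 for action 1 in state 0
greedy_action = int(Q[0].argmax())  # deterministic: always picks the highest Q-value
```

Against an adapting opponent, the values this table converges toward keep shifting, which is exactly the non-stationarity problem described above.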

Deterministic policies, however, do not hold up in everyday life or in poker. For example, when given an opportunity in poker, a player can bluff, meaning they represent better cards than they actually have by putting in an oversized bet meant to scare the other players into folding. But if a player bluffed every time, opponents would recognize the pattern and easily bankrupt the player. This motivates another class of algorithms, policy gradient methods, which output a stochastic optimal policy that can then be sampled from.
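A stochastic policy can be illustrated with a toy bluffing distribution (the 70/30 split here is an assumption for demonstration, not a recommended strategy). Each individual hand is unpredictable, but the long-run bluffing frequency matches the policy:

```python
import numpy as np

# Hypothetical stochastic policy over two actions.
policy = {"fold": 0.7, "bluff": 0.3}

rng = np.random.default_rng(0)
actions = list(policy)
probs = list(policy.values())

# Sample 1,000 hands from the policy and measure the realized bluff rate.
sampled = rng.choice(len(actions), size=1000, p=probs)
bluff_rate = np.mean(sampled == actions.index("bluff"))
```

An opponent can learn the 30% frequency only in aggregate; they cannot predict any single hand, which is what makes bluffing viable at all.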

Still, a large problem with traditional policy gradient methods is a lack of convergence in dynamic environments, as well as relatively low data efficiency. Luckily, several algorithms published in recent years support a competitive self-play environment that leads to optimal or near-optimal strategies, such as Proximal Policy Optimization (PPO), published by OpenAI in 2017.² The uniqueness of PPO stems from its objective function, which clips the probability ratio between the previous and the new model, encouraging small policy changes instead of drastic ones.

Probability ratio from the PPO paper:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

Objective function from the PPO paper, where $\hat{A}_t$ is the advantage function:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\big)\,\hat{A}_t\right)\right]$$
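The clipped surrogate objective can be sketched in a few lines of NumPy; this is a toy computation on hand-picked numbers rather than real rollout data, but it follows the formula from the PPO paper:

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantage, eps=0.2):
    """Negative clipped surrogate objective (a loss to minimize)."""
    ratio = np.exp(new_logp - old_logp)         # r_t(theta)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)  # keep the ratio near 1
    # Pessimistic bound: take the min of the unclipped and clipped terms.
    return -np.mean(np.minimum(ratio * advantage, clipped * advantage))

# If the new policy doubles an action's probability (ratio = 2.0) with a
# positive advantage, clipping caps the gain at 1 + eps = 1.2.
loss = ppo_clip_loss(np.log(0.8), np.log(0.4), np.array([1.0]))
```

Because gains beyond the clip range earn no extra credit, the optimizer has no incentive to move the policy far from the previous one in a single update, which is what stabilizes self-play training.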

These methods have been applied successfully to numerous multi-player Atari games, so my hypothesis was that they could be adapted to heads-up poker. In tournament poker, the majority of winnings are concentrated in the winner’s circle, meaning that to make a profit, winning outright matters far more than simply “cashing” for a small payout each time. A large portion of success in heads-up poker comes down to the decision of whether to go all in, so in this simulation the agent had two options: fold or go all in.

The rules of poker dictate a “small blind” and a “big blind” to start the betting: the small blind must put in a set amount of chips, and the big blind must put in double that amount. Then cards are dealt and the players bet. The agents were given only the following parameters: the percentage chance they would win the current hand against a random heads-up player, whether they were first to bet, and how much they had already bet.
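That three-feature observation can be sketched as a small vector; the function name, argument names, and normalization below are my own assumptions for illustration, not details from the original implementation:

```python
def encode_state(win_prob, is_first_to_act, chips_committed, stack_size):
    """Encode the three observations described above.

    win_prob:        equity vs. a random heads-up hand, in [0, 1]
    is_first_to_act: whether this agent posts the small blind / acts first
    chips_committed: chips already bet, normalized here by stack size
    """
    return [win_prob, float(is_first_to_act), chips_committed / stack_size]

obs = encode_state(win_prob=0.62, is_first_to_act=True,
                   chips_committed=10, stack_size=200)
```

Keeping the observation this small makes the fold/all-in decision essentially a threshold on hand equity conditioned on position and committed chips, which is well within reach of a small policy network.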

#poker #policy-gradient #ppo #reinforcement-learning
