Some of the newer reinforcement learning algorithms use a discounted sum of rewards somewhere in their training objective. I implemented an approach that dramatically decreased training time for my machine learning models.


These implementations typically include a tf.scan call that loops through every reward value to accumulate the discounted sum.

import tensorflow as tf

discount_factor = 0.99
# Back-to-front scan over a 1-D `rewards` tensor: agg carries the discounted tail sum.
discounted_sum_rewards = tf.scan(lambda agg, x: discount_factor * agg + x, rewards, reverse=True)

Although this approach works for smaller batch sizes, I have found that it does not scale well, because tf.scan has to walk sequentially through every value in the rewards array to calculate the discounted sum. Additionally, a computer can only represent a finite number of significant figures. Depending on the range of rewards, there may be an opportunity to limit the depth to which the discounted rewards propagate.
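
To see the precision limit concretely, here is a quick illustration of my own using plain float64 arithmetic:

d = 0.99
reward = 1.0
running_sum = 1.0
# 0.99**4000 is about 3.4e-18, far below float64's ~2.2e-16 resolution near 1.0,
# so the discounted reward contributes nothing to the sum:
print(running_sum + reward * d ** 4000 == running_sum)  # True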

The Idea

A value will have a maximum number of bits that can be stored by the computer. Let’s call that n. This means the maximum decimal value, while keeping the exponent at 0, is represented by:

2ⁿ

The number of decimal places that this will occupy can be represented by:

log₁₀(2ⁿ) = n · log₁₀(2)
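
For example, with IEEE 754 doubles (which carry a 53-bit significand), a quick check of the formula:

import math

n = 53  # significand bits in an IEEE 754 double
print(n * math.log10(2))  # ~15.95 significant decimal digits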

If all our rewards were the same value, then that's all we would need, but usually this is not the case. For rewards that differ, we will need to make sure the discounted sum propagates further based on the range of rewards. We will apply the same logic as above and look at the difference in magnitude between the maximum reward, r_max, and the minimum reward, r_min. This could be written as:

log₁₀(r_max) − log₁₀(r_min)

or

log₁₀(r_max / r_min)
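
For instance, with bounds of r_min = 0.01 and r_max = 100 (values chosen purely for illustration), the rewards span four extra orders of magnitude:

import math

r_max, r_min = 100.0, 0.01  # illustrative reward bounds
print(math.log10(r_max / r_min))  # 4.0 extra decimal places to cover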

Lastly, we will need to find how many discounts must be applied to bring that number of decimal places down to 1. For example, let's say we have exactly 8 significant figures, or 10⁸. We would need to reach 10⁻⁸ so that 10⁸ × 10⁻⁸ = 10⁰ = 1. In other words, given a discount of d, we can calculate the number of iterations, k, with:

k = (n · log₁₀(2) + log₁₀(r_max) − log₁₀(r_min)) / log₁₀(1/d)

or

k = log(2ⁿ · r_max / r_min) / log(1/d)

Notice that every value in the fraction is a logarithm, so the base we use doesn't actually matter.
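
Plugging in the assumptions above (float64, the illustrative reward range, and d = 0.99):

import math

n, d = 53, 0.99
r_max, r_min = 100.0, 0.01
p = n * math.log10(2) + math.log10(r_max / r_min)  # ~19.95 decimal places
k = p / math.log10(1 / d)                          # discount steps until contributions vanish
print(math.ceil(k))  # 4572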

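With that bound in hand, one way to speed things up is to truncate the sum at k steps and compute it as a fixed-length convolution, removing the sequential dependency that makes tf.scan slow. Here is a sketch of my own under the assumptions above (float64, d = 0.99, and the illustrative k = 4572), not necessarily the original implementation:

import numpy as np

def truncated_discounted_sum(rewards, d=0.99, k=4572):
    # Discount kernel [1, d, d**2, ...] truncated at k steps; anything past
    # k falls below float64 precision (per the bound above), so dropping it
    # does not change the result.
    rewards = np.asarray(rewards, dtype=np.float64)
    kernel = d ** np.arange(min(k, len(rewards)))
    padded = np.concatenate([rewards, np.zeros(len(kernel) - 1)])
    # For each step t this computes sum_i kernel[i] * rewards[t + i].
    return np.convolve(padded, kernel[::-1], mode="valid")

returns = truncated_discounted_sum(np.ones(100_000))

For kernels this long, scipy.signal.fftconvolve can compute the same convolution in O(N log N) time, and a looser precision target shrinks k even further.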