Neural networks are resource-intensive algorithms: not only do they incur significant computational costs, they also consume a lot of memory.

Even though commercially available computational resources grow day by day, optimizing the training and inference of deep neural networks remains extremely important.

If we run our models in the cloud, we want to minimize infrastructure costs and the carbon footprint. When we run our models at the edge, optimization becomes even more significant: on smartphones or embedded devices, the hardware limitations are immediately apparent.

Since more and more models are moving from servers to the edge, reducing their size and computational complexity is essential. One particularly fascinating technique is quantization, which replaces the floating-point numbers inside the network with integers. In this post, we are going to see why this works and how you can do it in practice.

Quantization

The fundamental idea behind quantization is that if we convert the weights and inputs into integer types, we consume less memory, and on certain hardware the calculations are faster.

However, there is a trade-off: with quantization, we can lose significant accuracy. We will dive into this later, but first let’s see why quantization works.
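To get a feel for the memory savings, here is a minimal sketch in NumPy (the array shape is arbitrary, chosen just for illustration) comparing the storage cost of the same number of parameters in float32 and in int8:

```python
import numpy as np

# A toy "weight matrix" in float32, the usual default precision.
weights_fp32 = np.random.randn(1024, 1024).astype(np.float32)

# The same number of parameters stored as 8-bit integers.
# (How the float values are mapped to int8 is discussed later in the post.)
weights_int8 = np.zeros((1024, 1024), dtype=np.int8)

print(weights_fp32.nbytes)  # 4194304 bytes, i.e. 4 MB
print(weights_int8.nbytes)  # 1048576 bytes, i.e. 1 MB, a 4x reduction
```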

Integer vs floating point arithmetic

As you probably know, you can't simply store numbers in memory as they are, only ones and zeros. So, to keep numbers around and use them for computation, we must encode them.

There are two fundamental representations: integers and floating point numbers.

Integers are stored in their base-2 (binary) form. Depending on the number of bits used, an integer type can have several different sizes. The most important ones are listed below (the snippet after the list prints their ranges):

  • int8 (ranges from -128 to 127),
  • uint8 (ranges from 0 to 255),
  • int16 or short (ranges from -32768 to 32767),
  • uint16 (ranges from 0 to 65535).
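
Assuming NumPy is available, these ranges are easy to verify, and the last line shows the wrap-around behaviour on overflow that quantized code has to keep in mind:

```python
import numpy as np

# Print the ranges of the integer types listed above.
for dtype in (np.int8, np.uint8, np.int16, np.uint16):
    info = np.iinfo(dtype)
    print(f"{dtype.__name__}: {info.min} .. {info.max}")

# Fixed-width integers wrap around on overflow
# (NumPy also emits a RuntimeWarning here).
print(np.int8(127) + np.int8(1))  # -128
```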

If we would like to represent real numbers, we have to give up perfect precision. For example, the number 1/3 can only be written in decimal form as 0.33333…, with infinitely many digits, which cannot be stored in memory. To handle this, floating-point numbers were introduced.
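
A quick way to see this is to look at how 1/3 is actually stored; the snippet below (again assuming NumPy) prints the double and single precision approximations:

```python
import numpy as np

# 1/3 has infinitely many digits in binary too, so every float type
# stores only the nearest representable approximation.
print(f"{np.float64(1) / np.float64(3):.20f}")  # 0.33333333333333331483
print(f"{np.float32(1) / np.float32(3):.20f}")  # 0.33333334326744079590
```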

Essentially, a float is the scientific notation of the number, written in the form

value = (-1)^sign × significand × base^exponent,

where the base is most frequently 2, but it can also be 10. (For our purposes, it doesn't matter, but let's assume it is 2.)

Similarly to integers, there are different types of floats. The most commonly used ones are listed below, and the snippet after the list prints their key constants:

  • half or float16 (1 bit sign, 5 bit exponent, 10 bit significand, so 16 bits in total),
  • single or float32 (1 bit sign, 8 bit exponent, 23 bit significand, so 32 bits in total),
  • double or float64 (1 bit sign, 11 bit exponent, 52 bit significand, so 64 bits in total).
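
```python
import numpy as np

# Bit counts and machine epsilon for the float types listed above.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: {info.bits} bits total, "
          f"{info.iexp} exponent bits, {info.nmant} significand bits, "
          f"eps = {info.eps}")
```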

If you try to add or multiply two numbers together in this scientific format, you can see that float arithmetic is slightly more involved than integer arithmetic. In practice, the speed of each calculation very much depends on the actual hardware. For instance, a modern desktop CPU does float arithmetic about as fast as integer arithmetic. GPUs, on the other hand, are more optimized towards single precision float calculations, since that is the most prevalent type in computer graphics.
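
To see where the extra work comes from, take the sum 1.5 × 2^3 + 1.25 × 2^1. Before the significands can be added, the smaller operand has to be rewritten with the larger exponent, 1.25 × 2^1 = 0.3125 × 2^3, so the sum becomes (1.5 + 0.3125) × 2^3 = 1.8125 × 2^3 = 14.5, and the result may still need to be renormalized and rounded. Adding two integers requires no such alignment step.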

Without being completely precise, it can be said that using int8 is typically faster than float32. However, float32 is used by default for both training and inference in neural networks. (If you have trained a network before and didn't specify the types of the parameters and inputs, it was most likely float32.)

So, how can you convert a network from float32 to int8?
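
As a first intuition, here is a minimal sketch of the core mapping, so-called affine quantization of a single tensor, written in plain NumPy. The helper names and the simple min/max calibration are illustrative choices, not the API of any particular framework; real toolkits quantize layer by layer and also carry out the arithmetic itself in integers.

```python
import numpy as np

def quantize(x):
    """Map a float32 array onto signed 8-bit integers."""
    qmin, qmax = -128, 127
    scale = float(x.max() - x.min()) / (qmax - qmin)     # float step per integer step
    zero_point = int(np.round(qmin - x.min() / scale))   # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximately recover the original float values."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize(x)
x_hat = dequantize(q, scale, zp)
print(np.abs(x - x_hat).max())  # quantization error, roughly scale / 2 at most
```

The scale and the zero point are all that needs to be stored alongside the int8 values to map back and forth between the two representations.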
