Neural networks are very resource-intensive algorithms: they not only incur significant computational costs, they also consume a lot of memory.

Even though commercially available computational resources grow day by day, optimizing the training and inference of deep neural networks remains extremely important.

If we run our models in the cloud, we want to minimize infrastructure costs and carbon footprint. When we run our models on the edge, optimizing the network becomes even more significant: on smartphones or embedded devices, hardware limitations are immediately apparent.

Since more and more models are moving from servers to the edge, reducing size and computational complexity is essential. One particularly fascinating technique is *quantization*, which replaces floating point numbers with integers inside the network. In this post, we are going to see why it works and how you can do it in practice.

The fundamental idea behind quantization is that if we convert the weights and inputs into integer types, we consume less memory and on certain hardware, the calculations are faster.

However, there is a trade-off: with quantization, we can lose significant accuracy. We will dive into this later, but first let’s see *why* quantization works.

As you probably know, you can’t simply store numbers in memory as they are, only ones and zeros. So, to keep numbers and use them for computation, we must encode them.

There are two fundamental representations: *integers* and *floating point numbers.*

**Integers** are represented by their form in the base-2 numeral system. Depending on the number of digits used, an integer can take up several different sizes. The most important are

- *int8* (ranges from -128 to 127),
- *uint8* (ranges from 0 to 255),
- *int16* or *short* (ranges from -32768 to 32767),
- *uint16* (ranges from 0 to 65535).
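These ranges follow directly from the bit widths: a signed n-bit integer spans [-2^(n-1), 2^(n-1) - 1], an unsigned one spans [0, 2^n - 1]. NumPy exposes them via `np.iinfo`:

```python
import numpy as np

# Query the representable range of each integer type.
for dtype in (np.int8, np.uint8, np.int16, np.uint16):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)
```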

If we would like to represent real numbers, we have to give up perfect precision. To give an example, the number 1/3 can be written in decimal form as 0.33333…, with infinitely many digits, which we cannot store exactly. This is what *floating point numbers* are for.

Essentially, a float is the scientific notation of the number, in the form

*significand* × *base*^*exponent*,

where the base is most frequently 2, but can be 10 also. (For our purposes, it doesn’t matter, but let’s assume it is 2.)
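Python’s standard library lets us see this decomposition directly: `math.frexp` splits a float into a significand m and an exponent e with x = m · 2^e and 0.5 ≤ |m| < 1, exactly the base-2 scientific notation described above.

```python
import math

# Decompose 6.0 into significand and base-2 exponent: 6.0 = 0.75 * 2**3.
m, e = math.frexp(6.0)
```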

Similarly to integers, there are different types of floats. The most commonly used are

- *half* or *float16* (1 bit sign, 5 bit exponent, 10 bit significand, so **16 bits in total**),
- *single* or *float32* (1 bit sign, 8 bit exponent, 23 bit significand, so **32 bits in total**),
- *double* or *float64* (1 bit sign, 11 bit exponent, 52 bit significand, so **64 bits in total**).
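NumPy’s `np.finfo` confirms these layouts: it reports the total bit width, the number of significand bits, and the machine epsilon (the gap between 1.0 and the next representable float) for each type.

```python
import numpy as np

# Inspect the layout and precision of each float type.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(dtype.__name__, info.bits, info.nmant, info.eps)
```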

If you try to add or multiply two numbers in scientific format, you can see that float arithmetic is slightly more involved than integer arithmetic. In practice, the speed of each calculation depends very much on the actual hardware. For instance, a modern desktop CPU does float arithmetic as fast as integer arithmetic. GPUs, on the other hand, are optimized for single precision float calculations, since this is the most prevalent type in computer graphics.

Without being completely precise, we can say that *int8* arithmetic is typically faster than *float32*. However, *float32* is the default for training and inference in neural networks. (If you have trained a network before and didn’t specify the types of the parameters and inputs, it was most likely *float32*.)
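Even before any speed considerations, the memory argument is easy to check: the same million weights take a quarter of the space in int8 compared to the default float32.

```python
import numpy as np

# One million parameters in the default float32 vs. quantized int8.
w32 = np.zeros(1_000_000, dtype=np.float32)
w8 = np.zeros(1_000_000, dtype=np.int8)
print(w32.nbytes, w8.nbytes)  # 4 MB vs. 1 MB
```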

So, how can you convert a network from *float32* to *int8*?
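As a taste of what the conversion involves, here is a hypothetical sketch of one layer’s inference after quantization (the `quantize_sym` helper is an assumed name, using symmetric per-tensor quantization with zero point 0): the float matrix product is replaced by an integer matrix product followed by a single rescaling, y ≈ (q_x @ q_w) · s_x · s_w.

```python
import numpy as np

def quantize_sym(x, n_bits=8):
    """Symmetric quantization sketch: map x to int8 with zero point 0."""
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)  # largest value maps to 127
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(42)
x = rng.standard_normal((1, 4)).astype(np.float32)  # layer input
w = rng.standard_normal((4, 3)).astype(np.float32)  # layer weights

q_x, s_x = quantize_sym(x)
q_w, s_w = quantize_sym(w)

# Accumulate in int32 to avoid overflow, then rescale back to float once.
y_int = q_x.astype(np.int32) @ q_w.astype(np.int32)
y_approx = y_int * (s_x * s_w)
y_exact = x @ w
```

The integer matmul is where the hardware savings come from; the cost is that `y_approx` only approximates `y_exact`, which is exactly the accuracy trade-off mentioned earlier.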

#machine-learning #deep-learning #ai #neural-networks #artificial-intelligence
