Problem(s) addressed

The authors open the paper with an interesting analogy to explain the notion that the requirements of training and inference can be very different.

The analogy is that of a larva and its adult form: the nourishment requirements of the two forms are quite different.

We can easily appreciate that during training the priority is to solve the problem at hand, i.e. to learn the parameters of the model, and we end up employing a multitude of techniques and tricks to achieve that goal.

For example, you could:

  • use an ensemble of networks, which is proven to work for many different kinds of problems
  • use dropout to generalize better
  • increase the depth of the network
  • use a larger dataset, etc.
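To illustrate one of these tricks, here is a minimal sketch of inverted dropout in plain numpy (the layer size, drop probability, and random seed are arbitrary choices for this example): units are randomly zeroed during training and the survivors are rescaled, while at inference the activations pass through unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: randomly zero units during training and
    scale the survivors so the expected activation is unchanged."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones(8)
print(dropout(h, p=0.5))           # some units zeroed, survivors scaled to 2.0
print(dropout(h, training=False))  # inference: activations unchanged
```

Note that dropout itself is an example of the training/inference asymmetry the authors describe: the random masking is crucial for learning but disappears entirely at prediction time.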

It is also important to appreciate that during this quest to learn, the mechanics of machine learning are such that we explore various paths which, while crucial for learning, may not be needed during the inference phase. In other words, this extra information can be considered redundant from an inference perspective.

This brings us to the requirements for inference, where, along with accuracy, runtime performance, i.e. the speed of prediction, plays an important role as well.

If your product is not usable because it is slow, then however accurate it is, it will not matter. Usability wins over accuracy in most cases!

The paper addresses the challenge of running models built from a network architecture with far fewer parameters, without sacrificing too much accuracy.

Prior art and its limitations

This is not the first time this problem has been discussed. The notion of training simple networks using the knowledge of a cumbersome model was demonstrated by Rich Caruana et al. in 2006, in a paper titled Model Compression.

A cumbersome model is a model that has a lot of parameters or is an ensemble of models, and is generally difficult to set up and run on devices with limited computing resources.

In this paper, Hinton et al. refer to Model Compression to credit it with proving that it is possible to extract the knowledge from a cumbersome model and transfer it to a simpler model.

In the Model Compression paper, the technique was to minimize the distance between the two models in logit space using a squared-error (RMSE) loss. Hinton et al. argue that they build on that insight and propose a more general solution; in other words, the Model Compression technique of Caruana et al. is a special case of the technique proposed by Hinton et al.
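To make the relationship concrete, here is a minimal numpy sketch (the logits are made up for a single example with three classes) comparing the two losses: matching logits directly with a squared error, as in Model Compression, versus matching softened softmax outputs with a cross-entropy, as in distillation. Here T is a temperature that divides the logits before the softmax to soften the distribution; Hinton et al. show that at high T the gradient of the cross-entropy loss approaches that of logit matching, up to a scaling factor.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T gives a softer distribution."""
    z = z / T
    z = z - z.max()           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([4.0, 1.0, 0.2])  # hypothetical cumbersome model
student_logits = np.array([3.5, 1.3, 0.1])  # hypothetical small model

# Caruana et al.: match the logits directly with a squared-error loss.
mse_loss = np.mean((student_logits - teacher_logits) ** 2)

# Hinton et al.: match softened probabilities with a cross-entropy loss.
T = 5.0
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)
ce_loss = -np.sum(p_teacher * np.log(p_student))

print(mse_loss, ce_loss)
```

This is only a sketch of the two objectives on fixed numbers; in practice each loss would be minimized with respect to the student's parameters over a training set.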

Required background knowledge to understand the key insights

To appreciate the key insights of this paper, you should have a good intuition for, as well as a mathematical understanding of, what the softmax activation function does!

Here I am showing a typical classification network with 3 neurons in the output layer, which means we have 3 classes. The activation function used in the last layer of a typical classification network is the softmax function. For our discussion, it does not matter which activation functions are used in the hidden layers.
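The softmax for this 3-class output layer can be sketched in a few lines of numpy (the logit values below are made up for illustration): it exponentiates each raw score and normalizes, so the outputs are positive and sum to 1.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical raw scores from the 3 output neurons
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print(probs)        # approximately [0.66, 0.24, 0.10]
print(probs.sum())  # 1.0
```

Note that the exponential exaggerates differences between the logits: the largest score wins most of the probability mass, which is exactly the behavior the paper's temperature trick later softens.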


[Paper Summary] Distilling the Knowledge in a Neural Network