An introduction to knowledge distillation. In this post, my goal is to introduce you to the fundamentals of knowledge distillation, an exciting idea based on training a smaller network to approximate a large one.
If you have ever used a neural network to solve a complex problem, you know that they can be enormous in size, containing millions of parameters. For instance, the famous BERT model has about 110 million.
To illustrate the point, this is the number of parameters for the most common architectures in NLP, as summarized in the recent State of AI Report 2020 by Nathan Benaich and Ian Hogarth.
The number of parameters in given architectures. Source: State of AI Report 2020 by Nathan Benaich and Ian Hogarth
In Kaggle competitions, the winning models are often ensembles of several predictors. Although they can beat simple models by a large margin in terms of accuracy, their enormous computational costs make them utterly unusable in practice.
Is there a way to leverage these powerful but massive models to train state-of-the-art models, without scaling up the hardware?
Currently, there are three main methods to compress a neural network while preserving its predictive performance:
weight pruning,
quantization,
and knowledge distillation.
What is Knowledge Distillation?
Let’s imagine a very complex task, such as image classification for thousands of classes. Often, you can’t just slap on a ResNet50 and expect it to achieve 99% accuracy. So, you build an ensemble of models, balancing out the flaws of each one. Now you have a huge model which, although it performs excellently, cannot be deployed into production to serve predictions in a reasonable time.
However, the model generalizes well to unseen data, so it is safe to trust its predictions. (This might not always be the case, but let’s just roll with the thought experiment for now.)
What if we use the predictions from the large and cumbersome model to train a smaller, so-called student model to approximate the big one?
This is knowledge distillation in essence, which was introduced in the paper Distilling the Knowledge in a Neural Network by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.
In broad strokes, the process is the following.
1. Train a large model that performs and generalizes very well. This is called the teacher model.
2. Take all the data you have, and compute the predictions of the teacher model. The dataset together with these predictions is called the knowledge, and the predictions themselves are often referred to as soft targets. This is the knowledge distillation step.
3. Use the previously obtained knowledge to train the smaller network, called the student model.
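The steps above can be sketched numerically. Here is a minimal NumPy sketch of the distillation loss from the Hinton et al. paper: the student is trained on a weighted combination of the teacher's temperature-softened predictions (the soft targets) and the usual hard labels. The function names, the temperature T, and the weight alpha are illustrative choices, not prescribed values.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Softmax with temperature T: higher T yields a softer distribution,
    # exposing the teacher's relative confidence between wrong classes.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=4.0, alpha=0.5):
    # Soft targets: the teacher's softened predictions (the "knowledge").
    soft_targets = softmax(teacher_logits, T)
    soft_preds = softmax(student_logits, T)
    # Cross-entropy between the softened distributions, scaled by T^2
    # (as suggested in the paper) to keep gradient magnitudes comparable.
    soft_loss = -np.sum(soft_targets * np.log(soft_preds + 1e-12)) * T**2
    # Ordinary cross-entropy of the student against the true hard label.
    hard_loss = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example with a 3-class problem (logits are made up for illustration).
teacher = np.array([5.0, 1.0, -2.0])  # confident teacher
student = np.array([2.0, 0.5, -1.0])  # smaller student
loss = distillation_loss(student, teacher, hard_label=0)
```

In practice the same loss is computed batch-wise inside a framework training loop, with the teacher's logits precomputed or evaluated in inference mode.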
To visualize the process, think of the teacher's soft predictions flowing into the student's training loop in place of (or alongside) the hard labels.