TernaryBERT cleverly pieces together existing quantization and distillation techniques. The paper heavily references prior work and is therefore quite dense. The goal of this article is to provide a self-contained walk-through, with additional context where needed.

The ongoing trend of building ever larger models like BERT and GPT-3 has been accompanied by a complementary effort to reduce their size at little or no cost in accuracy. Compact models are built via distillation (Pre-trained Distillation, DistilBERT, MobileBERT, TinyBERT), quantization (Q-BERT, Q8BERT), or parameter pruning.

On September 27, 2020, Huawei introduced TernaryBERT, a model that leverages both distillation and quantization to achieve **accuracy comparable to the original BERT model with a ~15x decrease in size**. What is truly remarkable about TernaryBERT is that its weights are *ternarized*, i.e. each weight takes one of three values: -1, 0, or 1 (and can hence be stored in only two bits).
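To make the two-bit claim concrete, here is a minimal sketch of how ternary values could be packed four to a byte. The functions `pack_ternary` and `unpack_ternary` and the code mapping (-1, 0, 1 stored as 0, 1, 2) are illustrative assumptions, not the storage format used by the paper.

```python
def pack_ternary(weights):
    """Pack a list of values in {-1, 0, 1} into bytes, four weights per byte."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        if w not in (-1, 0, 1):
            raise ValueError(f"not a ternary value: {w}")
        code = w + 1  # assumed mapping: -1, 0, 1 become the 2-bit codes 0, 1, 2
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover `count` ternary values from bytes produced by pack_ternary."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(count)]


weights = [-1, 0, 1, 1, 0, -1, 0, 0, 1]
packed = pack_ternary(weights)            # 9 weights fit in 3 bytes
restored = unpack_ternary(packed, len(weights))
```

Assuming a 32-bit floating-point baseline, two bits per weight would give a 16x reduction on the ternarized weights alone; the ~15x quoted above refers to the size of the whole model.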

