In natural language processing, the state of the art is dominated by large Transformer models, which pose production challenges due to their size.
The 1.5 billion parameter GPT-2, for example, is ~6 GB fully trained and requires GPUs for anything close to real-time latency. Google’s T5 has 11 billion parameters, Microsoft’s Turing-NLG has 17 billion, and GPT-3 has 175 billion.
For more context, I’d recommend Nick Walton’s write-up on the challenges of scaling AI Dungeon, which serves real-time inference from a fine-tuned GPT-2.
Because of these challenges, model optimization is now a prime focus for machine learning engineers. Figuring out how to make these models smaller and faster is a prerequisite for making them widely usable.
In this piece, I’m going to walk through a process for optimizing and deploying Transformer models using Hugging Face’s Transformers, ONNX, and Cortex. For comparison, I’ll be deploying both a vanilla pre-trained PyTorch BERT and an optimized ONNX version as APIs on AWS.
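To make the deployment side concrete before diving in, here is a minimal sketch of the kind of Python predictor Cortex serves as an API. The `PythonPredictor` interface (`__init__`/`predict`) follows Cortex’s Python predictor convention; the `bert-base-uncased` checkpoint and the `{"text": ...}` payload shape are my own assumptions for illustration, not details fixed by this walkthrough.

```python
# predictor.py -- minimal sketch of a Cortex Python predictor serving vanilla BERT.
# The checkpoint and payload shape below are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification


class PythonPredictor:
    def __init__(self, config):
        # Load the model once, when the API starts up.
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.model.eval()

    def predict(self, payload):
        # payload is the parsed JSON body of the request, e.g. {"text": "..."}
        inputs = self.tokenizer(payload["text"], return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Return raw logits; that is enough for benchmarking the vanilla model's latency.
        return outputs[0].tolist()
```

Roughly speaking, a `cortex.yaml` configuration points at a predictor like this one, and `cortex deploy` turns it into an autoscaling API on AWS.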
We’ll start by loading a pre-trained BERT and converting it to ONNX (a sketch of the conversion follows below). Why convert BERT from PyTorch to ONNX? For two reasons:
First, according to benchmarks released by Microsoft, optimizing the original 3-layer BERT with ONNX Runtime produced a 17x speedup in CPU inference.
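For reference, here is a minimal sketch of one way to export a pre-trained BERT to ONNX and sanity-check it with ONNX Runtime. This is not necessarily the exact method used later in the walkthrough; the checkpoint (`bert-base-uncased`), opset version, and file name are my assumptions.

```python
# Minimal sketch: export a pre-trained PyTorch BERT to ONNX, then sanity-check
# the exported graph on CPU with ONNX Runtime. Checkpoint, opset version, and
# file name are illustrative assumptions, not details from this article.
import torch
import onnxruntime as ort
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# The exporter traces the model with a sample input to record the graph.
sample = tokenizer("ONNX export example", return_tensors="pt")
torch.onnx.export(
    model,
    (sample["input_ids"], sample["attention_mask"]),
    "bert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=11,
)

# Run the exported model on CPU with ONNX Runtime.
session = ort.InferenceSession("bert.onnx", providers=["CPUExecutionProvider"])
onnx_inputs = {
    "input_ids": sample["input_ids"].numpy(),
    "attention_mask": sample["attention_mask"].numpy(),
}
print(session.run(None, onnx_inputs))
```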