In natural language processing, the state of the art is dominated by large Transformer models, which pose production challenges due to their size.

The 1.5-billion-parameter GPT-2, for example, is ~6 GB fully trained and requires GPUs for anything close to realtime latency. Google’s T5 has 11 billion parameters. Microsoft’s Turing-NLG has 17 billion. GPT-3 has 175 billion.

For more context, I’d recommend Nick Walton’s write-up on the challenges of scaling AI Dungeon, which serves realtime inference from a fine-tuned GPT-2.

Because of these challenges, model optimization is now a prime focus for machine learning engineers. Figuring out how to make these models smaller and faster is a prerequisite to making them widely usable.

In this piece, I’m going to walk through a process for optimizing and deploying Transformer models using Hugging Face’s Transformers, ONNX, and Cortex. For comparison, I’ll be deploying both a vanilla pre-trained PyTorch BERT and an optimized ONNX version as APIs on AWS.

Optimizing a Transformer model with Hugging Face and ONNX

We’ll start by loading a pre-trained BERT and converting it from PyTorch to ONNX. Why convert? For two reasons:

  • ONNX Runtime’s built-in graph optimizations accelerate Transformer inference more effectively than other popular optimizers, according to published benchmarks.
  • ONNX Runtime supports more efficient quantization (shrinking a model by converting its weights from floating point numbers to lower-precision integers), as sketched below.
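
To make the conversion concrete, here is a minimal sketch of the export using torch.onnx.export. The model name, output file, and opset version are illustrative choices rather than requirements, and a reasonably recent transformers release is assumed:

import torch
from transformers import BertModel, BertTokenizerFast

# Load a pre-trained BERT and its tokenizer from the Hugging Face model hub.
# return_dict=False keeps the outputs as a plain tuple, which the ONNX
# exporter handles cleanly.
model_name = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name, return_dict=False)
model.eval()

# A dummy input is used to trace the computation graph during export
dummy = tokenizer("This is a sample input", return_tensors="pt")

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "bert-base-uncased.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    # Mark batch and sequence dimensions as dynamic so the exported model
    # accepts variable-sized inputs at inference time
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=11,
)

Recent releases of Transformers also ship a convert_graph_to_onnx helper that wraps essentially the same export, if you’d rather not call torch.onnx.export directly.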

According to data released by Microsoft, optimizing the original 3-layer BERT with ONNX yielded a 17x speedup in CPU inference.
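
Once the model is in ONNX format, ONNX Runtime can apply its graph optimizations at load time and quantize the weights offline. Here is a minimal sketch assuming a standard onnxruntime install; the file names simply carry over from the export above and are again illustrative:

import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic
from transformers import BertTokenizerFast

# Dynamic quantization: convert the FP32 weights to INT8, shrinking the model
# on disk and speeding up CPU inference
quantize_dynamic(
    "bert-base-uncased.onnx",
    "bert-base-uncased-quantized.onnx",
    weight_type=QuantType.QInt8,
)

# Load the quantized model with ONNX Runtime's full set of graph optimizations
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession(
    "bert-base-uncased-quantized.onnx",
    sess_options=options,
    providers=["CPUExecutionProvider"],
)

# Quick sanity-check inference on CPU
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoded = tokenizer("ONNX Runtime inference example", return_tensors="np")
outputs = session.run(
    None,  # None returns every model output
    {
        "input_ids": encoded["input_ids"].astype(np.int64),
        "attention_mask": encoded["attention_mask"].astype(np.int64),
    },
)
print(outputs[0].shape)  # (batch, sequence, hidden_size)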

