GPT-3 has dominated the NLP news cycle recently with its borderline magical performance in text generation, but for everyone without $1,000,000,000 of Azure compute credits there are still plenty of ways to experiment with language models on your own. Hugging Face is a company focused on open-source NLP tooling, and it provides one of the easiest ways to access pre-trained models and tokenizers for NLP experiments. In this article, I will share a method for fine tuning the 117M-parameter GPT-2 model on a corpus of Magic the Gathering card flavour texts to create a flavour text generator. Everything will be captured in a Colab notebook, so you can copy and edit it to create generators for your own tasks!


Starting Point

Generative language models require billions of data points and millions of dollars in compute power to train successfully from scratch. For example, GPT-3 cost an estimated $4.6 million and 355 years of compute time to train. However, fine tuning many of these models for custom tasks is well within reach of anyone with access to even a single GPU. For this project we will be using Colab, which comes with many common data science packages pre-installed, including PyTorch, and offers free access to GPU resources.
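Before going further, it is worth confirming that the notebook actually has a GPU attached. A minimal check, assuming a Colab runtime with the GPU hardware accelerator enabled, might look like this:

import torch

# Confirm that the Colab runtime has a GPU attached
# (Runtime -> Change runtime type -> Hardware accelerator -> GPU).
print(torch.cuda.is_available())          # True when a CUDA device is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. a Tesla T4 on a free Colab instance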

First, we will install the Hugging Face transformers library, which will also fetch the excellent (and fast) tokenizers library. Although Hugging Face provides a resource for text datasets in its nlp library, I will be sourcing my own data for this project. If you don’t have a dataset or application in mind, the nlp library is an excellent starting place for easy data acquisition.
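In a Colab cell, the install is a single pip command (a minimal sketch, with version pinning omitted):

# Install the Hugging Face transformers library; the fast tokenizers
# package is pulled in automatically as a dependency.
!pip install transformers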


This will install the Hugging Face transformers library and the tokenizer dependencies.

The Hugging Face libraries give us access to the GPT-2 model along with its pre-trained weights and biases, a configuration class, and a tokenizer that converts each word in our text dataset into a numerical representation to feed into the model for training. Tokenization is important because the models can’t work with text data directly, so the text needs to be encoded into something more manageable. Below is an example of tokenization on some sample text to show what the encoding looks like.
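A minimal sketch of encoding and decoding with the GPT-2 tokenizer follows; the sample sentence is just a made-up, flavour-text-style line for illustration:

from transformers import GPT2Tokenizer

# Load the byte-pair-encoding tokenizer that matches the 117M GPT-2 checkpoint.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

sample = "When the last spell fades, the flavour text remains."
token_ids = tokenizer.encode(sample)

print(token_ids)                                    # numerical ids fed to the model
print(tokenizer.convert_ids_to_tokens(token_ids))   # the sub-word pieces behind each id
print(tokenizer.decode(token_ids))                  # decodes back to the original string

Note that GPT-2 uses byte-pair encoding, so longer or rarer words are split into several sub-word pieces rather than mapped to a single id.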
