Introduction

With the rapid progress in Machine Learning (ML) and Natural Language Processing (NLP), new models can generate text that looks increasingly human-written. One such model, GPT2¹, has been used in many open-source applications². GPT2 was trained on WebText, which contains 45 million outbound links from Reddit (i.e. websites that comments reference). The top 10 outbound domains³ include Google, Archive, Blogspot, GitHub, NYTimes, WordPress, Washington Post, Wikia, BBC, and The Guardian. The pre-trained GPT2 model can be fine-tuned on specific datasets, for example to “acquire” the style of a dataset or to learn to classify documents. This is done via transfer learning, which can be defined as “a means to extract knowledge from a source setting and apply it to a different target setting”⁴. For a detailed explanation of GPT2 and its architecture, see the original paper⁵, OpenAI’s blog post⁶, or Jay Alammar’s illustrated guide⁷.
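
As a rough illustration (not code from this article), loading the pre-trained GPT2 checkpoint with the Transformers library and sampling a continuation looks roughly like the sketch below; the prompt and the sampling parameters are arbitrary choices for demonstration.

```python
# Minimal sketch: load the pre-trained GPT2 checkpoint and sample a continuation.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Morty, we have to"  # arbitrary example prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Top-k / top-p sampling avoids the repetitive output of plain greedy decoding.
output = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```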

Dataset

The dataset used to fine-tune GPT2 consists of the transcripts of the first three seasons of Rick and Morty. I filtered out all dialogue that was not spoken by Rick, Morty, Summer, Beth, or Jerry. The data was downloaded and stored as raw text, where each line is either a speaker with their utterance or an action/scene description. The dataset was split into training and test sets of 6905 and 1454 lines, respectively. The raw files can be found here. The training data is used to fine-tune the model, while the test data is used for evaluation.
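
The exact preprocessing script is not reproduced here; the sketch below assumes each dialogue line starts with “Speaker:”, and the file names (rick_and_morty_transcripts.txt, train.txt, test.txt) as well as the split ratio are placeholders chosen to match the line counts above.

```python
# Rough sketch of the preprocessing described above (file names and format assumed).
MAIN_CHARACTERS = ("Rick", "Morty", "Summer", "Beth", "Jerry")

def keep_line(line: str) -> bool:
    """Keep a line only if it is spoken by one of the main characters."""
    speaker = line.split(":", 1)[0].strip()
    return speaker in MAIN_CHARACTERS

with open("rick_and_morty_transcripts.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

filtered = [line for line in lines if keep_line(line)]

# Hold out the tail of the corpus for evaluation (roughly matching the
# 6905 / 1454 train/test split mentioned above).
split = int(len(filtered) * 0.83)
train_lines, test_lines = filtered[:split], filtered[split:]

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(train_lines))
with open("test.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(test_lines))
```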

Training the model

Hugging Face’s Transformers library provides a simple script for fine-tuning a custom GPT2 model. You can fine-tune your own model using this Google Colab notebook. Once training has finished, make sure you download the output folder containing all relevant model files; this is essential for loading the model later. You can also upload your custom model to Hugging Face’s Model Hub⁸ to make it publicly accessible. The fine-tuned model achieves a perplexity of about 17 on the test data.
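
The notebook takes care of the actual fine-tuning; as a hedged sketch of the evaluation step, the downloaded output folder can be reloaded like any pretrained checkpoint and perplexity estimated as the exponential of the mean cross-entropy loss over the test file. The folder path, test file name, and chunk size below are illustrative, not taken from the article.

```python
# Sketch: reload the fine-tuned model from its output folder and estimate
# perplexity on the held-out test lines.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("output/")   # downloaded output folder (placeholder path)
tokenizer = GPT2Tokenizer.from_pretrained("output/")
model.eval()

with open("test.txt", encoding="utf-8") as f:
    text = f.read()

# Score the test text in fixed-size chunks; perplexity = exp(mean loss).
encodings = tokenizer(text, return_tensors="pt")
max_length = 512
losses = []
with torch.no_grad():
    for i in range(0, encodings.input_ids.size(1), max_length):
        chunk = encodings.input_ids[:, i : i + max_length]
        if chunk.size(1) < 2:
            continue
        outputs = model(chunk, labels=chunk)
        losses.append(outputs.loss.item())

print(f"Perplexity: {math.exp(sum(losses) / len(losses)):.2f}")
```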

