Outline

  • Recap of major takeaways from Part 1
  • Introduction to GPT-2
  • Build a simple language model
  • Scrape Twitter data
  • Specify the architecture for GPT
  • Introduce a BytePairEncoder

In the first part of this tutorial series, we talked about language models in general: the statistical nature of language, and why common statistical models cannot capture the information needed to process language reliably.

Some basic knowledge from part 1 is necessary to follow this session, so I’d recommend checking it out if you need a primer.

Basic Knowledge

  • A transformer model is an encoder-decoder model that makes use of attention mechanisms and linear layers for its operations.
  • Positional encoding is used to infuse the idea of position into the model.
  • Subsequent masking is used to prevent the decoder from foreseeing future words.
  • Self-attention in the decoder lets the model learn how the words in a sentence depend on each other.

These pieces of knowledge are the primary building blocks we’ll rely on in this article; a minimal sketch of two of them follows below.
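
Since positional encoding and subsequent masking come up repeatedly, here is a minimal sketch of both. It assumes PyTorch as the framework (an assumption for illustration, not the tutorial’s own code):

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from the original transformer paper:
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (seq_len, d_model); added to the token embeddings

def subsequent_mask(size):
    # True where attention is allowed: each position can attend to itself and
    # to earlier positions, but never to the "future" words it must predict.
    future = torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)
    return ~future

print(positional_encoding(4, 8).shape)  # torch.Size([4, 8])
print(subsequent_mask(4))               # lower-triangular boolean matrix
```

The subsequent mask is exactly what turns a transformer decoder into a word predictor: during training, each position only ever sees the words before it.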

In this session, we’ll move from the sequence-to-sequence transformer model to a word-predictive model: GPT-2.

What are word-predictive models? In part one, we also discussed how to build a model that predicts the next word, but we approached the idea from the neural machine translation (NMT) perspective.

The best way to illustrate a word-predictive model is the smart keyboard on your phone, which suggests the next word as you type.
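
To make that analogy concrete, here is a sketch of next-word prediction with a pretrained GPT-2. It assumes the Hugging Face transformers library and the public "gpt2" checkpoint, neither of which is part of the model we build in this series; it only demonstrates the behaviour we’re aiming for:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Pretrained GPT-2, used purely as a demo of "suggest the next word".
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "I am going to the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# Scores over the whole vocabulary for the position right after the prompt.
next_token_logits = logits[0, -1]
top5 = torch.topk(next_token_logits, k=5).indices

# Prints five candidate continuations, much like a keyboard's suggestion strip.
print([tokenizer.decode([int(idx)]) for idx in top5])
```

GPT-2 assigns a score to every vocabulary item for the next position; taking the top candidates (or sampling from them) gives exactly the keyboard-style suggestions described above.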


From Transformers to GPT-2