Outline

  • Recap of major takeaways from Part 1
  • Introduction to GPT-2
  • Build a simple language model
  • Scrape Twitter data
  • Specify the architecture for GPT
  • Introduce a BytePairEncoder

In the first part of this tutorial series, we talked about language models in general: the statistical nature of language, and why common statistical models cannot capture the information needed to process language reliably.

Some basic knowledge from part 1 is necessary to follow this session, so I’d recommend checking it out if you need a primer.

Basic Knowledge

  • A transformer model is an encoder-decoder model that makes use of attention mechanisms and linear layers for its operations.
  • Positional encoding is used to infuse the idea of position into the model.
  • Subsequent masking is used to prevent the decoder from foreseeing future words.
  • Self-attention in the decoder lets the model learn how the words in a sentence depend on each other.

These pieces of knowledge are the primary building blocks we’ll rely on in this article; a minimal sketch of two of them follows below.
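
Since positional encoding and subsequent masking come up repeatedly, here is a minimal sketch of both. It assumes PyTorch as the framework (an assumption for illustration, not the tutorial’s own code):

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from the original transformer paper:
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (seq_len, d_model); added to the token embeddings

def subsequent_mask(size):
    # True where attention is allowed: each position can attend to itself and
    # to earlier positions, but never to the "future" words it must predict.
    future = torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)
    return ~future

print(positional_encoding(4, 8).shape)  # torch.Size([4, 8])
print(subsequent_mask(4))               # lower-triangular boolean matrix
```

The subsequent mask is exactly what turns a transformer decoder into a word predictor: during training, each position only ever sees the words before it.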

In this session, we’ll move from the sequence-to-sequence transformer model to a word-predictive model: GPT-2.

What are word-predictive models? In part one, we also discussed how to build a model that predicts the next word, but we approached the idea from the neural machine translation (NMT) perspective.

The best way to illustrate a word-predictive model is the smart keyboard on your phone, which suggests the next word as you type.
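
To make that analogy concrete, here is a sketch of next-word prediction with a pretrained GPT-2. It assumes the Hugging Face transformers library and the public "gpt2" checkpoint, neither of which is part of the model we build in this series; it only demonstrates the behaviour we’re aiming for:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Pretrained GPT-2, used purely as a demo of "suggest the next word".
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "I am going to the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# Scores over the whole vocabulary for the position right after the prompt.
next_token_logits = logits[0, -1]
top5 = torch.topk(next_token_logits, k=5).indices

# Prints five candidate continuations, much like a keyboard's suggestion strip.
print([tokenizer.decode([int(idx)]) for idx in top5])
```

GPT-2 assigns a score to every vocabulary item for the next position; taking the top candidates (or sampling from them) gives exactly the keyboard-style suggestions described above.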


From Transformers to GPT-2