Feel the burn …

The existing resources for GPT-2’s architecture are very good, but they are written for researchers, so I will provide you with a tailored concept map of all the areas you will need to know before jumping in.

Areas that the reader should already know, i.e. areas I won’t specify the resource for:

1. Linear Algebra — specifically matrix multiplication, vectors, and projections from one space onto another space with fewer dimensions
2. Statistics — specifically probability distributions
3. General Machine Learning Concepts — specifically supervised learning and unsupervised learning
4. Neural Nets — general information about how each part works
5. Training Neural Nets — general information about how training works, specifically gradient descent, optimizers, backpropagation, and updating weights

Areas that the reader should learn before proceeding, i.e. areas I will specify a resource for:

1. **Activation Functions** (1/2 hr) — For each neuron, given an input and a weight (something applied to that input), there should be a way to decide whether the neuron fires or not. The current best activation functions are softmax (typically used only for the output layer), ReLU, and Swish, because their gradients are efficient to compute.
2. **Softmax Function** (1 hr) — The softmax function maps a vector of real numbers to a probability distribution. On the linked page, look at the intro and examples sections.
3. **Normalization** (1 hr) — The act of controlling the mean and variance of activations to make learning (training) more effective, though the exact mechanics are not well understood. The intuition is that normalization makes the loss surface smoother and thus easier to navigate in a consistent way. There are different types of normalization, including batch, layer, instance, and group normalization. The transformer architecture uses layer normalization.
4. **Cross-Entropy Loss** (1/2 hr) — Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label: predicting a probability of 0.012 when the actual label is 1 would result in a high loss value. A perfect model would have a log loss of 0.
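To make the activation-function idea concrete, here is a minimal sketch of ReLU and Swish in plain Python. Note this is an illustration, not any library's implementation; the `beta` parameter is the common Swish scaling term, with `beta=1.0` as an assumed default.

```python
import math

def relu(x):
    # ReLU: pass positive inputs through unchanged, zero out negatives
    return max(0.0, x)

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x) -- a smooth alternative to ReLU
    return x / (1.0 + math.exp(-beta * x))
```

Both are cheap to evaluate and have simple gradients, which is the efficiency property mentioned above.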
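Softmax and cross-entropy loss fit together naturally: softmax turns raw scores (logits) into a probability distribution, and cross-entropy scores that distribution against the true label. A minimal sketch in plain Python:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, then
    # exponentiate and normalize so the outputs sum to 1.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label_index):
    # Negative log of the probability assigned to the true class:
    # -log(0.012) is about 4.42 (high loss), -log(1.0) is 0 (perfect).
    return -math.log(probs[label_index])
```

For example, `cross_entropy(softmax([2.0, 1.0, 0.1]), 0)` is small because the model already puts most of its probability mass on class 0.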
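Layer normalization, the variant the transformer uses, can be sketched as follows: each example's feature vector is rescaled to zero mean and unit variance. This is a simplified illustration; the version used in real transformer implementations also applies learnable gain and bias parameters, which are omitted here.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one example's features to zero mean and unit variance.
    # eps guards against division by zero when the variance is tiny.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]
```

Unlike batch normalization, the statistics are computed across the features of a single example, so the result does not depend on what else is in the batch.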

#loss-function #machine-learning #self-attention #gpt-2