The existing resources for GPT-2’s architecture are very good, but they are written for researchers, so I will provide you with a tailored concept map covering the areas you will need to know before jumping in.

Feel the burn …


Areas the reader should already know, i.e., areas for which I won’t point to a resource:

- Linear Algebra — specifically matrix multiplication, vectors, and projections from one space onto another space with fewer dimensions
- Statistics — specifically probability distributions
- General Machine Learning Concepts — specifically supervised learning and unsupervised learning
- Neural Nets — general information about how each part works
- Training Neural Nets — general information about how training works, specifically gradient descent, optimizers, backpropagation, and updating weights

Areas the reader should learn before proceeding, i.e., areas for which I will point to a resource:

- **Activation Functions (1/2 hr)** — For each neuron, given an input and a weight (something to do to that input), there needs to be a way to decide whether the neuron fires or not. The current best activation functions are softmax (typically used only for the output layer), ReLU, and Swish, because their gradients are efficient to compute.
- **Softmax Function (1 hr)** — The softmax function maps a vector of real numbers to a probability distribution. On the linked page, look at the intro and the examples sections.
- **Normalization (1 hr)** — The act of controlling the mean and variance of activations to make learning (training) more effective, though the exact mechanics are not well understood. The intuition is that it makes the loss surface smoother and thus easier to navigate in a consistent way. There are different types of normalization, including batch, layer, instance, group, and others. The transformer architecture uses layer normalization.
- **Cross-Entropy Loss (1/2 hr)** — Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label, so predicting a probability of .012 when the actual label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
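To make these concepts concrete, here is a minimal NumPy sketch of softmax, layer normalization, and cross-entropy loss. This is an illustration of the math only, not GPT-2’s actual implementation; the function names and the toy logits are my own:

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability;
    # the result is a vector of non-negative values that sums to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    # Rescale a vector to (approximately) zero mean and unit variance.
    # Real layer norm also applies learned scale and shift parameters,
    # which are omitted here for simplicity.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def cross_entropy(probs, label):
    # Negative log of the probability assigned to the true class:
    # near 0 when the model is confident and correct, large otherwise.
    return -np.log(probs[label])

logits = np.array([2.0, 1.0, 0.1])   # raw model outputs for 3 classes
probs = softmax(logits)              # a probability distribution
loss = cross_entropy(probs, 0)       # small, since class 0 is favored
```

Note how the pieces compose: the network produces raw logits, softmax turns them into a distribution, and cross-entropy scores that distribution against the true label, which is exactly the output path a transformer language model uses.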
