The Transformer architecture has been the cornerstone of many of the latest state-of-the-art NLP models. It relies mainly on a mechanism called attention and, unlike the successful models that came before it, involves no convolutional or recurrent layers whatsoever.

If you’re new to this model, chances are you won’t find the architecture the easiest thing to understand. If that’s the case, I hope this article can help.

We’ll start by explaining how a regular encoder-decoder network works and the difficulties it may encounter, then look at what an attention mechanism is used for in a regular encoder-decoder architecture, and finally at how attention is used in Transformers.

An Encoder-Decoder Network for Neural Machine Translation

Encoder-Decoder Architecture. Image source.

The image above shows an Encoder-Decoder architecture, with both components composed of recurrent layers.

On the left, the encoder takes in the input sentence, with each word represented by its embedding, and is expected to output a good summary of the input. This summary is known as the context vector (the arrow connecting the two components) and is fed to the decoder as its initial state.
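To make this concrete, here is a minimal sketch of such an encoder in PyTorch. The layer sizes and the choice of a GRU are illustrative assumptions, not details taken from the figure: the point is simply that the final hidden state plays the role of the context vector.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Minimal recurrent encoder: embeds the source tokens and
    returns the final hidden state as the context vector."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        embedded = self.embedding(src_tokens)     # (batch, src_len, emb_dim)
        _, hidden = self.rnn(embedded)            # hidden: (1, batch, hidden_dim)
        return hidden                             # the "context vector"
```

For example, `Encoder(vocab_size=1000)(torch.randint(0, 1000, (2, 7)))` compresses a batch of two 7-word sentences into a single hidden state each.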

The decoder, on the right, is in charge of outputting the translation, one word per step. During training, it takes the target sentence as input. When making a prediction, it feeds itself with its own output from the previous step (as shown here).
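A matching decoder sketch, continuing the assumptions above (same GRU size, a greedy decoding loop, and a hypothetical `sos_id` start token), shows both behaviors: during training the whole target sentence is fed in at once (teacher forcing), while at prediction time each step consumes the previous step’s output.

```python
class Decoder(nn.Module):
    """Minimal recurrent decoder: starts from the encoder's context
    vector and emits one target token per step."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_tokens, context):
        # Training (teacher forcing): feed the whole target sentence at once,
        # with the context vector as the initial hidden state.
        embedded = self.embedding(tgt_tokens)        # (batch, tgt_len, emb_dim)
        outputs, _ = self.rnn(embedded, context)
        return self.out(outputs)                     # (batch, tgt_len, vocab_size)

    @torch.no_grad()
    def generate(self, context, sos_id, max_len=20):
        # Inference: feed the decoder its own prediction from the previous step.
        batch = context.size(1)
        token = torch.full((batch, 1), sos_id, dtype=torch.long)
        hidden, generated = context, []
        for _ in range(max_len):
            emb = self.embedding(token)              # (batch, 1, emb_dim)
            output, hidden = self.rnn(emb, hidden)
            token = self.out(output).argmax(dim=-1)  # greedy choice of next word
            generated.append(token)
        return torch.cat(generated, dim=1)           # (batch, max_len)
```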

This architecture is preferred over a simpler sequence-to-sequence RNN because the context vector from the encoder provides more direct access to the input sentence as a whole. Consequently, the decoder gets to look at the entire sentence via the context vector before outputting a single word of the translation, while a regular sequence-to-sequence RNN only has access to the words located before the current time step.

In recurrent layers, information is passed from time step to time step via hidden states, and it is gradually ‘forgotten’ as more time steps are performed. When encoding a long sequence, the resulting context vector is likely to have lost much of the information about the first words of the sentence. The same applies to the decoder: if the sequence is too long, the information contained in the context vector won’t make it down to the last few time steps.

