Hi and welcome to the second step of the Language Modeling Path, a series of articles from Machine Learning Reply covering the most important milestones that brought to life the large language models, such as BERT and GPT-3, that can imitate and (let’s say) understand human language.

In this article we are going to talk about **attention** layers. This architectural trick was first applied in the computer vision field [1], but here we will focus only on Neural Natural Language Processing, and in particular on sequence-to-sequence applications for Neural Machine Translation (NMT). While this article is based on two papers ([2] and [3]), attention is a widespread technique and you can find it explained in many other places if you need to go deeper.

To better understand this chapter, it is strongly suggested to have a good grasp of encoder-decoder sequence-to-sequence models. If you need a refresher on the key concepts of those architectures, you can start from the first chapter of our Language Modeling Path.

Attention in the real world

The introduction of the attention mechanism in common seq2seq applications allows the processing of longer and more complex sentences. As we anticipated, the basic insight behind this trick was born in the **computer vision** field and was later developed around natural language for Neural Machine Translation (NMT) applications. In this article we will focus on NMT simply because it is the natural habitat of the algorithm; nevertheless, the family of attention-based models (models that rely on this particular architectural pattern) counts among its ranks many state-of-the-art models in most Natural Language Processing application fields.
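To give a concrete feeling of what the mechanism computes, here is a minimal NumPy sketch of the general idea behind dot-product attention, one of the scoring variants discussed in [3]. The toy dimensions and random vectors are placeholders for illustration, not the actual NMT setup of the papers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical toy dimensions: 4 source tokens, hidden size 3.
encoder_states = np.random.randn(4, 3)   # one vector per source token
decoder_state = np.random.randn(3)       # current decoder hidden state

# Alignment scores between the decoder state and every encoder state.
scores = encoder_states @ decoder_state

# Attention weights: a probability distribution over the source tokens.
weights = softmax(scores)

# Context vector: weighted sum of the encoder states, combined with the
# decoder state to predict the next target word.
context = weights @ encoder_states

print(weights, context)
```

The key point is that the decoder no longer has to squeeze the whole source sentence into a single fixed vector: at every decoding step it recomputes the weights and looks back at the most relevant source positions.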

This is because the attention mechanism is the key that lets Google Assistant and Amazon Alexa understand our intentions even when we use more than a single simple sentence to express them.

It can also give a boost in accuracy in all applications that require text embeddings. Here is a brief (incomplete) list of topics where we were able to observe its improvements with respect to non-attention-based models:

  • Document retrieval
  • Text classification
  • Text clustering
  • Text similarity
  • Personalized search engines
  • Text generation

Moreover, the attention mechanism gives a more detailed insight into which parts of the input had the highest impact on the decision made by our model. This is a huge advantage in a production environment, as it makes the black-box neural network a little less black.
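As a rough illustration of this interpretability benefit, the sketch below prints a toy attention-weight matrix as an alignment between hypothetical source and target tokens; the words and numbers are made up for the example, not taken from a trained model.

```python
import numpy as np

# Hypothetical attention weights collected during decoding:
# one row per generated target token, one column per source token.
src = ["the", "cat", "sat", "down"]
tgt = ["le", "chat", "s'est", "assis"]
weights = np.array([
    [0.86, 0.07, 0.04, 0.03],
    [0.05, 0.88, 0.04, 0.03],
    [0.04, 0.05, 0.61, 0.30],
    [0.03, 0.04, 0.35, 0.58],
])

# For each generated word, show which input word the model attended to most.
for i, word in enumerate(tgt):
    j = int(np.argmax(weights[i]))
    print(f"{word:>6} <- {src[j]} ({weights[i, j]:.2f})")
```

Inspecting (or plotting as a heatmap) this kind of matrix is what lets us explain a single prediction to a stakeholder instead of pointing at an opaque score.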

