Pre-trained language models have taken over a majority of tasks in NLP. The 2017 paper, “Attention Is All You Need”, which proposed the Transformer architecture, changed the course of NLP. Building on it, several architectures such as BERT and OpenAI GPT evolved by leveraging self-supervised learning.

In this article, we discuss BERT (Bidirectional Encoder Representations from Transformers), which was proposed by Google AI in the paper, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. It is one of the groundbreaking models that achieved state-of-the-art results on many downstream tasks and is widely used.

Overview

BERT pre-training and fine-tuning tasks, from the paper (we will cover the architecture and specifications in the coming sections; for now, just observe that the same architecture is transferred to the fine-tuning tasks with minimal changes to the parameters).

BERT leverages a fine-tuning-based approach to applying pre-trained language models: a common architecture is pre-trained on a relatively generic task and then fine-tuned on specific downstream tasks that are more or less similar to the pre-training task.

To achieve this, the BERT paper proposes 2 pre-training tasks:

  1. Masked Language Modeling (MLM)
  2. Next Sentence Prediction (NSP)

and fine-tuning on downstream tasks such as:

  1. Sequence Classification
  2. Named Entity Recognition (NER)
  3. Natural Language Inference (NLI) or Textual Entailment
  4. Grounded Common Sense Inference
  5. Question Answering (QnA)

We will discuss each of these in depth in the coming sections of this article.
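As a quick, concrete taste of the first pre-training task, the short sketch below runs a pre-trained BERT checkpoint as a masked language model. It uses the Hugging Face transformers library and its fill-mask pipeline with the public bert-base-uncased checkpoint; these are assumptions of this example, not something prescribed by the paper.

```python
# Minimal sketch of Masked Language Modeling (MLM) with the Hugging Face
# `transformers` library (an assumption of this example, not part of the paper).
from transformers import pipeline

# Load a public pre-trained BERT checkpoint behind a fill-mask pipeline.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK] using context from BOTH sides.
for prediction in unmasker("The capital of France is [MASK]."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```

The top prediction should be a sensible completion such as “paris”, precisely because the model can look at the words on both sides of the [MASK] token.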

BERT Architecture

“BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al.”

— BERT Paper

I have already covered the Transformer architecture in this post. Consider giving it a read if you are interested in learning more about the Transformer.
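If you want to poke at this encoder stack in code, here is a minimal sketch that instantiates a randomly initialized BERT model. It again assumes the Hugging Face transformers library (not part of the paper itself); the library’s default BertConfig corresponds to the BERT-Base sizes.

```python
# Minimal sketch: instantiate a randomly initialized BERT encoder with the
# Hugging Face `transformers` library (an assumption of this example).
from transformers import BertConfig, BertModel

config = BertConfig()      # library defaults match BERT-Base
model = BertModel(config)  # a stack of bidirectional Transformer encoder layers

# 12 encoder layers, hidden size 768, 12 attention heads per layer
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```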

To elaborate on the BERT-specific architecture, we will compare the encoder and the decoder of the Transformer:

  • **The Transformer Encoder** is essentially a bidirectional self-attentive model that uses all the tokens in a sequence to attend to each token in that sequence,

i.e. for a given word, attention is computed using all the words in the sentence, and not just the words preceding it in a left-to-right or right-to-left traversal order (a toy sketch contrasting the two follows below).
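To make “bidirectional” concrete, here is a toy sketch in plain PyTorch (not BERT’s actual implementation) that contrasts encoder-style attention, where every token attends to every token, with decoder-style causal attention, where a mask blocks attention to future positions.

```python
# Toy sketch (not BERT's actual code): scaled dot-product self-attention
# over a short sequence, with and without a causal (left-to-right) mask.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 4, 8                # 4 toy tokens, 8-dimensional embeddings
x = torch.randn(seq_len, d)      # pretend these are token embeddings
q, k, v = x, x, x                # single head, no learned projections

scores = q @ k.T / d ** 0.5      # (seq_len, seq_len) attention logits

# Encoder (BERT-style): every token attends to every position, left AND right.
bidirectional = F.softmax(scores, dim=-1)

# Decoder-style (e.g. GPT): a causal mask blocks attention to future positions.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
causal = F.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print(bidirectional[1])  # token 1 has non-zero weights on all 4 positions
print(causal[1])         # token 1 only attends to positions 0 and 1
```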
