Pre-trained language models have taken over a majority of tasks in NLP. The 2017 paper, “Attention Is All You Need”, which proposed the Transformer architecture, changed the course of NLP. Building on it, architectures such as BERT and OpenAI GPT evolved by leveraging self-supervised learning.
In this article, we discuss BERT (Bidirectional Encoder Representations from Transformers), which was proposed by Google AI in the paper, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. It is one of the groundbreaking models that achieved state-of-the-art results on many downstream tasks and is widely used.
BERT pre-training and fine-tuning tasks, from the paper. (We will cover the architecture and specifications in the coming sections; for now, just observe that the same architecture is transferred to the fine-tuning tasks with minimal changes in the parameters.)
BERT leverages a fine-tuning-based approach to applying pre-trained language models; i.e. a common architecture is trained on a relatively generic task, and then it is fine-tuned on specific downstream tasks that are more or less similar to the pre-training task.
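To make this pattern concrete, here is a minimal, framework-free sketch (all names and dimensions are illustrative, not from the paper): a shared “encoder” stands in for the pre-trained model, and fine-tuning only attaches a small task-specific head on top of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in for a pre-trained encoder: fixed weights learned on a generic task.
W_encoder = rng.standard_normal((8, 4))  # maps 8-dim inputs to 4-dim representations

def encode(x):
    """Shared representation reused by every downstream task."""
    return np.tanh(x @ W_encoder)

# Fine-tuning reuses the SAME encoder and only attaches a new, small head.
W_head_sentiment = rng.standard_normal((4, 2))   # e.g. a 2-class sentiment head
W_head_ner = rng.standard_normal((4, 9))         # e.g. a 9-tag entity head

x = rng.standard_normal((3, 8))                  # a batch of 3 examples
sentiment_logits = encode(x) @ W_head_sentiment  # shape (3, 2)
ner_logits = encode(x) @ W_head_ner              # shape (3, 9)

print(sentiment_logits.shape, ner_logits.shape)
```

The point of the sketch is that the expensive, generic part (the encoder) is shared across tasks, while the cheap, task-specific part (the head) is the only piece that changes.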
To achieve this, the BERT paper proposes two pre-training tasks:

- Masked Language Modeling (MLM)
- Next Sentence Prediction (NSP)
and fine-tuning on downstream tasks such as:

- natural language inference (e.g. the GLUE benchmark)
- question answering (e.g. SQuAD)
- sentence-pair and single-sentence classification
We will discuss these in-depth in the coming sections of this article.
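As a preview of the first pre-training task, masked language modeling hides a random subset of input tokens and trains the model to recover them. Below is a minimal sketch of just the masking step; the 15% rate is from the paper, but the whitespace tokenization and function names here are purely illustrative.

```python
import random

MASK_RATE = 0.15  # the paper masks 15% of input tokens

def mask_tokens(tokens, rng):
    """Replace ~15% of tokens with [MASK]; return masked tokens and targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_RATE:
            masked.append("[MASK]")
            targets[i] = tok  # the model must predict the original token here
        else:
            masked.append(tok)
    return masked, targets

rng = random.Random(42)
tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, rng)
print(masked)
print(targets)
```

Note that the full BERT recipe is slightly more involved: of the selected tokens, 10% are left unchanged and 10% are replaced with a random token; this sketch omits that refinement.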
> BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al.
>
> — BERT Paper
I have already covered the Transformer architecture in this post. Consider giving it a read if you are interested in learning more about the Transformer.
To elaborate on the BERT-specific architecture, let us compare the encoder and the decoder of the Transformer: BERT uses the encoder, whose self-attention is bidirectional; i.e. for a given word, attention is computed using all the words in the sentence, and not just the words preceding it in a left-to-right or right-to-left traversal order.
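This difference is easy to visualize as attention masks: an encoder-style (bidirectional) mask lets every position attend to every other position, while a decoder-style (causal) mask restricts each position to itself and earlier positions. A small numpy illustration (not BERT’s actual code):

```python
import numpy as np

seq_len = 4

# Encoder-style (bidirectional): every token may attend to every token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

# Decoder-style (causal): token i may only attend to tokens 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print(bidirectional_mask)
print(causal_mask)
```

Row i of each matrix marks which positions token i may attend to; the zeros above the diagonal in the causal mask are exactly the “future” words a decoder is forbidden to see.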