Introduction

BART is a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.

From the paper: _BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension_

Don’t worry if that sounds a little complicated; we are going to break it down and see what it all means. To add a little bit of background before we dive into BART, it’s time for the now-customary ode to Transfer Learning with self-supervised models. It’s been said many times over the past couple of years, but Transformers really have achieved incredible success in a wide variety of Natural Language Processing (NLP) tasks.

BART uses a standard Transformer (encoder-decoder) architecture, like the original Transformer model used for neural machine translation, while also incorporating changes inspired by BERT (which uses only the encoder) and GPT (which uses only the decoder). You can refer to section _2.1 Architecture_ of the BART paper for more details.
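To get a concrete feel for this encoder-decoder structure, here is a minimal sketch that loads a pre-trained BART checkpoint and inspects its two halves. It assumes the Hugging Face `transformers` library and the publicly available `facebook/bart-base` checkpoint (the paper's experiments use the larger `bart-large`):

```python
from transformers import BartModel

# Load a pre-trained BART checkpoint (bart-base is assumed here for speed)
model = BartModel.from_pretrained("facebook/bart-base")

# BART is a standard encoder-decoder Transformer:
# a bidirectional encoder (BERT-like) feeding an autoregressive decoder (GPT-like)
print(type(model.get_encoder()).__name__)  # BartEncoder
print(type(model.get_decoder()).__name__)  # BartDecoder
print(model.config.encoder_layers, model.config.decoder_layers)  # 6 6 for bart-base
```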

Pre-Training BART

BART is pre-trained by minimizing the cross-entropy loss between the decoder output and the original sequence.
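As a rough illustration of this objective, the sketch below (again assuming the Hugging Face `transformers` library; the two sentences are made-up examples, not data from the paper) feeds a corrupted sentence to the encoder and passes the original sentence as the labels. The loss returned by the model is exactly this cross-entropy between the decoder output and the original sequence:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Hypothetical corrupted input: a span has been replaced with BART's <mask> token
corrupted = "The quick brown <mask> over the lazy dog."
original = "The quick brown fox jumps over the lazy dog."

inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

# Passing labels makes the model return the cross-entropy between
# the decoder's predictions and the original (uncorrupted) sequence
outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=labels,
)
print(outputs.loss)  # reconstruction (cross-entropy) loss
```

During pre-training, this loss is minimized over huge amounts of corrupted text; here we only compute it once to show where it comes from.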

Masked Language Modeling (MLM)

MLM models such as BERT are pre-trained to predict masked tokens. This process can be broken down as follows:

  1. Replace a random subset of the input tokens with a _mask token_ [MASK]. (Adding noise/corruption)
  2. The model predicts the original tokens for each of the [MASK] tokens. (Denoising)

Importantly, BERT models can “see” the full input sequence (with some tokens replaced with [MASK]) when attempting to predict the original tokens. This makes BERT a bidirectional model, i.e. it can “see” the tokens before and after the masked tokens.
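For comparison, here is a minimal sketch of this MLM setup, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint. BERT sees the full sentence with one token masked and predicts what belongs in that position:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One token is replaced with [MASK] (the noising step)
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take the most likely token (the denoising step)
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # typically "paris"
```

Note that the prediction for the masked position can attend to tokens on both sides of the mask, which is exactly the bidirectionality described above.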

