ELECTRA: Pre-Training Text Encoders as Discriminators rather than Generators. What is the difference between ELECTRA and BERT?

BERT (Devlin et al., 2018) has become the standard baseline for NLP tasks. Many newer models build on the BERT architecture, such as RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019). ELECTRA (Clark et al., 2020) aims to reduce computation time and resources while maintaining high-quality performance. The trick is to introduce a generator for Masked Language Model (MLM) prediction and forward the generator's output to a discriminator.

MLM is one of the training objectives in BERT (Devlin et al., 2018). However, it has been criticized for the mismatch between the pre-training phase and the fine-tuning phase: the [MASK] token appears during pre-training but never during fine-tuning. In short, MLM replaces some tokens with [MASK] and the model predicts the original words in order to learn word representations. ELECTRA (Clark et al., 2020), on the other hand, contains two models, a generator and a discriminator. The masked tokens are sent to the generator, which produces alternative inputs for the discriminator (i.e. the ELECTRA model); the discriminator then predicts, for every token, whether it is the original or a replacement. After the training phase, the generator is thrown away and only the discriminator is kept for fine-tuning and inference.

Clark et al. named this method replaced token detection. In the following sections, we will cover how ELECTRA (Clark et al., 2020) works.
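
As a rough illustration of the data flow described above, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (mask_tokens, toy_generator, discriminator_labels), the tiny vocabulary, and the random "generator" are all hypothetical stand-ins. A real generator is a small MLM that samples from its predicted distribution, and a real discriminator is a Transformer trained on the labels produced here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Choose positions to mask; return the masked sequence and those positions."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, positions

def toy_generator(masked, positions, vocab, rng):
    """Hypothetical stand-in for the generator: fill each [MASK] with a sampled token."""
    corrupted = list(masked)
    for i in positions:
        corrupted[i] = rng.choice(vocab)
    return corrupted

def discriminator_labels(original, corrupted):
    """Per-token targets for the discriminator: 1 = replaced, 0 = original.
    A token the generator happens to guess correctly counts as original."""
    return [int(o != c) for o, c in zip(original, corrupted)]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]  # assumed toy vocabulary

masked, positions = mask_tokens(tokens, mask_prob=0.15, rng=rng)
corrupted = toy_generator(masked, positions, vocab, rng)
labels = discriminator_labels(tokens, corrupted)

print("original :", tokens)
print("masked   :", masked)
print("corrupted:", corrupted)
print("labels   :", labels)
```

Note that the discriminator never sees [MASK]: its input is the corrupted sentence, and it gets a training signal from every token rather than only from the masked positions.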

