One of the “secrets” behind the success of Transformer models is the technique of Transfer Learning. In Transfer Learning, a model (in our case, a Transformer model) is _pre-trained_ on a gigantic dataset using an unsupervised _pre-training objective_. This same model is then _fine-tuned_ (typically supervised training) on the actual task at hand. The beauty of this approach is that the _fine-tuning_ dataset can be as small as 500–1000 training samples! A number small enough to be potentially scoffed out of the room if one were to call it Deep Learning. This also means that the expensive and time-consuming part of the pipeline, pre-training, only needs to be done once, and the _pre-trained_ model can be reused for any number of tasks thereafter. Since _pre-trained_ models are typically made publicly available 🙏, we can grab the relevant model, _fine-tune_ it on a custom dataset, and have a state-of-the-art model ready to go in a few hours!

If you are interested in learning how pre-training works and how you can train a brand new language model on a single GPU, check out my article linked below!

Understanding ELECTRA and Training an ELECTRA Language Model (towardsdatascience.com)

ELECTRA is one of the latest classes of _pre-trained_ Transformer models released by Google, and it switches things up a bit compared to most other releases. For the most part, Transformer models have followed the well-trodden path of Deep Learning, with larger models, more training, and bigger datasets equalling better performance. ELECTRA, however, bucks this trend by outperforming earlier models like BERT while using less computational power, smaller datasets, _and_ less training time. (In case you are wondering, ELECTRA is the same “size” as BERT.)

In this article, we’ll look at how to use a _pre-trained_ ELECTRA model for text classification, and we’ll compare it to other standard models along the way. Specifically, we’ll be comparing the final performance (Matthews correlation coefficient, or MCC) and the training times for each of the models listed below.

  • electra-small
  • electra-base
  • bert-base-cased
  • distilbert-base-cased
  • distilroberta-base
  • roberta-base
  • xlnet-base-cased
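
For reference, these map to Simple Transformers (model_type, model_name) pairs roughly as follows. The Hugging Face checkpoint names below are the standard public releases and are my assumption, not a verbatim copy of the benchmark script.

# Assumed (model_type, model_name) pairs for the models in the list above.
model_specs = [
    ("electra", "google/electra-small-discriminator"),
    ("electra", "google/electra-base-discriminator"),
    ("bert", "bert-base-cased"),
    ("distilbert", "distilbert-base-cased"),
    ("roberta", "distilroberta-base"),
    ("roberta", "roberta-base"),
    ("xlnet", "xlnet-base-cased"),
]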

As always, we’ll be doing this with the Simple Transformers library (based on the Hugging Face Transformers library) and we’ll be using Weights & Biases for visualizations.

You can find all the code used here in the examples directory of the library.

Installation

  1. Install Anaconda or Miniconda Package Manager from here.
  2. Create a new virtual environment and install packages.
     conda create -n simpletransformers python pandas tqdm
     conda activate simpletransformers
     conda install pytorch cudatoolkit=10.1 -c pytorch
  3. Install Apex if you are using fp16 training. Please follow the instructions here.
  4. Install simpletransformers.
     pip install simpletransformers

Data Preparation

We’ll be using the Yelp Review Polarity dataset, a binary (positive/negative) classification dataset. The script below will download it and store it in the data directory. Alternatively, you can download the data manually from FastAI.

mkdir data
wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz -O data/data.tgz
tar -xvzf data/data.tgz -C data/
mv data/yelp_review_polarity_csv/* data/
rm -r data/yelp_review_polarity_csv/
rm data/data.tgz
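With the data downloaded, we need to get it into Pandas DataFrames in the format Simple Transformers expects: a text column and an integer labels column. The snippet below is a minimal sketch of how this could look; the header-less CSV layout and the 1/2 → 0/1 label mapping reflect the raw Yelp Review Polarity files, but the exact preprocessing used for this article’s benchmarks may differ.

import pandas as pd

# The raw CSVs have no header row: column 0 is the label (1 = negative,
# 2 = positive) and column 1 is the review text.
train_df = pd.read_csv("data/train.csv", header=None, names=["labels", "text"])
eval_df = pd.read_csv("data/test.csv", header=None, names=["labels", "text"])

# Simple Transformers expects integer labels starting at 0.
train_df["labels"] = train_df["labels"] - 1
eval_df["labels"] = eval_df["labels"] - 1

# Put the columns in (text, labels) order.
train_df = train_df[["text", "labels"]]
eval_df = eval_df[["text", "labels"]]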

Hyperparameters

Once the data is in the data directory, we can start training our models.

Simple Transformers models can be configured extensively (see docs), but we’ll just be going with some basic, “good enough” hyperparameter settings. This is because we are more interested in comparing the models to each other on an equal footing, rather than trying to optimize for the absolute best hyperparameters for each model.
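
To make that concrete, here’s a minimal sketch of configuring and training one of these models with Simple Transformers. The hyperparameter values and the W&B project name are illustrative placeholders, not the exact settings used for the comparison.

from simpletransformers.classification import ClassificationModel

# "Good enough" settings used for illustration only.
model_args = {
    "num_train_epochs": 3,
    "train_batch_size": 128,
    "eval_batch_size": 64,
    "max_seq_length": 128,
    "overwrite_output_dir": True,
    "wandb_project": "electra-vs-bert",  # hypothetical W&B project name
}

# Swap in any other (model_type, model_name) pair from the list above,
# e.g. ("bert", "bert-base-cased") or ("xlnet", "xlnet-base-cased").
model = ClassificationModel(
    "electra", "google/electra-base-discriminator", args=model_args
)

# train_df and eval_df are the DataFrames from the Data Preparation step.
model.train_model(train_df)

# eval_model reports the Matthews correlation coefficient ("mcc") by default.
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result["mcc"])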


Battle of the Transformers: ELECTRA, BERT, RoBERTa, or XLNet