Rusty Shanahan

Battle of the Transformers: ELECTRA, BERT, RoBERTa, or XLNet

One of the “secrets” behind the success of Transformer models is the technique of Transfer Learning. In Transfer Learning, a model (in our case, a Transformer model) is pre-trained on a gigantic dataset using an unsupervised pre-training objective. This same model is then fine-tuned (typically with supervised training) on the actual task at hand. The beauty of this approach is that the fine-tuning dataset can be as small as 500–1000 training samples, a number small enough to be scoffed out of the room if one were to call it Deep Learning. This also means that the expensive and time-consuming part of the pipeline, pre-training, only needs to be done once, and the pre-trained model can be reused for any number of tasks thereafter. Since pre-trained models are typically made publicly available 🙏, we can grab the relevant model, fine-tune it on a custom dataset, and have a state-of-the-art model ready to go in a few hours!
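
As a minimal sketch of what this fine-tuning step looks like in practice (using the Simple Transformers library that we’ll be working with throughout this article, and a tiny made-up dataset purely for illustration):

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Tiny, made-up dataset purely for illustration; a real fine-tuning set
# would typically contain at least a few hundred labelled examples.
train_df = pd.DataFrame(
    [["The food was fantastic", 1], ["Terrible service, never again", 0]],
    columns=["text", "labels"],
)

# Grab a publicly available pre-trained checkpoint and fine-tune it on our data.
model = ClassificationModel("electra", "google/electra-small-discriminator", use_cuda=False)
model.train_model(train_df)

# The fine-tuned model is now ready to make predictions on new text.
predictions, raw_outputs = model.predict(["Great value for the price"])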

If you are interested in learning how pre-training works and how you can train a brand new language model on a single GPU, check out my article linked below!

Understanding ELECTRA and Training an ELECTRA Language Model (towardsdatascience.com)

ELECTRA is one of the latest classes of pre-trained Transformer models released by Google, and it switches things up a bit compared to most other releases. For the most part, Transformer models have followed the well-trodden path of Deep Learning, with larger models, more training, and bigger datasets equalling better performance. ELECTRA, however, bucks this trend by outperforming earlier models like BERT while using less computational power, smaller datasets, and less training time. (In case you are wondering, ELECTRA is the same “size” as BERT.)

In this article, we’ll look at how to use a pre-trained ELECTRA model for text classification, and we’ll compare it to other standard models along the way. Specifically, we’ll be comparing the final performance (Matthews correlation coefficient, or MCC) and the training times for each model listed below.

  • electra-small
  • electra-base
  • bert-base-cased
  • distilbert-base-cased
  • distilroberta-base
  • roberta-base
  • xlnet-base-cased
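
For reference, here is a rough sketch of how those names map onto the (model type, pre-trained checkpoint) pairs that Simple Transformers expects; the Hugging Face identifiers below are the standard public checkpoints, with the ELECTRA entries using the discriminator weights:

# (model_type, model_name) pairs as passed to Simple Transformers.
models_to_compare = [
    ("electra", "google/electra-small-discriminator"),  # electra-small
    ("electra", "google/electra-base-discriminator"),   # electra-base
    ("bert", "bert-base-cased"),
    ("distilbert", "distilbert-base-cased"),
    ("roberta", "distilroberta-base"),  # DistilRoBERTa loads as a RoBERTa model
    ("roberta", "roberta-base"),
    ("xlnet", "xlnet-base-cased"),
]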

As always, we’ll be doing this with the Simple Transformers library (based on the Hugging Face Transformers library) and we’ll be using Weights & Biases for visualizations.

You can find all the code used here in the examples directory of the library.

Installation

  1. Install Anaconda or Miniconda Package Manager from here.
  2. Create a new virtual environment and install packages.
     conda create -n simpletransformers python pandas tqdm
     conda activate simpletransformers
     conda install pytorch cudatoolkit=10.1 -c pytorch
  3. Install Apex if you are using fp16 training. Please follow the instructions here.
  4. Install simpletransformers.
     pip install simpletransformers

Data Preparation

We’ll be using the Yelp Review Polarity dataset, a binary classification dataset. The script below will download it and store it in the data directory. Alternatively, you can manually download the data from FastAI.

mkdir data
wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz -O data/data.tgz
tar -xvzf data/data.tgz -C data/
mv data/yelp_review_polarity_csv/* data/
rm -r data/yelp_review_polarity_csv/
rm data/data.tgz
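
With the files in place, the CSVs can be loaded into the two-column DataFrame format that Simple Transformers works with. A minimal sketch, assuming the usual FastAI layout of these files (no header row; the first column is the label, with 1 = negative and 2 = positive, and the second column is the review text):

import pandas as pd

# The FastAI distribution of the dataset ships without a header row.
train_df = pd.read_csv("data/train.csv", header=None, names=["labels", "text"])
eval_df = pd.read_csv("data/test.csv", header=None, names=["labels", "text"])

# Map the original 1/2 labels to 0 (negative) / 1 (positive).
train_df["labels"] = (train_df["labels"] == 2).astype(int)
eval_df["labels"] = (eval_df["labels"] == 2).astype(int)

# Reorder the columns to (text, labels), the layout Simple Transformers expects.
train_df = train_df[["text", "labels"]]
eval_df = eval_df[["text", "labels"]]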

Hyperparameters

Once the data is in the data directory, we can start training our models.

Simple Transformers models can be configured extensively (see docs), but we’ll just be going with some basic, “good enough” hyperparameter settings. This is because we are more interested in comparing the models to each other on an equal footing, rather than trying to optimize for the absolute best hyperparameters for each model.
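
As a concrete illustration (the values below are representative placeholders rather than the exact settings used for the full comparison, and the Weights & Biases project name is hypothetical), a shared configuration could look something like this:

from simpletransformers.classification import ClassificationArgs, ClassificationModel

# One "good enough" configuration shared across all models, so the comparison stays fair.
model_args = ClassificationArgs(
    num_train_epochs=1,
    learning_rate=4e-5,
    train_batch_size=32,
    eval_batch_size=32,
    max_seq_length=128,
    wandb_project="transformer-comparison",  # hypothetical W&B project name
)

# The same args object is reused for every (model_type, model_name) pair, e.g. ELECTRA-base here.
model = ClassificationModel("electra", "google/electra-base-discriminator", args=model_args)

Training and evaluation then boil down to model.train_model(train_df) followed by model.eval_model(eval_df); for binary classification, eval_model reports MCC among its default metrics.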

#machine-learning #data-science #nlp #data-analysis

ELECTRA Outperforms RoBERTa, ALBERT and XLNet (with Python Code)

ELECTRA achieves state-of-the-art performance in language representation learning, outperforming the previous leaders RoBERTa, ALBERT, and XLNet. At the same time, ELECTRA works efficiently, requiring relatively less compute than these other language representation learning methods.

Read more: https://analyticsindiamag.com/how-electra-outperforms-roberta-albert-and-xlnet/

#ai #roberta #albert #xlnet #bert

Ajay Kapoor

Digital Transformation Consulting Services & solutions

Compete in this Digital-First world with PixelCrayons’ advanced level digital transformation consulting services. With 16+ years of domain expertise, we have transformed thousands of companies digitally. Our insight-led, unique, and mindful thinking process helps organizations realize Digital Capital from business outcomes.

Let our expert digital transformation consultants partner with you to solve even complex business problems at speed and at scale.

Digital transformation company in India

#digital transformation agency #top digital transformation companies in india #digital transformation companies in india #digital transformation services india #digital transformation consulting firms

What Is Google’s Recently Launched BigBird

Recently, Google Research introduced BigBird, a new sparse attention mechanism that improves performance on a multitude of tasks requiring long contexts. The researchers took inspiration from graph sparsification methods.

They understood where the proof for the expressiveness of Transformers breaks down when full attention is relaxed to form the proposed attention pattern. They stated, “This understanding helped us develop BigBird, which is theoretically as expressive and also empirically useful.”
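
To give a rough idea of what such a sparse attention pattern looks like, here is a simplified mask in the spirit of BigBird’s combination of local windows, a few global tokens, and random connections (an illustration, not the actual implementation):

import numpy as np

def sparse_attention_mask(seq_len, window=2, n_global=2, n_random=2, seed=0):
    # mask[i, j] == True means query position i may attend to key position j.
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # Local window: attend to nearby tokens.
        mask[i, max(0, i - window):min(seq_len, i + window + 1)] = True
        # Random attention: a few arbitrary positions per query.
        mask[i, rng.choice(seq_len, size=n_random, replace=False)] = True
    # Global tokens attend everywhere and are attended to by every position.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

print(sparse_attention_mask(8).astype(int))  # 8 x 8 grid of 0s and 1s, mostly sparse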

Why is BigBird Important?

Bidirectional Encoder Representations from Transformers (BERT), a neural network-based technique for natural language processing (NLP) pre-training, has gained immense popularity in the last two years. This technology enables anyone to train their own state-of-the-art question answering system.

#developers corner #bert #bert model #google #google ai #google research #transformer #transformer model

Chelsie Towne

A Deep Dive Into the Transformer Architecture – The Transformer Models

Transformers for Natural Language Processing

It may seem like a long time since the world of natural language processing (NLP) was transformed by the seminal “Attention is All You Need” paper by Vaswani et al., but in fact that was less than 3 years ago. The relative recency of the introduction of transformer architectures and the ubiquity with which they have upended language tasks speaks to the rapid rate of progress in machine learning and artificial intelligence. There’s no better time than now to gain a deep understanding of the inner workings of transformer architectures, especially with transformer models making big inroads into diverse new applications like predicting chemical reactions and reinforcement learning.

Whether you’re an old hand or you’re only paying attention to transformer-style architectures for the first time, this article should offer something for you. First, we’ll dive deep into the fundamental concepts used to build the original 2017 Transformer. Then we’ll touch on some of the developments implemented in subsequent transformer models. Where appropriate, we’ll point out some limitations and how modern models inheriting ideas from the original Transformer are trying to overcome various shortcomings or improve performance.

What Do Transformers Do?

Transformers are the current state-of-the-art type of model for dealing with sequences. Perhaps the most prominent application of these models is in text processing tasks, and the most prominent of these is machine translation. In fact, transformers and their conceptual progeny have infiltrated just about every benchmark leaderboard in natural language processing (NLP), from question answering to grammar correction. In many ways transformer architectures are undergoing a surge in development similar to what we saw with convolutional neural networks following the 2012 ImageNet competition, for better and for worse.
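
To make “dealing with sequences” concrete, here is a minimal NumPy sketch of the scaled dot-product attention operation at the core of the original Transformer (single head, no masking, illustrative shapes only):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d_k) matrices of queries, keys, and values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # each output is a weighted sum of the values

# Toy self-attention example: a "sequence" of 3 tokens with 4-dimensional representations.
x = np.random.rand(3, 4)
output = scaled_dot_product_attention(x, x, x)  # Q = K = V for self-attention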

#natural language processing #ai artificial intelligence #transformers #transformer architecture #transformer models