Hugging Face is an NLP-focused startup with a large open-source community, in particular around the Transformers library. Transformers is a Python-based library that exposes an API to use many well-known transformer architectures, such as BERT, RoBERTa, GPT-2 or DistilBERT, that obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation. Those architectures come pre-trained with several sets of weights. Getting started with Transformers only requires installing the pip package:
pip install transformers
The library has seen super-fast growth in PyTorch and has recently been ported to TensorFlow 2.0, offering an API that now works with Keras’ fit API, TensorFlow Extended, and TPUs 👏. This blog post is dedicated to using the Transformers library with TensorFlow: using the Keras API as well as the TensorFlow TPUStrategy to fine-tune a state-of-the-art transformer model.
Transformers is based around the concept of pre-trained transformer models. These transformer models come in different shapes, sizes, and architectures and have their own ways of accepting input data: via tokenization.
The library builds on three main classes: a configuration class, a tokenizer class, and a model class.
Below is an example configuration, for the pre-trained weights bert-base-cased. The configuration classes host these attributes with various I/O methods and standardized name properties.
bert-base-cased-config.json
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}
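To get a feel for what these attributes imply, here is a minimal sketch that reads the configuration above with the standard json module and derives two quantities from it: the size of each attention head and a rough parameter count. The parameter layout used here (embeddings, identical encoder layers, and a pooler, each with biases and layer normalization) is an assumption about the standard BERT architecture and is not spelled out in this post, so treat the total as an estimate rather than an official figure.

```python
import json

# Configuration copied from the bert-base-cased-config.json listing above.
CONFIG_JSON = """
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}
"""

config = json.loads(CONFIG_JSON)
hidden = config["hidden_size"]
ffn = config["intermediate_size"]

# Each attention head works on an equal slice of the hidden state.
head_size = hidden // config["num_attention_heads"]


def dense(n_in, n_out):
    """Parameters of a fully connected layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale vector and one shift vector

# Assumed layout: word, position and segment embeddings, then layer norm.
embeddings = (
    config["vocab_size"] * hidden
    + config["max_position_embeddings"] * hidden
    + config["type_vocab_size"] * hidden
    + layer_norm
)

# Assumed layout of one encoder layer.
encoder_layer = (
    4 * dense(hidden, hidden)  # query, key, value and output projections
    + layer_norm
    + dense(hidden, ffn)       # feed-forward expansion
    + dense(ffn, hidden)       # feed-forward projection
    + layer_norm
)

pooler = dense(hidden, hidden)

total = embeddings + config["num_hidden_layers"] * encoder_layer + pooler
print(f"head size: {head_size}")
print(f"estimated parameters: {total:,}")
```

Under these assumptions the estimate comes out at roughly 108 million parameters, most of them in the twelve encoder layers.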
A TensorFlow model from the library is a tf.keras.layers.Layer, which means it can be used very simply with Keras’ fit API or trained using a custom training loop and GradientTape.
The advantage of using Transformers lies in its straightforward, model-agnostic API. Loading a pre-trained model along with its tokenizer takes only a few lines of code. Here is an example of loading the BERT and GPT-2 TensorFlow models as well as their tokenizers:
model_load.py
from transformers import (TFBertModel, BertTokenizer,
                          TFGPT2Model, GPT2Tokenizer)
bert_model = TFBertModel.from_pretrained("bert-base-cased") # Automatically loads the config
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
gpt2_model = TFGPT2Model.from_pretrained("gpt2") # Automatically loads the config
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
Loading architectures is model-agnostic
The weights are downloaded from Hugging Face’s S3 bucket and cached locally on your machine. The models are ready to be used for inference or fine-tuned if need be. Let’s see that in action.
Fine-tuning a model is made easy thanks to some methods available in the Transformers library. The next parts cover building an input pipeline, training with Keras’ fit method, and training with a custom loop under a distribution strategy.
We have made an accompanying Colab notebook to get you on track quickly with all the code. We’ll leverage the tensorflow_datasets package for data loading. tensorflow_datasets provides us with a tf.data.Dataset, which can be fed into our glue_convert_examples_to_features method.
This method uses the tokenizer to tokenize the input and adds special tokens at the beginning and the end of sequences (like [SEP] or [CLS], for instance) if the model requires such additional tokens. It returns a tf.data.Dataset holding the featurized inputs.
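To make the idea of featurization more concrete, here is a toy sketch in plain Python of what happens to a sentence pair for a BERT-style model. It is not the library's implementation: the function featurize_pair is hypothetical, whitespace splitting stands in for the real tokenizer, and tokens are kept as strings where the real method produces integer ids. It only illustrates where the special tokens go and how the segment ids and attention mask line up with the padded sequence.

```python
def featurize_pair(sentence_a, sentence_b, max_length=16):
    """Toy BERT-style featurization of a sentence pair (illustrative only)."""
    # Whitespace splitting stands in for the real subword tokenizer.
    tokens_a = sentence_a.split()
    tokens_b = sentence_b.split()

    # Special tokens mark the start of the input and the end of each sentence.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        # This sketch does not truncate; it simply refuses inputs that do not fit.
        raise ValueError("pair does not fit in max_length")

    # Segment ids: 0 for [CLS], the first sentence and its [SEP]; 1 for the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [1] * len(tokens)

    padding = max_length - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


features = featurize_pair("He said hello", "He greeted them", max_length=10)
for name, values in features.items():
    print(name, values)
```

Every example ends up with the same length, which is what allows the resulting dataset to be batched.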
We can then shuffle this dataset and batch it in batches of 32 units using standard tf.data.Dataset methods.
loading_data.py
import tensorflow_datasets
from transformers import glue_convert_examples_to_features
# Load the MRPC task of the GLUE benchmark as tf.data.Dataset splits
data = tensorflow_datasets.load("glue/mrpc")
train_dataset = data["train"]
validation_dataset = data["validation"]
# Featurize with the BERT tokenizer loaded earlier, using a maximum sequence length of 128
train_dataset = glue_convert_examples_to_features(train_dataset, bert_tokenizer, 128, 'mrpc')
validation_dataset = glue_convert_examples_to_features(validation_dataset, bert_tokenizer, 128, 'mrpc')
# Shuffle, batch, and repeat the training set; only batch the validation set
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
validation_dataset = validation_dataset.batch(64)
Building an input pipeline for our model
Training a model using Keras’ fit method has never been simpler. Now that we have the input pipeline set up, we can define the hyperparameters and call Keras’ fit method with our dataset.
keras_fit.py
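import tensorflow as tf
from transformers import TFBertForSequenceClassification
# Fine-tuning on MRPC needs a classification head on top of BERT
bert_model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
bert_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
# train_dataset and validation_dataset come from the input pipeline built above
bert_history = bert_model.fit(
    train_dataset,
    epochs=2,
    steps_per_epoch=115,
    validation_data=validation_dataset,
    validation_steps=7
)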
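The values of steps_per_epoch and validation_steps are not arbitrary. A short sketch of the arithmetic, assuming the MRPC splits contain 3,668 training and 408 validation examples (these sizes are not stated in this post) and using the batch sizes of 32 and 64 from the input pipeline:

```python
import math

# Assumed split sizes for GLUE MRPC; they are not given in this post.
TRAIN_EXAMPLES = 3668
VALIDATION_EXAMPLES = 408


def steps_for(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


steps_per_epoch = steps_for(TRAIN_EXAMPLES, 32)
validation_steps = steps_for(VALIDATION_EXAMPLES, 64)
print(steps_per_epoch, validation_steps)
```

Because the training dataset is repeated, Keras cannot infer where an epoch ends from the data alone, which is why the number of steps is passed explicitly.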
Training with a strategy gives you better control over what happens during the training. By switching between strategies, the user can select the distributed fashion in which the model is trained: from multi-GPUs to TPUs.
As of the time of writing, TPUStrategy is the only surefire way to train a model on a TPU using TensorFlow 2. Building a custom loop using a strategy makes even more sense in that regard, as strategies may easily be switched around and training on multi-GPU would require practically no code change.
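One practical consequence of switching strategies is how a batch gets divided. As a rough sketch, assuming synchronous data-parallel training in which each batch from the input pipeline is split evenly across replicas, the number of examples each device sees per step can be worked out as follows. The helper per_replica_batch_size and the hardware setups are hypothetical and only serve the illustration.

```python
def per_replica_batch_size(global_batch_size, num_replicas):
    """Split a global batch evenly across replicas (simplifying assumption)."""
    if global_batch_size % num_replicas != 0:
        raise ValueError("this sketch assumes the batch divides evenly")
    return global_batch_size // num_replicas


GLOBAL_BATCH_SIZE = 32  # same batch size as the input pipeline in this post

# Hypothetical hardware setups, mapped to their number of replicas.
setups = {"single GPU": 1, "4 GPUs": 4, "TPU with 8 cores": 8}

shards = {
    name: per_replica_batch_size(GLOBAL_BATCH_SIZE, replicas)
    for name, replicas in setups.items()
}
for name, size in shards.items():
    print(f"{name}: {size} examples per replica per step")
```

With many replicas, a batch size tuned for a single device may leave each replica with very little work per step, so the global batch size is often revisited when changing strategy.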
Building a custom loop requires a bit of work to set up, so the reader is advised to open the following Colab notebook to get a better grasp of the subject at hand. It does not go into the details of tokenization as the first Colab does, but it shows how to build an input pipeline that will be used by the TPUStrategy.
It uses a Google Cloud Platform bucket to host the data, as TPUs are complicated to handle when using local filesystems. The Colab notebook is available here.
The main selling point of the Transformers library is its simple, model-agnostic API. Because the library acts as a front end to models that obtain state-of-the-art results in NLP, switching between models according to the task at hand is extremely easy.
As an example, here’s the complete script to fine-tune BERT on a text classification task (MRPC):
BERT_keras.py
import tensorflow as tf
import tensorflow_datasets
from transformers import (TFBertForSequenceClassification, BertTokenizer,
                          glue_convert_examples_to_features)
model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
data = tensorflow_datasets.load("glue/mrpc")
train_dataset = data["train"]
train_dataset = glue_convert_examples_to_features(train_dataset, tokenizer, 128, 'mrpc')
train_dataset = train_dataset.shuffle(100).batch(32)  # Keras expects batched inputs
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
model.fit(train_dataset, epochs=3)
However, in a production environment, memory is scarce, and you may prefer a smaller model such as DistilBERT. To switch, simply replace the two lines that load the model and the tokenizer with these two:
DistilBERT_keras.py
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
As a platform hosting 10+ Transformer architectures, Transformers makes it very easy to use, fine-tune, and compare the models that have transformed the field of deep learning for NLP. It serves as a backend for many downstream apps that leverage transformer models and is used in production by many different companies. We welcome any questions or issues you might have on our GitHub repository.