Faster and easier state-of-the-art NLP in TensorFlow 2.0

Faster and easier state-of-the-art NLP in TensorFlow 2.0

State of the art faster Natural Language Processing in Tensorflow 2.0. tf-transformers is designed to harness the full power of Tensorflow 2, to make it much faster and simpler comparing to existing Tensorflow based NLP architectures. On an average, there is 80 % improvement over current exsting Tensorflow based libraries, on text generation and other tasks. You can find more details in the Benchmarks section.

tf-transformers: faster and easier state-of-the-art NLP in TensorFlow 2.0

tf-transformers is designed to harness the full power of Tensorflow 2, to make it much faster and simpler comparing to existing Tensorflow based NLP architectures. On an average, there is 80 % improvement over current exsting Tensorflow based libraries, on text generation and other tasks. You can find more details in the Benchmarks section.

All / Most NLP downstream tasks can be integrated into Tranformer based models with much ease. All the models can be trained using, which supports GPU, multi-GPU, TPU.

Unique Features

  • Faster AutoReggressive Decoding using Tensorflow2. Faster than PyTorch in most experiments (V100 GPU). 80% faster compared to existing TF based libararies (relative difference) Refer benchmark code.
  • Complete TFlite support for BERT, RoBERTA, T5, Albert, mt5 for all down stream tasks except text-generation
  • Faster sentence-piece alignment (no more LCS overhead)
  • Variable batch text generation for Encoder only models like GPT2
  • No more hassle of writing long codes for TFRecords. minimal and simple.
  • Off the shelf support for auto-batching or tf.ragged tensors
  • Pass dictionary outputs directly to loss functions inside using model.compile2 . Refer examples or blog
  • Multiple mask modes like causal, user-defined, prefix by changing one argument . Refer examples or blog

Performance Benchmarks

Evaluating performance benhcmarks is trickier. I evaluated tf-transformers, primarily on text-generation tasks with GPT2 small and t5 small, with amazing HuggingFace, as it is the ready to go library for NLP right now. Text generation tasks require efficient caching to make use of past Key and Value pairs.

On an average, tf-transformers is 80 % faster than HuggingFace Tensorflow implementation and in most cases it is comparable or faster than PyTorch.

1. GPT2 benchmark

The evaluation is based on average of 5 runs, with different batch_size, beams, sequence_length etc. So, there is qute a larg combination, when it comes to BEAM and top-k8 decoding. The figures are *randomly taken 10 samples. But, you can see the full code and figures in the repo.

  • GPT2 greedy

This is image title

  • GPT2 beam

This is image title

  • GPT2 top-k top-p

This is image title

  • GPT2 greedy histogram

This is image title

Codes to reproduce GPT2 benchmark experiments

Codes to reproduce T5 benchmark experiments


I am providing some basic tutorials here, which covers basics of tf-transformers and how can we use it for other downstream tasks. All/most tutorials has following structure:

  • Introduction About the Problem
  • Prepare Training Data
  • Load Model and asociated downstream Tasks
  • Define Optimizer, Loss
  • Train using Keras and CustomTrainer
  • Evaluate Using Dev data
  • In Producton - Secton defines how can we use tf.saved_model in production + pipelines

Production Ready Tutorials

Start by converting HuggingFace models (base models only) to tf-transformers models.

Here are a few examples : Jupyter Notebooks:

Why should I use tf-transformers?

  1. Use state-of-the-art models in Production, with less than 10 lines of code.

    • High performance models, better than all official Tensorflow based models
    • Very simple classes for all downstream tasks
    • Complete TFlite support for all tasks except text-generation
  2. Make industry based experience to avaliable to students and community with clear tutorials

  3. Train any model on GPU, multi-GPU, TPU with amazing

    • Train state-of-the-art models in few lines of code.
    • All models are completely serializable.
  4. Customize any models or pipelines with minimal or no code change.

Do we really need to distill? Jont Loss is all we need.


We have conducted few experiments to squeeze the power of Albert base models ( concept is applicable to any models and in tf-transformers, it is out of the box.)

The idea is minimize the loss for specified task in each layer of your model and check predictions at each layer. as per our experiments, we are able to get the best smaller model (thanks to Albert), and from layer 4 onwards we beat all the smaller model in GLUE benchmark. By layer 6, we got a GLUE score of 81.0, which is 4 points ahead of Distillbert with GLUE score of 77 and MobileBert GLUE score of 78.

The Albert model has 14 million parameters, and by using layer 6, we were able to speed up the compuation by 50% .

The concept is applicable to all the models.

Codes to reproduce GLUE Joint Loss experiments

This is image title

Benchmark Results

  • GLUE score ( not including WNLI )
2. SQUAD v1.1

We have trained Squad v1.1 with joint loss. At layer 6 we were able to achieve same performance as of Distillbert - (EM - 78.1 and F1 - 86.2), but slightly worser than MobileBert.

Benchmark Results

This is image title

Codes to reproduce Squad v1.1 Joint Loss experiments

Note: We have a new model in pipeline. :-)


With pip

This repository is tested on Python 3.7+, and Tensorflow 2.4.0

Recommended to use a virtual environment.

Assuming Tensorflow 2.0 is installed

pip install tf-transformers

From Github

Assuming poetry is installed. If not pip install poetry .

git clone

cd tf-transformers

poetry install


Pipeline in tf-transformers is different from HuggingFace. Here, pipeline for specific tasks expects a model and tokenizer_fn. Because in an ideal scenario, no one will be able to understand whats the kind of pre-processing we want to do to our inputs. Please refer above tutorial notebooks for examples.

Token Classificaton Pipeline (NER)

from tf_transformers.pipeline import Token_Classification_Pipeline

def tokenizer_fn(feature):
    feature: tokenized text (tokenizer.tokenize)
    result = {}
    result["input_ids"] = tokenizer.convert_tokens_to_ids([tokenizer.cls_token] +  feature['input_ids'] + [tokenizer.bos_token])
    result["input_mask"] = [1] * len(result["input_ids"])
    result["input_type_ids"] = [0] * len(result["input_ids"])
    return result

# load Keras/ Serialized Model
model_ner = # Load Model
slot_map_reverse = # dictionary index - entity mapping
pipeline = Token_Classification_Pipeline( model = model_ner,
                tokenizer = tokenizer,
                tokenizer_fn = tokenizer_fn,
                label_map = slot_map_reverse,
                max_seq_length = 128,

sentences = ['I would love to listen to Carnatic music by Yesudas',
            'Play Carnatic Fusion by Various Artists',
            'Please book 2 tickets from Bangalore to Kerala']
result = pipeline(sentences)

Span Selection Pipeline (QA)

from tf_transformers.pipeline import Span_Extraction_Pipeline

def tokenizer_fn(features):
    features: dict of tokenized text
    Convert them into ids

    result = {}
    input_ids = tokenizer.convert_tokens_to_ids(features['input_ids'])
    input_type_ids = tf.zeros_like(input_ids).numpy().tolist()
    input_mask = tf.ones_like(input_ids).numpy().tolist()
    result['input_ids'] = input_ids
    result['input_type_ids'] = input_type_ids
    result['input_mask'] = input_mask
    return result

model = # Load keras/ saved_model
# Span Extraction Pipeline
pipeline = Span_Extraction_Pipeline(model = model,
                tokenizer = tokenizer,
                tokenizer_fn = tokenizer_fn,
                n_best_size = 20,
                n_best = 5,
                max_answer_length = 30,
                max_seq_length = 384,

questions = ['When was Kerala formed?']
contexts = ['''Kerala (English: /ˈkɛrələ/; Malayalam: [ke:ɾɐɭɐm] About this soundlisten (help·info)) is a state on the southwestern Malabar Coast of India. It was formed on 1 November 1956, following the passage of the States Reorganisation Act, by combining Malayalam-speaking regions of the erstwhile states of Travancore-Cochin and Madras. Spread over 38,863 km2 (15,005 sq mi), Kerala is the twenty-first largest Indian state by area. It is bordered by Karnataka to the north and northeast, Tamil Nadu to the east and south, and the Lakshadweep Sea[14] to the west. With 33,387,677 inhabitants as per the 2011 Census, Kerala is the thirteenth-largest Indian state by population. It is divided into 14 districts with the capital being Thiruvananthapuram. Malayalam is the most widely spoken language and is also the official language of the state.[15]''']
result = pipeline(questions=questions, contexts=contexts)

Classification Model Pipeline

from tf_transformers.pipeline import Classification_Pipeline
from import pad_dataset_normal

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
max_seq_length = 128

def tokenizer_fn(texts):
    feature: tokenized text (tokenizer.tokenize)
    pad_dataset_noral will automatically pad it.
    input_ids = []
    input_type_ids = []
    input_mask = []
    for text in texts:
        input_ids_ex = [tokenizer.cls_token] + tokenizer.tokenize(text)[: max_seq_length-2] + [tokenizer.sep_token] # -2 to add CLS and SEP
        input_ids_ex = tokenizer.convert_tokens_to_ids(input_ids_ex)
        input_mask_ex = [1] * len(input_ids_ex)
        input_type_ids_ex = [0] * len(input_ids_ex)


    result = {}
    result['input_ids'] = input_ids
    result['input_type_ids'] = input_type_ids
    result['input_mask'] = input_mask
    return result

model = # Load keras/ saved_model
label_map_reverse = {0: 'unacceptable', 1: 'acceptable'}
pipeline = Classification_Pipeline( model = model,
                tokenizer_fn = tokenizer_fn,
                label_map = label_map_reverse,

sentences = ['In which way is Sandy very anxious to see if the students will be able to solve the homework problem?',
            'The book was written by John.',
            'Play Carnatic Fusion by Various Artists',
            'She voted herself.']
result = pipeline(sentences)

Supported Models architectures

tf-transformers currently provides the following architectures .

  1. ALBERT (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
  2. BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  3. BERT For Sequence Generation (from Google) released with the paper Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
  4. ELECTRA (from Google Research/Stanford University) released with the paper ELECTRA: Pre-training text encoders as discriminators rather than generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
  5. GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
  6. MT5 (from Google AI) released with the paper mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
  7. RoBERTa (from Facebook), released together with the paper a Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
  8. T5 (from Google AI) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.


tf-transformers is a personal project. This has nothing to do with any organization. So, I might not be able to host equivalent checkpoints of all base models. As a result, there is a conversion notebooks, to convert above mentioned architectures from HuggingFace to tf-transformers.


I want to give credits to Tensorflow NLP official repository. I used November 2019 version of master branch ( where tf.keras.Network) was used for models. I have modified that by large extend now.

Apart from that, I have used many common scripts from many open repos. I might not be able to recall everything as it is. But still credit goes to them too.

Download Details:

Author: legacyai Source Code:

tensorflow machine-learning deep-learning data-science

Bootstrap 5 Complete Course with Examples

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Building a simple Applications with Vue 3

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Most popular Data Science and Machine Learning courses — July 2020

Most popular Data Science and Machine Learning courses — August 2020. This list was last updated in August 2020 — and will be updated regularly so as to keep it relevant

How To Build A Data Science Career In 2021

In Conversation With Dr Suman Sanyal, NIIT University,he shares his insights on how universities can contribute to this highly promising sector and what aspirants can do to build a successful data science career.

PyTorch for Deep Learning | Data Science | Machine Learning | Python

PyTorch for Deep Learning | Data Science | Machine Learning | Python. PyTorch is a library in Python which provides tools to build deep learning models. What python does for programming PyTorch does for deep learning. Python is a very flexible language for programming and just like python, the PyTorch library provides flexible tools for deep learning.

Machine Learning And Deep Learning Interview Questions For Data Science Interview

This video will help you get an idea about the top machine learning and deep learning interview questions that are crucial to crack any data science interview. We have included conceptual, theoretical and practical questions on machine learning and deep learning techniques. Let’s begin!

Deep Learning vs Machine Learning vs Artificial Intelligence vs Data Science

This "Deep Learning vs Machine Learning vs AI vs Data Science" video talks about the differences and relationship between Artificial Intelligence, Machine Learning, Deep Learning, and Data Science.