Benchmarking Python NLP Tokenizers
I love building and writing about bag-of-words classification models, but HATE waiting for tokenizers to execute when I have a lot of text to analyze. A tokenizer is simply a function that breaks a string into a list of words (i.e. tokens) as shown below:
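The original snippet is not reproduced here, so as a minimal illustration (my own example sentence), Python's built-in `str.split` already behaves like a crude whitespace tokenizer:

```python
# A crude whitespace tokenizer: break a string into a list of tokens.
sentence = "the patient was admitted to the hospital"
tokens = sentence.split()
print(tokens)
# → ['the', 'patient', 'was', 'admitted', 'to', 'the', 'hospital']
```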
Since I have been working in the NLP space for a few years now, I have come across a few different functions for tokenization. In this blog post, I will benchmark (i.e. time) a few tokenizers including NLTK, spaCy, and Keras.
TL;DR: Don’t use NLTK’s word_tokenize; use NLTK’s regexp_tokenize.
In this blog post, we will be timing a few different tokenizers. Obviously the timing is going to be dependent on the specific computer, but in general we are looking for the one that performs the best. Let’s start by importing a few packages.
We will be using Python 3.6 and the versions of the other relevant packages are shown below:
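The version listing itself is not reproduced here. A sketch for printing the versions in your own environment could look like the following; it assumes Python 3.8 or newer (for `importlib.metadata`), and the package list and the helper name `installed_versions` are my own choices:

```python
import sys
from importlib import metadata

# Package names as published on PyPI; adjust to whatever you have installed.
packages = ["nltk", "spacy", "keras", "scikit-learn", "pandas"]


def installed_versions(names):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("python", sys.version.split()[0])
for name, version in installed_versions(packages).items():
    print(name, version or "not installed")
```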
I am a healthcare data scientist, so in this blog post I will use the publicly available MIMIC III dataset, which is a set of deidentified medical data collected at Beth Israel Deaconess Medical Center in Boston, Massachusetts. If you would like to get access to this dataset, you can follow my instructions here. Let’s load the data into a pandas dataframe.
Here we will use the NOTEEVENTS.csv provided by MIMIC III which currently has over 2 million notes from a range of clinical note categories including discharge summaries, physician, consult, pharmacy, social worker, radiology, nursing, etc. The notes are contained in the column ‘TEXT’. We can find the length of each note in characters using:
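As a rough sketch of the loading and length steps: the dataset requires credentialed access, so a tiny stand-in dataframe is used here. The `TEXT` column comes from the post; the file path in the comment and the `TEXT_LEN` column name are assumptions.

```python
import pandas as pd

# Stand-in for the real data. With MIMIC access this would be something like
# df = pd.read_csv("NOTEEVENTS.csv")  (the path is an assumption).
df = pd.DataFrame({"TEXT": ["Chest x-ray normal.", "Pt stable.", ""]})

# Length of each note in characters (TEXT_LEN is a hypothetical column name).
df["TEXT_LEN"] = df["TEXT"].str.len()
print(df["TEXT_LEN"].tolist())
# → [19, 10, 0]
```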
We can then plot a histogram to see that the notes cover a range of different lengths.
For this post, we just need a list of notes, so let’s extract them from the pandas dataframe.
To make things simple later on, let’s shuffle the notes just in case there is an order to the MIMIC dataset.
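A minimal sketch of the extract-and-shuffle step, using a short stand-in list in place of the real notes. The fixed seed is my own addition so that the order is reproducible between runs:

```python
import random

# Stand-in for the list extracted from the dataframe, e.g. df["TEXT"].tolist().
notes = ["note a", "note b", "note c", "note d", "note e"]

# Shuffle in place; the seed value is an arbitrary choice.
random.Random(42).shuffle(notes)
print(len(notes))
# → 5
```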
In this post, we are going to write a few functions that follow this structure for each custom tokenizer
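The shared structure is roughly "clean, then split, then return a list". In this sketch the `clean_text` body is only a placeholder and `tokenizer_template` is a hypothetical name; each real tokenizer swaps in its own splitting call at step 2:

```python
def clean_text(text):
    """Placeholder cleaning step; here it only lowercases (an assumption)."""
    return text.lower()


def tokenizer_template(text):
    """Skeleton shared by each custom tokenizer."""
    text = clean_text(text)  # 1. apply the shared cleaning function
    tokens = text.split()    # 2. replace with the library-specific tokenizer
    return tokens            # 3. return a list of string tokens
```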
In most NLP tasks people tend to clean the data in some way, so we will use the same cleaning function with each custom tokenizer.
If we pass all punctuation and numbers to NLTK’s word_tokenize, we end up with the following list:
As we can see, NLTK splits on some punctuation but not all, and numbers are still included. Each custom tokenizer could have slightly different rules for splitting on punctuation.
For my particular NLP classification tasks, I have decided that all punctuation and numbers should be replaced with a space. This removes them while preventing the two words adjacent to the punctuation from being merged. I am also assuming that single quotes tend to appear in contractions, and I would like [can’t] to turn into [cant] instead of two tokens [can, t], so we will replace single quotes with an empty string. To accomplish this I have created the following clean_text function:
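A sketch of such a cleaning function, built on `str.translate` (the technique the post refers to later). The lowercasing step and the exact character set (ASCII punctuation and digits) are assumptions, not necessarily the original implementation:

```python
import string

# Characters replaced by a space: ASCII punctuation (except the single quote)
# and digits.
_TO_SPACE = string.punctuation.replace("'", "") + string.digits

# One translation table: punctuation/digits -> space, single quote -> removed.
_CLEAN_TABLE = str.maketrans({**dict.fromkeys(_TO_SPACE, " "), "'": None})


def clean_text(text):
    """Lowercase, blank out punctuation and digits, drop single quotes."""
    return text.lower().translate(_CLEAN_TABLE)


print(clean_text("Can't wait: 2 more days!").split())
# → ['cant', 'wait', 'more', 'days']
```

Note that this table only handles the ASCII single quote; typographic quotes would need to be added to it explicitly.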
To demonstrate how to time a function, I will now time clean_text on its own. The timings cover an increasing number of notes, ending with 100000 notes taking 12.7 seconds. Note that if you run this a few times you may get slightly different numbers depending on the state of your computer, but for the purposes of this post we will just run it once.
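A timing helper along these lines would do the job. This is a sketch: `time_function` is a hypothetical name, and the stand-in notes and sizes are much smaller than the 100000 notes timed in the post.

```python
import time


def time_function(func, items, sizes):
    """Apply func to the first n items for each n in sizes.

    Returns a list with the elapsed seconds for each n.
    """
    times = []
    for n in sizes:
        start = time.perf_counter()
        for item in items[:n]:
            func(item)
        times.append(time.perf_counter() - start)
    return times


# Tiny stand-in data; str.lower stands in for the function being timed.
notes = ["Pt stable, no acute distress."] * 1000
sizes = [10, 100, 1000]
times = time_function(str.lower, notes, sizes)
for n, t in zip(sizes, times):
    print(f"{n} notes: {t:.4f} s")
```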
One of the standard tokenizers is word_tokenize, which is contained in the NLTK package. We can make our function that uses clean_text and time it (saving the times) below:
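A sketch of what this function might look like. The `clean_text` here is a compact stand-in for the cleaning step described earlier (lowercasing is an assumption), and `tokenizer_nltk` is a hypothetical name:

```python
import string

from nltk.tokenize import word_tokenize

# Compact stand-in for the cleaning step: punctuation and digits become
# spaces, single quotes are dropped.
_TABLE = str.maketrans(
    {**dict.fromkeys(string.punctuation + string.digits, " "), "'": None}
)


def clean_text(text):
    return text.lower().translate(_TABLE)


def tokenizer_nltk(text):
    """Clean the text, then split it with NLTK's word_tokenize."""
    return word_tokenize(clean_text(text))
```

`word_tokenize` depends on NLTK's Punkt tokenizer data. If that data is not installed, calling `tokenizer_nltk` raises a `LookupError` whose message names the resource to fetch with `nltk.download`.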
Well that’s just disappointing: it takes almost 5 minutes (291 seconds) just to tokenize 100000 notes. This is kind of annoying if you are playing with hyperparameters of a vectorizer for your NLP bag-of-words model. Note that the cleaning function plays a minimal role with this tokenizer (12 seconds out of 291 seconds). Let’s see if we can do better.
Python’s standard library includes the re module, which provides regular expressions. The module contains functions for quickly searching through text. Here is our tokenize function that uses regular expressions:
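One way to write it, as a sketch: the `\w+` pattern and the name `tokenizer_re` are assumptions, and `clean_text` is a compact stand-in for the cleaning step described earlier.

```python
import re
import string

# Compact stand-in for the cleaning step: punctuation and digits become
# spaces, single quotes are dropped, text is lowercased (an assumption).
_TABLE = str.maketrans(
    {**dict.fromkeys(string.punctuation + string.digits, " "), "'": None}
)


def clean_text(text):
    return text.lower().translate(_TABLE)


# Compile once so the pattern is not looked up again on every call.
_WORD = re.compile(r"\w+")


def tokenizer_re(text):
    """Clean the text, then extract runs of word characters."""
    return _WORD.findall(clean_text(text))


print(tokenizer_re("Pt can't walk 2 blocks."))
# → ['pt', 'cant', 'walk', 'blocks']
```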
It is amazing, look how fast regexp runs!
word_tokenize must be adding extra bells and whistles to the tokenizer. I was curious what was included, so I looked at the source code. All I was able to learn was that it uses a Treebank tokenizer. However, looking at the source code pointed me to another tokenizer in NLTK that just uses regular expressions: regexp_tokenize.
The function and timing for regexp_tokenize are shown below:
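A sketch of the NLTK regexp version. `regexp_tokenize(text, pattern)` is NLTK's function; the `\w+` pattern, the name `tokenizer_nltk_regexp`, and the compact `clean_text` stand-in are assumptions:

```python
import string

from nltk.tokenize import regexp_tokenize

# Compact stand-in for the cleaning step: punctuation and digits become
# spaces, single quotes are dropped, text is lowercased (an assumption).
_TABLE = str.maketrans(
    {**dict.fromkeys(string.punctuation + string.digits, " "), "'": None}
)


def clean_text(text):
    return text.lower().translate(_TABLE)


def tokenizer_nltk_regexp(text):
    """Clean the text, then split it with NLTK's regexp_tokenize."""
    return regexp_tokenize(clean_text(text), pattern=r"\w+")


print(tokenizer_nltk_regexp("Pt can't walk 2 blocks."))
# → ['pt', 'cant', 'walk', 'blocks']
```

Unlike `word_tokenize`, this needs no extra NLTK data downloads.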
This is on par with, and perhaps even slightly better than, the plain re implementation!
Recently, I have been reading and watching a few tutorials about spaCy. The landing page for the package says “The library respects your time, and tries to avoid wasting it” which is encouraging to me. spaCy is very powerful and has a lot of built in functions to help with named entity recognition, part-of-speech tagging, word vectors and much more. However, for our simple task, all we need to do is tokenize.
I’m going to load the ‘en’ spaCy package and disable the named entity recognition, parser and tagger. This will remove some of the default options that we don’t ultimately need and that would take time if we didn’t disable them.
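A sketch of a tokenizer-only spaCy setup. As far as I know, the ‘en’ shortcut is no longer available in spaCy 3, so this swaps in `spacy.blank("en")`, which builds an English pipeline with just the tokenizer and needs no model download. Dropping whitespace tokens, the name `tokenizer_spacy`, and the compact `clean_text` stand-in are my own assumptions:

```python
import string

import spacy

# Compact stand-in for the cleaning step: punctuation and digits become
# spaces, single quotes are dropped, text is lowercased (an assumption).
_TABLE = str.maketrans(
    {**dict.fromkeys(string.punctuation + string.digits, " "), "'": None}
)


def clean_text(text):
    return text.lower().translate(_TABLE)


# Blank English pipeline: tokenizer only, no tagger, parser or NER.
nlp = spacy.blank("en")


def tokenizer_spacy(text):
    """Clean the text, tokenize with spaCy, and drop whitespace-only tokens."""
    doc = nlp(clean_text(text))
    return [token.text for token in doc if not token.is_space]


print(tokenizer_spacy("The patient is stable."))
# → ['the', 'patient', 'is', 'stable']
```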
The timing is shown below:
As you can see, this isn’t as fast as the regexp tokenizers but is at least similar to NLTK’s word_tokenize. Obviously, there must be a few extra default options in spaCy’s tokenizer (more on this later).
Keras is a very popular library for building neural networks in Python. It also contains a word tokenizer, text_to_word_sequence (although the name is not as obvious). The function and timings are shown below:
The timing is similar to the regexp tokenizers. If you look under the hood, you can see it is also using regexp to split. It also has options for filtering out characters, which use the same text.translate technique we used in clean_text.
We can plot the times for all these tests as shown below
Benchmark of a few tokenizers
So far we have only analyzed the execution time of each tokenizer function, which suggests the regexp methods are the best. However, we have not yet checked whether the tokenizers produce the same results! For bag-of-words classifiers, we tend to use a tokenizer with some type of vectorizer (a function for converting tokens into numerical columns). Here I will use the CountVectorizer from scikit-learn and extract the vocabulary. Note that I’m going to set max_features to None so I get the full vocabulary, because I noticed that one of the functions didn’t match!
Here is the code for getting the vocab for one tokenizer; it is repeated for each tokenizer.
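A sketch of the vocabulary extraction, using two stand-in notes and a simple regexp tokenizer in place of the full `clean_text` version. `CountVectorizer`, its `tokenizer` and `max_features` parameters, and the `vocabulary_` attribute are scikit-learn's; the rest is illustrative:

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def tokenizer_re(text):
    """Simplified stand-in tokenizer: lowercase, then runs of word characters."""
    return re.findall(r"\w+", text.lower())


# Stand-in notes; the post fits on the MIMIC notes instead.
notes = ["Pt stable overnight", "Pt discharged home"]

# max_features=None keeps every token instead of only the most frequent ones.
vect = CountVectorizer(tokenizer=tokenizer_re, max_features=None)
vect.fit(notes)

# vocabulary_ maps each token to its column index; the keys are the vocab.
vocab = set(vect.vocabulary_)
print(sorted(vocab))
# → ['discharged', 'home', 'overnight', 'pt', 'stable']
```

Recent scikit-learn versions may warn that `token_pattern` is ignored when a custom `tokenizer` is supplied; the warning is harmless here.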
spaCy was the only tokenizer that did not have the same vocab as the others. We can look at the vocab that is missing from spaCy’s set with the following:
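With the vocabularies stored as Python sets, the comparison is a set difference. The two sets below are made-up stand-ins for illustration, not the real vocabularies from the benchmark:

```python
# Hypothetical vocabularies standing in for the real ones.
vocab_regexp = {"pt", "stable", "cant", "dont"}
vocab_spacy = {"pt", "stable"}

# Tokens present in the regexp vocabulary but absent from spaCy's.
missing_from_spacy = vocab_regexp - vocab_spacy
print(sorted(missing_from_spacy))
# → ['cant', 'dont']
```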
This is interesting because they are all contractions, which probably means spaCy handles contractions differently. We can confirm this with the following:
This sheds a bit more light on why spaCy takes longer than the regexp tokenizers.
In this post, we benchmarked a few different tokenizers. We learned that NLTK’s word_tokenize and spaCy have extra functionality built in by default, which adds significant time to execution. If you need the extra functionality, you’ll have to accept the extra execution time!
It appears that regular expressions are currently the fastest implementation, and they are also available within NLTK and Keras. Let me know if you know of a faster solution!
Since I already have NLTK installed in most of my projects, I will be switching to regexp_tokenize instead of word_tokenize!
Originally published by Andrew Long at towardsdatascience.com