Ever since I started working on NLP (Natural Language Processing), I have been wondering which NLP library best meets most of our common NLP requirements. There is no one-size-fits-all answer, and the choice of library depends on the task at hand, but I was still curious how different libraries would compare if they were benchmarked against a very simple task.

With that in mind, I put on my developer hat and set out to write Python code using various libraries, to evaluate them against a very common task. To keep things simple, I decided to use a Twitter text-classification problem for the evaluation. The most common NLP libraries today are NLTK, spaCy, TextBlob, and Gensim, and of course there are deep neural network architectures using LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) cells.

The problem statement

The dataset I am using consists of a collection of tweets. Some of the tweets are labeled as racist while others are not. This is a classic supervised binary-classification problem. Our job is to create models based on different libraries, and use them to classify previously unseen text as racist or not.

Here is a look at some of the available tweets:

[Image: sample tweets from the dataset with their labels]

The label 1 means the tweet is racist, and the label 0 means it is not.
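
As a rough illustration of this supervised binary-classification setup, here is a minimal sketch in plain Python. It is not the code from this article or from the repository. The rows in `TRAIN` are made-up placeholders, not real tweets from the dataset, and the names `tokenize`, `train_naive_bayes`, and `predict` are hypothetical helpers. The classifier is a small multinomial Naive Bayes with add-one smoothing, chosen only because it needs no external library.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder rows in (label, text) form.
# These are NOT taken from the dataset: 1 = flagged, 0 = not flagged.
TRAIN = [
    (0, "lovely sunny morning at the beach"),
    (0, "enjoying coffee with friends today"),
    (0, "great game last night so happy"),
    (1, "flagged insult flagged slur example"),
    (1, "another flagged slur hateful example"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(rows):
    """Count documents and words per label (assumes both labels occur)."""
    doc_counts = Counter()
    word_counts = {0: Counter(), 1: Counter()}
    for label, text in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label (0 or 1) with the higher log-probability score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        # Log prior: fraction of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(TRAIN)
print(predict(model, "happy sunny morning"))   # unseen text, label 0 expected
print(predict(model, "flagged slur example"))  # unseen text, label 1 expected
```

On this toy data the two calls print 0 and 1 respectively. A real evaluation would train on the labeled tweets and measure performance on a held-out split instead of a couple of hand-picked strings.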

For the sake of brevity, I will focus only on the key sections of the code. For the full code, please feel free to visit my GitHub Machine-learning repository. Since I have already cleaned up the dataset and performed the EDA (Exploratory Data Analysis), I will not be covering those details here either.

