Model Selection in Text Classification

Introduction:

In the beginning, there was a simple problem: my manager asked me whether we could classify emails and their associated documents with NLP methods.

It doesn’t sound very glamorous, but I would be starting with thousands of samples. The first request was to use “XGBoost”, because: “We can do everything with XGBoost”. An interesting job indeed, if data science all comes down to XGBoost…

After implementing a notebook with different algorithms, metrics, and visualizations, something still bothered me: I couldn’t choose between the models from a single run. You can simply get lucky with one model and have no idea how to reproduce its good accuracy, precision, and so on.

So at that moment I asked myself: how do you do model selection? I searched the web and read posts, articles, and so on. It was very interesting to see the different ways to implement this kind of thing, but everything became blurry as soon as neural networks came into play. From then on, I had one thing in mind: how to compare classical methods (multinomial Naïve Bayes, SVM, logistic regression, boosting…) with neural networks (shallow, deep, LSTM, RNN, CNN…).

I present here a short explanation of the notebook. Comments are welcome.

_The notebook is available on GitHub:_ here

_The notebook is available on Colab:_ here

How to start?

Every project starts with an exploratory data analysis (**EDA** for short), followed directly by **preprocessing** (the texts were very dirty: signatures in emails, URLs, mail headers, etc.). The different cleaning functions are available in the GitHub repository.
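As an illustration, a minimal cleaning function could look like this (a hypothetical sketch, not the notebook’s exact code):

# minimal cleaning sketch: strip URLs, email addresses, and non-letters
import re

def clean_text(text):
    text = re.sub(r'https?://\S+', ' ', text)   # remove URLs
    text = re.sub(r'\S+@\S+', ' ', text)        # remove email addresses
    text = re.sub(r'[^A-Za-z\s]', ' ', text)    # keep letters only
    return re.sub(r'\s+', ' ', text).strip().lower()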

A quick way to check that the preprocessing is correct is to look at the most common **n-grams** (uni-, bi-, tri-… grams), as in the sketch below. Another post will cover this in more detail.
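For example, the most frequent bigrams can be extracted directly with scikit-learn (a quick sketch, assuming the cleaned texts live in df[TEXT] as used later in the notebook; get_feature_names_out needs scikit-learn >= 1.0):

# count bigrams over the whole corpus and print the 10 most frequent ones
from sklearn.feature_extraction.text import CountVectorizer

bigram_vect = CountVectorizer(analyzer='word', ngram_range=(2, 2))
counts = bigram_vect.fit_transform(df[TEXT]).sum(axis=0).A1
top10 = sorted(zip(bigram_vect.get_feature_names_out(), counts), key=lambda x: -x[1])[:10]
print(top10)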

Data

We will apply the model selection method to the IMDB dataset. If you are not familiar with it, the IMDB dataset contains movie reviews (text) for sentiment analysis (binary: positive or negative).

More details can be found here. To download it:

$ wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

$ tar -xzf aclImdb_v1.tar.gz
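The archive contains one text file per review, organized into train/test and pos/neg folders. One possible way to load it into a DataFrame (a sketch, not the notebook’s exact loader):

# walk aclImdb/{train,test}/{pos,neg} and collect one row per review
import os
import pandas as pd

rows = []
for split in ('train', 'test'):
    for label in ('pos', 'neg'):
        folder = os.path.join('aclImdb', split, label)
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), encoding='utf-8') as f:
                rows.append({'text': f.read(), 'label': label, 'split': split})

df = pd.DataFrame(rows)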

Vectorizing methods

One-hot encoding (CountVectorizer):

It’s the method where words are replaced by vectors of 0s and 1s. The goal is to take a corpus (a large collection of texts) and build a vector from every unique word it contains. Each word is then projected onto this vocabulary, where 0 marks absence and 1 marks presence.

       | bird | cat | dog | monkey |
bird   |  1   |  0  |  0  |    0   |
cat    |  0   |  1  |  0  |    0   |
dog    |  0   |  0  |  1  |    0   |
monkey |  0   |  0  |  0  |    1   |

The corresponding Python code:

# create a count vectorizer object
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df[TEXT])  # fit on the full corpus (text without stopwords)

# transform the training and validation data using the count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
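A quick sanity check (not in the original snippet) is to look at the vocabulary size and the shape of the resulting document-term matrix:

# each row is a document, each column one word of the learned vocabulary
print(len(count_vect.vocabulary_))  # number of unique words in the corpus
print(xtrain_count.shape)           # (n_train_documents, vocabulary_size)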

TF-IDF:

_Term Frequency-Inverse Document Frequency_ is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus (source: tf-idf).

This method is powerful when dealing with a large number of stopwords (words that carry little information: _I, me, my, myself, we, our, ours, ourselves, you_… in English). In its classic form, the weight of a term t in a document d is tf(t, d) × log(N / df(t)), where N is the total number of documents and df(t) the number of documents containing t (scikit-learn uses a smoothed variant). The IDF term thus reveals important and rare words while downweighting terms that appear everywhere, stopwords included.

# word-level tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=10000)
tfidf_vect.fit(df[TEXT])
xtrain_tfidf = tfidf_vect.transform(train_x_sw)
xvalid_tfidf = tfidf_vect.transform(valid_x_sw)
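To see the IDF term at work, you can inspect the learned weights: the highest-IDF words are the rarest ones (a quick check, again assuming scikit-learn >= 1.0 for get_feature_names_out):

# words with the highest IDF appear in the fewest documents
idf = dict(zip(tfidf_vect.get_feature_names_out(), tfidf_vect.idf_))
print(sorted(idf, key=idf.get, reverse=True)[:10])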

TF-IDF n-grams:

Unlike the previous tf-idf, which is based on single words, the n-gram variant takes n successive words into account.

# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=10000)
tfidf_vect_ngram.fit(df[TEXT])
xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(train_x_sw)
xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(valid_x_sw)

TF-IDF character n-grams:

Same as the previous method, but at the character level: the vectorizer looks at n successive characters instead of words.

# character-level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=10000)
tfidf_vect_ngram_chars.fit(df[TEXT])
xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x_sw)
xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x_sw)
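With these four feature sets in place, the same estimator can be trained on each of them and scored on the same validation split. A minimal sketch with logistic regression (illustrative only, assuming train_y and valid_y hold the labels; the notebook compares many more models):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

features = {
    'count': (xtrain_count, xvalid_count),
    'tf-idf word': (xtrain_tfidf, xvalid_tfidf),
    'tf-idf n-gram': (xtrain_tfidf_ngram, xvalid_tfidf_ngram),
    'tf-idf char': (xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars),
}

for name, (xtr, xva) in features.items():
    clf = LogisticRegression(max_iter=1000).fit(xtr, train_y)
    print(name, accuracy_score(valid_y, clf.predict(xva)))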

