Text Classification in Python

This article is the first of a series in which I will cover the whole process of developing a machine learning project.

In this article we focus on training a supervised learning text classification model in Python.

The motivation behind writing these articles is the following: as a data scientist who is still learning and who has been working with data science tools and machine learning models for a fair amount of time, I’ve found that many articles on the internet, books and the literature in general focus strongly on the modeling part. That is, we are given a certain dataset (with the labels already assigned if it is a supervised learning problem), we try several models and we obtain a performance metric. And the process ends there.

But in real-life problems, I think that finding the right model with the right hyperparameters is only the beginning of the task. What will happen when we deploy the model? How will it respond to new data? Will this data look the same as the training dataset? Will there be some information (scaling or feature-related information) that we will need? Will it be available? Will the user accept and understand the uncertainty associated with the results? We have to ask ourselves these questions if we want to succeed at bringing a machine learning-based service to our final users.

For this reason, I have developed a project that covers the full process of creating an ML-based service: getting the raw data and parsing it, creating the features, training different models and choosing the best one, getting new data to feed the model and showing useful insights to the final user.

The project involves the creation of a real-time web application that gathers data from several newspapers and shows a summary of the different topics that are being discussed in the news articles.

This is achieved with a supervised machine learning classification model that is able to predict the category of a given news article, a **web scraping method** that gets the latest news from the newspapers, and an **interactive web application** that shows the obtained results to the user.

This can be seen as a **text classification** problem. Text classification is one of the most widely used natural language processing (NLP) applications across different business problems.

This article is aimed at people who already have some understanding of basic machine learning concepts (e.g. know what cross-validation is and when to use it, know the difference between Logistic and Linear Regression, etc.). However, I will briefly explain the different concepts involved in the project.

The GitHub repo can be found here. It includes all the code and a complete report. I will not include the code in this post because it would be too long, but I will provide a link wherever it is needed.

I will divide the process into three different posts:

  • Classification model training (this post)
  • News articles web scraping (will be published soon)
  • App creation and deployment (will be published soon)

This post covers the first part: classification model training. We’ll cover it in the following steps:

  1. Problem definition and solution approach
  2. Input data
  3. Creation of the initial dataset
  4. Exploratory Data Analysis
  5. Feature Engineering
  6. Predictive Models

1. Problem definition and solution approach

As we have said, we are talking about a supervised learning problem. This means we need a labeled dataset so the algorithms can learn the patterns and correlations in the data. We fortunately have one available, but in real-life problems this is a critical step, since we normally have to label the data manually. After all, if we were able to automate the labeling of data points, why would we need a classification model?

2. Input data

The dataset used in this project is the BBC News Raw Dataset. It can be downloaded from here.

It consists of 2,225 documents from the BBC news website corresponding to stories in five topical areas from 2004 to 2005. These areas are:

  • Business
  • Entertainment
  • Politics
  • Sport
  • Tech

The download file contains five folders (one for each category). Each folder has a single *.txt* file for every news article. These files contain the body of each news article as raw text.

3. Creation of the initial dataset

The aim of this step is to get a dataset with the following structure:

We have created this dataset with an R script, because the *readtext* package greatly simplifies this procedure. The script can be found here.
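Since the rest of the project is in Python, the same step can also be sketched there. The snippet below is not the R script mentioned above, only a minimal sketch: it assumes the download has been unpacked into a root folder with one sub-folder per category, and the function name `build_dataset` and the field names `file_name`, `content` and `category` are illustrative choices rather than the structure of the original dataset.

```python
from pathlib import Path


def build_dataset(root_dir, encoding="utf-8"):
    """Collect one record per .txt file, labelled with its parent folder name."""
    records = []
    for category_dir in sorted(Path(root_dir).iterdir()):
        if not category_dir.is_dir():
            continue
        for txt_file in sorted(category_dir.glob("*.txt")):
            # errors="replace" keeps the loop going if a file contains odd bytes
            text = txt_file.read_text(encoding=encoding, errors="replace")
            records.append(
                {
                    "file_name": txt_file.name,
                    "content": text,
                    "category": category_dir.name,
                }
            )
    return records
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(records)` if a data frame is preferred.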

4. Exploratory Data Analysis

It is a common practice to carry out an exploratory data analysis in order to gain some insights from the data. However, up to this point, we don’t have any features that define our data. We will see how to create features from text in the next section (5. Feature Engineering), but, because of the way these features are constructed, we would not expect any valuable insights from analyzing them. For this reason, we have only performed a shallow analysis.

One of our main concerns when developing a classification model is whether the different classes are balanced. This means that the dataset contains an approximately equal proportion of each class.

For example, if we had two classes and 95% of the observations belonged to one of them, a dumb classifier that always output the majority class would have 95% accuracy, although it would fail every prediction for the minority class.
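The 95% example can be checked with a few lines of plain Python. This is a small sketch on made-up labels (`"a"` and `"b"` are placeholders, not categories from the dataset); the helper names are hypothetical.

```python
from collections import Counter


def majority_baseline(y_true):
    """Predict the most frequent label for every observation."""
    majority_label = Counter(y_true).most_common(1)[0][0]
    return [majority_label] * len(y_true)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the observations of `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


# Toy labels: 95 observations of class "a" and 5 of class "b"
y_true = ["a"] * 95 + ["b"] * 5
y_pred = majority_baseline(y_true)

print(accuracy(y_true, y_pred))     # 0.95
print(recall(y_true, y_pred, "b"))  # 0.0
```

The accuracy looks excellent while the recall of the minority class is zero, which is why accuracy alone can be misleading on imbalanced data.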

There are several ways of dealing with imbalanced datasets. A first approach is to undersample the majority class and oversample the minority one, so as to obtain a more balanced dataset. Another approach is to use evaluation metrics beyond accuracy, such as precision, recall or the F1-score. We’ll talk more about these metrics later.

Looking at our data, we can get the percentage of observations belonging to each class:

We can see that the classes are approximately balanced, so we won’t perform any undersampling or oversampling. However, we will still use precision and recall to evaluate model performance.

Another variable of interest can be the length of the news articles. We can obtain the length distribution across categories:

We can see that politics and tech articles tend to be longer, but not in a significant way. In addition, we will see in the next section that the length of the articles is taken into account and corrected by the method we use to create the features. So this should not matter too much to us.

The EDA notebook can be found here.

5. Feature Engineering

Feature engineering is an essential part of building any intelligent system. As Andrew Ng says:

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”

Feature engineering is the process of transforming data into features that act as inputs for machine learning models; good-quality features help improve model performance.

When dealing with text data, there are several ways of obtaining features that represent the data. We will cover some of the most common methods and then choose the most suitable for our needs.

5.1. Text representation

Recall that, in order to represent our text, every row of the dataset will be a single document of the corpus. The columns (features) will be different depending on which feature creation method we choose:

  • Word Count Vectors

With this method, every column is a term from the corpus, and every cell represents the frequency count of each term in each document.

  • TF–IDF Vectors

TF-IDF is a score that represents the relative importance of a term in a document with respect to the entire corpus. *TF* stands for Term Frequency, and *IDF* stands for Inverse Document Frequency.

The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

It also takes into account the fact that some documents may be longer than others by normalizing the *TF* term (expressing relative term frequencies instead).
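To make the description concrete, here is a minimal hand-rolled sketch of one common textbook variant: TF is the relative frequency of the term in the document, and IDF is the logarithm of the number of documents divided by the number of documents containing the term. Libraries often use slightly different formulas (scikit-learn, for instance, smooths the IDF term by default), so the exact numbers here are only illustrative, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per document.

    corpus: list of documents, each already split into a list of tokens.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        counts = Counter(doc)
        doc_scores = {}
        for term, count in counts.items():
            tf = count / len(doc)                    # relative term frequency
            idf = math.log(n_docs / doc_freq[term])  # rarer terms score higher
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores


corpus = [
    ["the", "market", "fell"],
    ["the", "team", "won"],
    ["the", "market", "rose", "again"],
]
scores = tf_idf(corpus)
print(scores[0]["the"])                         # 0.0, "the" appears in every document
print(scores[1]["team"] > scores[0]["market"])  # True, "team" is rarer than "market"
```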

These two methods (Word Count Vectors and TF-IDF Vectors) are often called Bag of Words methods, since the order of the words in a sentence is ignored. The following methods are more advanced, as they preserve to some extent the order of the words and their lexical context.

  • Word Embeddings

The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used. Word embeddings can be used with pre-trained models applying transfer learning.

  • Text based or NLP based features

We can manually create any feature that we think may be of importance when discerning between categories (e.g. word density, number of characters or words, etc.).

We can also use NLP based features using Part of Speech models, which can tell us, for example, if a word is a noun or a verb, and then use the frequency distribution of the PoS tags.

  • Topic Models

Methods such as Latent Dirichlet Allocation try to represent every topic by a probabilistic distribution over words, in what is known as topic modeling.

We have chosen *TF-IDF* vectors to represent the documents in our corpus. This choice is motivated by the following points:

  • TF-IDF is a simple model that yields great results in this particular domain, as we will see later.

  • Creating TF-IDF features is fast, which means shorter waiting times for the user of the web application.

  • We can tune the feature creation process (see next paragraph) to avoid issues like overfitting.

When creating the features with this method, we can choose some parameters:

  • N-gram range: we are able to consider unigrams, bigrams, trigrams…

  • Maximum/Minimum Document Frequency: when building the vocabulary, we can ignore terms that have a document frequency strictly higher/lower than the given threshold.

  • Maximum features: we can choose the top N features ordered by term frequency across the corpus.

We have chosen the following parameters:

We expect bigrams to improve our model performance by taking into consideration words that tend to appear together in the documents. We have chosen a Minimum DF equal to 10 to get rid of extremely rare words that appear in fewer than 10 documents, and a Maximum DF equal to 100% so as not to ignore any other words. We have chosen 300 as the maximum number of features because we want to avoid possible overfitting, which often arises from a large number of features compared to the number of training observations.
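As a rough illustration, these settings could be expressed with scikit-learn's `TfidfVectorizer`. This is a sketch under two assumptions: that scikit-learn is the library in use, and that using bigrams means unigrams plus bigrams, i.e. `ngram_range=(1, 2)`. The toy corpus is made up so that the snippet runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Values taken from the text: bigrams, min DF 10, max DF 100%, 300 features
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams (assumed range)
    min_df=10,           # an int is a document count: drop terms in fewer than 10 docs
    max_df=1.0,          # a float is a proportion, so 1.0 means 100%
    max_features=300,    # keep at most the 300 most frequent terms
)

# Toy corpus (hypothetical): 10 copies of two short documents so min_df=10 can be met
train_docs = ["interest rates rise again", "striker scores late goal"] * 10
features_train = vectorizer.fit_transform(train_docs)

print(features_train.shape)  # (20, 14): 8 unigrams and 6 bigrams survive
```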

As we will see in the next sections, these values lead us to really high accuracy values, so we will stick to them. However, these parameters could be tuned in order to train better models.

There is one important consideration that needs to be mentioned. Recall that the calculation of TF-IDF scores needs the presence of a corpus of documents to compute the Inverse Document Frequency term. For this reason, if we wanted to predict a single news article at a time (for example once the model is deployed), we would need to define that corpus.

This corpus is the set of training documents. Consequently, when obtaining TF-IDF features from a new article, only the features that existed in the training corpus will be created for this new article.
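The behavior described here can be seen in a small sketch, again assuming scikit-learn's `TfidfVectorizer` and a made-up corpus: the vectorizer is fitted on the training documents only, and a new article is then transformed with the vocabulary learned from them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["interest rates rise", "striker scores goal", "rates fall again"]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)  # vocabulary and IDF weights come from the training corpus

new_article = ["blockchain rates rise"]  # "blockchain" never appeared during training
features_new = vectorizer.transform(new_article)

print(features_new.shape[1] == len(vectorizer.vocabulary_))  # True
print("blockchain" in vectorizer.vocabulary_)                # False
```

The unseen word is silently ignored, so the new article is described only by the terms it shares with the training corpus.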

It is straightforward to conclude that the more similar the training corpus is to the news that we are going to be scraping when the model is deployed, the higher the accuracy we will presumably get.

5.2. Text cleaning

Before creating any feature from the raw text, we must perform a cleaning process to ensure no distortions are introduced to the model. We have followed these steps:

  • Special character cleaning: special characters such as “\n” and double quotes must be removed from the text, since we don’t expect any predictive power from them.

  • Upcase/downcase: we would expect, for example, “Book” and “book” to be the same word and have the same predictive power. For that reason we have downcased every word.

  • Punctuation signs: characters such as “?”, “!”, “;” have been removed.

  • Possessive pronouns: in addition, we would expect “Trump” and “Trump’s” to have the same predictive power, so the possessive ending is removed.

  • Stemming or Lemmatization: stemming is the process of reducing derived words to their root. Lemmatization is the process of reducing a word to its lemma. The main difference between the two methods is that lemmatization produces existing words, whereas stemming produces the root, which may not be an existing word. We have used a lemmatizer based on WordNet.

  • Stop words: words such as “what” or “the” won’t have any predictive power since they will presumably be common to all the documents. For this reason, they may represent noise that can be eliminated. We have downloaded a list of English stop words from the nltk package and then deleted them from the corpus.

There is one important consideration that must be made at this point. We should take into account possible distortions that are present not only in the training set, but also in the news articles that will be scraped when running the web application.
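The cleaning steps can be sketched as a single function that is applied both to the training documents and to newly scraped articles. This is a simplified illustration, not the project's code: the stop word set is a tiny hand-written placeholder instead of the nltk list, lemmatization is left out because it needs the WordNet data to be downloaded, and the function name `clean_text` is hypothetical.

```python
import re
import string

# Tiny illustrative stop word list; the text uses the full English list from nltk
STOP_WORDS = {"the", "a", "an", "of", "in", "what", "is", "and", "to"}


def clean_text(text, stop_words=STOP_WORDS):
    """Apply the cleaning steps listed above, except lemmatization."""
    text = text.replace("\r", " ").replace("\n", " ")  # special characters
    text = text.replace('"', "")                       # double quotes
    text = text.lower()                                # downcase
    text = re.sub(r"['’]s\b", "", text)                # possessives: "trump's" -> "trump"
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = [tok for tok in text.split() if tok not in stop_words]   # stop words
    return " ".join(tokens)


print(clean_text('What is in "Trump\'s" new book?\nThe details...'))
# trump new book details
```

Keeping all the steps in one function makes it easier to guarantee that scraped articles go through exactly the same transformations as the training data.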

5.3. Label coding

Machine learning models require numeric features and labels to provide a prediction. For this reason we must create a dictionary to map each label to a numerical ID. We have created this mapping scheme:
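A mapping of this kind is just a dictionary. The numeric IDs below are hypothetical (any consistent assignment works) and are not necessarily the ones used in the project; the variable names are illustrative as well.

```python
# Hypothetical IDs: any consistent assignment works, these are not the original values
category_codes = {
    "business": 0,
    "entertainment": 1,
    "politics": 2,
    "sport": 3,
    "tech": 4,
}
# Reverse lookup to turn a predicted ID back into a readable label
code_to_category = {code: name for name, code in category_codes.items()}

labels = ["tech", "sport", "tech"]
encoded = [category_codes[label] for label in labels]

print(encoded)                       # [4, 3, 4]
print(code_to_category[encoded[0]])  # tech
```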

5.4. Train — test split

We need to set apart a test set in order to assess the quality of our models when predicting unseen data. We have chosen a random split with 85% of the observations composing the training set and 15% of the observations composing the test set. We will perform the hyperparameter tuning process with cross validation on the training data, fit the final model to it and then evaluate it with totally unseen data so as to obtain an evaluation metric that is as unbiased as possible.
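An 85/15 random split could look like the following sketch, assuming scikit-learn's `train_test_split`. The documents and labels are toy stand-ins, and the seed value is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned articles and their numeric labels
documents = [f"article {i}" for i in range(100)]
labels = [i % 5 for i in range(100)]

X_train, X_test, y_train, y_test = train_test_split(
    documents,
    labels,
    test_size=0.15,  # 15% held out for the final evaluation
    random_state=8,  # arbitrary seed, only to make the split repeatable
)

print(len(X_train), len(X_test))  # 85 15
```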

The complete and detailed feature engineering code can be found here.

6. Predictive Models

6.1. Hyperparameter tuning methodology and models

We have tested several machine learning models to figure out which one fits the data better and properly captures the relationships between the points and their labels. We have only used classic machine learning models instead of deep learning models because of the insufficient amount of data we have, which would probably lead to overfit models that don’t generalize well to unseen data.

We have tried the following models:

  • Random Forest
  • Support Vector Machine
  • K Nearest Neighbors
  • Multinomial Naïve Bayes
  • Multinomial Logistic Regression
  • Gradient Boosting

Each one of them has multiple hyperparameters that also need to be tuned. We have used the following methodology to find the best set of hyperparameters for each model:

First, we have decided which hyperparameters we want to tune for each model, taking into account the ones that may have the most influence on the model’s behavior, and considering that a high number of parameters would require a lot of computational time.

Then, we have defined a grid of possible values and performed a Randomized Search using 3-Fold Cross Validation (with 50 iterations). Finally, once we have the model with the best hyperparameters, we have performed a Grid Search using 3-Fold Cross Validation centered on those values in order to exhaustively search the hyperparameter space for the best performing combination.

We have followed this methodology because with the randomized search we can cover a much wider range of values for each hyperparameter without incurring a really high execution time. Once we narrow down the range for each one, we know where to concentrate our search and explicitly specify every combination of settings to try.

The reason behind choosing 𝐾 = 3 as the number of folds and 50 iterations in the randomized search comes from the trade-off between a shorter execution time and testing a high number of combinations. When choosing the best model in the process, we have used accuracy as the evaluation metric.
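The two-stage search can be sketched as follows, assuming scikit-learn's `RandomizedSearchCV` and `GridSearchCV` with an SVM. The data is synthetic, the hyperparameter values are illustrative rather than the grids used in the project, and the number of iterations is reduced so that the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features and labels
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Step 1: randomized search over a wide grid (the text uses 50 iterations)
random_search = RandomizedSearchCV(
    estimator=SVC(),
    param_distributions={"C": [0.01, 0.1, 1, 10, 100], "kernel": ["linear", "rbf"]},
    n_iter=6,  # kept small so the sketch runs quickly
    cv=3,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X, y)
best_c = random_search.best_params_["C"]
best_kernel = random_search.best_params_["kernel"]

# Step 2: exhaustive grid search centered on the best values found in step 1
grid_search = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [best_c / 2, best_c, best_c * 2], "kernel": [best_kernel]},
    cv=3,
    scoring="accuracy",
)
grid_search.fit(X, y)

print(grid_search.best_params_, round(grid_search.best_score_, 3))
```

Because the second grid contains the best combination from the first step, the grid search can only match or improve on the cross-validated accuracy found by the randomized search.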

6.2. Performance Measurement

After performing the hyperparameter tuning process with the training data via cross validation and fitting the model to this training data, we need to evaluate its performance on totally unseen data (the test set). When dealing with classification problems, there are several metrics that can be used to gain insights on how the model is performing. Some of them are:

  • Accuracy: the accuracy metric measures the ratio of correct predictions over the total number of instances evaluated.

  • Precision: precision measures the fraction of instances predicted as positive that actually belong to the positive class.

  • Recall: recall measures the fraction of actual positive instances that are correctly classified.

  • F1-Score: this metric represents the harmonic mean of the recall and precision values.

  • Area Under the ROC Curve (AUC): this is a performance measurement for classification problems at various threshold settings. The ROC curve plots the true positive rate against the false positive rate, and the AUC measures the degree of separability: it tells how capable the model is of distinguishing between classes.

These metrics are widespread and widely used in binary classification. However, when dealing with multiclass classification they become more complex to compute and less interpretable. In addition, in this particular application, we just want documents to be correctly predicted. The costs of false positives and false negatives are the same to us. For this reason, it does not matter to us whether our classifier is more specific or more sensitive, as long as it correctly classifies as many documents as possible.

Therefore, we have used accuracy when comparing models and when choosing the best hyperparameters. When comparing models, we have calculated the accuracy on both the training and test sets so as to detect overfit models. In addition, we have obtained the confusion matrix and the classification report (which computes precision, recall and F1-score for all the classes) for every model, so we could further interpret their behavior.
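For reference, the metrics above can be computed by hand in a multiclass setting by treating one class at a time as the positive class. This is a minimal sketch on made-up predictions; the helper names are hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["politics", "politics", "tech", "tech", "sport", "sport"]
y_pred = ["politics", "tech", "tech", "tech", "sport", "sport"]

print(round(accuracy(y_true, y_pred), 2))  # 0.83
print(tuple(round(m, 2) for m in per_class_metrics(y_true, y_pred, "tech")))
# (0.67, 1.0, 0.8)
```

One politics article is predicted as tech, which lowers the precision of the tech class and the recall of the politics class while leaving the sport class untouched.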

6.3. Best Model Selection

Below we show a summary of the different models and their evaluation metrics:

Overall, we obtain really good accuracy values for every model. We can observe that the Gradient Boosting, Logistic Regression and Random Forest models seem to be overfit since they have an extremely high training set accuracy but a lower test set accuracy, so we’ll discard them.

We will choose the SVM classifier over the remaining models because it has the highest test set accuracy, which is very close to its training set accuracy. The confusion matrix and the classification report of the SVM model are the following:

6.4. Model Interpretation

At this point we have selected the SVM as our preferred model to do the predictions. Now, we will study its behavior by analyzing misclassified articles, in order to get some insights on the way the model is working and, if necessary, think of new features to add to the model. Recall that, although hyperparameter tuning is an important process, the most critical step when developing a machine learning project is being able to extract good features from the data.

Let’s show an example of a misclassified article. Its actual category is politics, although the model predicted tech.

This article talks about the prohibition of Blackberry mobiles in the Commons chamber. It involves both politics and tech, so the misclassification makes sense.

6.5. Dimensionality reduction plots

Dimensionality reduction refers to the process of converting data with a large number of dimensions into data with fewer dimensions, while ensuring that it conveys similar information concisely.

There are many applications of dimensionality reduction techniques in machine learning. One of them is visualization. By reducing the dimensional space to 2 or 3 dimensions that contain a great part of the information, we can plot our data points and be able to recognize some patterns as humans.

We have used two different techniques for dimensionality reduction:

  • Principal Component Analysis: this technique relies on computing the eigenvalues and eigenvectors of the covariance matrix of the data and tries to provide a minimum number of variables that keep the maximum amount of variance.

  • t-SNE: the t-distributed Stochastic Neighbor Embedding is a probabilistic technique particularly well suited for the visualization of high-dimensional datasets. It minimizes the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding.

Let’s plot the results:

We can see that using the t-SNE technique makes it easier to distinguish the different classes. Although we have only used dimensionality reduction techniques for plotting purposes, we could have used them to shrink the number of features to feed our models. This approach is particularly useful in text classification problems due to the commonly large number of features.
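Both projections can be obtained in a few lines, assuming scikit-learn's `PCA` and `TSNE`. The feature matrix below is random noise that only stands in for the TF-IDF features, so no class structure should be expected in the output; the perplexity value is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for the TF-IDF matrix: 60 documents, 300 features
rng = np.random.default_rng(0)
features = rng.random((60, 300))

# PCA: linear projection onto the two directions of largest variance
pca_points = PCA(n_components=2).fit_transform(features)

# t-SNE: non-linear embedding; perplexity must be smaller than the number of samples
tsne_points = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(features)

print(pca_points.shape, tsne_points.shape)  # (60, 2) (60, 2)
```

Each row of the two outputs is a 2D point that can be drawn in a scatter plot and colored by category.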

6.6. Predicted Conditional Probabilities

We have to make an additional consideration before stepping into the web scraping process. The training dataset has articles labeled as Business, Entertainment, Sport, Tech and Politics. But we could think of news articles that don’t fit into any of them (e.g. a weather news article). Since we have developed a supervised learning model, these kinds of articles would be wrongly classified into one of the 5 classes.

In addition, since our training dataset dates from 2004–2005, there may be a lot of new concepts (for example, technological ones) that will appear when scraping the latest articles, but won’t be present in the training data. Again, we expect poor predictive power in these cases.

A lot of classification models provide not only the class to which some data point belongs. They can also provide the conditional probability of belonging to the class 𝐶.

When we have an article that clearly talks, for example, about politics, we expect that the conditional probability of belonging to the Politics class is very high, and the other 4 conditional probabilities should be very low.

But when we have an article that talks about the weather, we expect the values of the conditional probability vector to be similar to each other, with none of them high.

Therefore, we can specify a threshold with this idea: if the highest conditional probability is lower than the threshold, we will provide no predicted label for the article. If it is higher, we will assign the corresponding label.

After a brief study exploring different articles that may not belong to any of the 5 categories, we have set that threshold at 65%.
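The thresholding rule can be sketched as a small function. The name `predict_with_threshold` is hypothetical, and the probability vectors are invented for illustration.

```python
def predict_with_threshold(probabilities, labels, threshold=0.65):
    """Return the most probable label, or None if the model is not confident enough.

    probabilities: one conditional probability per class, in the same order as labels.
    """
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_index] < threshold:
        return None  # no predicted label for this article
    return labels[best_index]


labels = ["business", "entertainment", "politics", "sport", "tech"]

print(predict_with_threshold([0.02, 0.03, 0.90, 0.01, 0.04], labels))  # politics
print(predict_with_threshold([0.25, 0.20, 0.20, 0.15, 0.20], labels))  # None
```

With scikit-learn, such a vector would typically come from a model's `predict_proba` method; for `SVC` this requires creating the model with `probability=True`.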

For further detail on all the steps of the model training process, please visit this link.

At this point, we have trained a model that will be able to classify news articles that we feed into it. We are a step closer to building our application!

Best Python Libraries For Data Science & Machine Learning

This video will focus on the top Python libraries that you should know to master Data Science and Machine Learning. Here’s a list of topics that are covered in this session:

  • Introduction To Data Science And Machine Learning
  • Why Use Python For Data Science And Machine Learning?
  • Python Libraries for Data Science And Machine Learning
  • Python libraries for Statistics
  • Python libraries for Visualization
  • Python libraries for Machine Learning
  • Python libraries for Deep Learning
  • Python libraries for Natural Language Processing

Python Programming for Data Science and Machine Learning

This article provides an overview of Python and its application to Data Science and Machine Learning and why it is important.

Originally published by Chris Kambala at dzone.com

Python is a general-purpose, high-level, object-oriented, and easy to learn programming language. It was created by Guido van Rossum who is known as the godfather of Python.

Python is a popular programming language because of its simplicity, ease of use, open source licensing, and accessibility — the foundation of its renowned community, which provides great support and help in creating tons of packages, tutorials, and sample programs.

Python can be used to develop a wide variety of applications — ranging from Web, Desktop GUI based programs/applications to science and mathematics programs, and Machine learning and other big data computing systems.

Let’s explore the use of Python in Machine Learning, Data Science, and Data Engineering.

Machine Learning

Machine learning is a relatively new and evolving system development paradigm that has quickly become a mandatory requirement for companies and programmers to understand and use. See our previous article on Machine Learning for the background. Due to the complex, scientific computing nature of machine learning applications, Python is considered the most suitable programming language. This is because of its extensive and mature collection of mathematics and statistics libraries, extensibility, ease of use and wide adoption within the scientific community. As a result, Python has become the recommended programming language for machine learning systems development.

Data Science

Data science combines cutting edge computer and storage technologies with data representation and transformation algorithms and scientific methodology to develop solutions for a variety of complex data analysis problems encompassing raw and structured data in any format. A Data Scientist possesses knowledge of solutions to various classes of data-oriented problems and expertise in applying the necessary algorithms, statistics, and mathematical models to create the required solutions. Python is recognized as one of the most effective and popular tools for solving data science related problems.

Data Engineering

Data Engineers build the foundations for Data Science and Machine Learning systems and solutions. Data Engineers are technology experts who start with the requirements identified by the data scientist. These requirements drive the development of data platforms that leverage complex data extraction, loading, and transformation to deliver structured datasets that allow the Data Scientist to focus on solving the business problem. Again, Python is an essential tool in the Data Engineer’s toolbox — one that is used every day to architect and operate the big data infrastructure that is leveraged by the data scientist.

Use Cases for Python, Data Science, and Machine Learning

Here are some example Data Science and Machine Learning applications that leverage Python.

  • Netflix uses data science to understand user viewing patterns and behavioral drivers. This, in turn, helps Netflix to understand user likes/dislikes and to predict and suggest relevant items to view.
  • Amazon, Walmart, and Target make heavy use of data science, data mining and machine learning to understand user preferences and shopping behavior. This helps both to predict demand, which drives inventory management, and to suggest relevant products to users online or via email marketing.
  • Spotify uses data science and machine learning to make music recommendations to its users.
  • Spam filters use data science and machine learning algorithms to detect and block spam emails.

This article provided an overview of Python and its application to Data Science and Machine Learning and why it is important.

A “Data Science for Good“ Machine Learning Project Walk-Through in Python

In this article and the sequel, we’ll walk through a complete machine learning project on a “Data Science for Good” problem: predicting household poverty in Costa Rica. Not only do we get to improve our data science skills in the most effective manner — through practice on real-world data — but we also get the reward of working on a problem with social benefits.

A “Data Science for Good“ Machine Learning Project Walk-Through in Python: Part One: Solving a complete machine learning problem for societal benefit

Data science is an immensely powerful tool in our data-driven world. Call me idealistic, but I believe this tool should be used for more than getting people to click on ads or spend more time consumed by social media.

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.

The full code is available as a Jupyter Notebook both on Kaggle (where it can be run in the browser with no downloads required) and on GitHub. This is an active Kaggle competition and a great project to get started with machine learning or to work on some new skills.

Problem and Approach

The Costa Rican Household Poverty Level Prediction challenge is a “data science for good” machine learning competition currently running on Kaggle. The objective is to use individual and household socio-economic indicators to predict poverty on a household basis. IDB, the Inter-American Development Bank, developed the problem and provided the data with the goal of improving upon traditional methods for identifying families in need of aid.

The Costa Rican Poverty Prediction contest is currently running on Kaggle.

The poverty labels fall into four levels making this a supervised multi-class classification problem:

  • Supervised: given the labels for the training data
  • Multi-Class Classification: labels are discrete with more than 2 values

The general approach to a machine learning problem is:

  1. Understand the problem and data descriptions
  2. Data cleaning / exploratory data analysis
  3. Feature engineering / feature selection
  4. Model comparison
  5. Model optimization
  6. Interpretation of results

While these steps may seem to present a rigid structure, the machine learning process is non-linear, with parts repeated multiple times as we get more familiar with the data and see what works. It’s nice to have an outline to provide a general guide, but we’ll often return to earlier parts of the process if things aren’t working out or as we learn more about the problem.

We’ll go through the first four steps at a high level in this article, taking a look at some examples, with the full details available in the notebooks. This problem is a great one to tackle both for beginners, because the dataset is manageable in size, and for those who already have a firm footing, because Kaggle offers an ideal environment for experimenting with new techniques.

Understanding the Problem and Data

In an ideal situation, we’d all be experts in the problem subject with years of experience to inform our machine learning. In reality, we often work with data from a new field and have to rapidly acquire knowledge both of what the data represents and how it was collected.

Fortunately, on Kaggle, we can use the work shared by other data scientists to get up to speed relatively quickly. Moreover, Kaggle provides a discussion platform where you can ask questions of the competition organizers. While not exactly the same as interacting with customers at a real job, this gives us an opportunity to figure out what the data fields represent and any considerations we should keep in mind as we get into the problem.

Good questions to ask at this point concern what each data field represents and whether the data has known quirks or errors we should watch out for.

For example, after engaging in discussions with the organizers, the community found out the text string “yes” actually maps to the value 1.0 and that the maximum value in one of the columns should be 5 which can be used to correct outliers. We would have been hard-pressed to find out this information without someone who knows the data collection process!

Part of data understanding also means digging into the data definitions. The most effective way is literally to go through the columns one at a time, reading the description and making sure you know what the data represents. I find this a little dull, so I like to mix this process with data exploration, reading the column description and then exploring the column with stats and figures.

For example, we can read that meaneduc is the average amount of education in the family, and then we can plot its distribution by the value of the label to see if there are any noticeable differences between the poverty levels.

Average schooling in family by target (poverty level).

This shows that families least at risk of poverty (the non-vulnerable group) tend to have higher average education levels than those most at risk. Later, in feature engineering, we can use this information by building features from the education variables, since they seem to show a difference between the target labels.

There are a total of 143 columns (features), and while for a real application, you want to go through each with an expert, I didn’t exhaustively explore all of these in the notebook. Instead, I read the data definitions and looked at the work of other data scientists to understand most of the columns.

Another point to establish from the problem and data understanding stage is how we want to structure our training data. In this problem, we’re given a single table of data where each row represents an individual and the columns are the features. If we read the problem definition, we are told to make predictions for each household which means that our final training dataframe (and also testing) should have one row for each house. This point informs our entire pipeline, so it’s crucial to grasp at the outset.

A snapshot of the data where each row is one individual.

Determine the Metric

Finally, we want to make sure we understand the labels and the metric for the problem. The label is what we want to predict, and the metric is how we’ll evaluate those predictions. For this problem, the label is an integer, from 1 to 4, representing the poverty level of a household. The metric is the Macro F1 Score, a measure between 0 and 1 with a higher value indicating a better model. The F1 score is a common metric for binary classification tasks and “Macro” is one of the averaging options for multi-class problems.

Once you know the metric, figure out how to calculate it with whatever tool you are using. For Scikit-Learn and the Macro F1 score, the code is:

from sklearn.metrics import f1_score
# Code to compute metric on predictions
score = f1_score(y_true, y_prediction, average = 'macro')

Knowing the metric allows us to assess our predictions in cross validation and using a hold-out testing set, so we know what effect, if any, our choices have on performance. For this competition, we are given the metric to use, but in a real-world situation, we’d have to choose an appropriate measure ourselves.
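To make the “Macro” averaging concrete, here is a minimal from-scratch sketch of the same calculation. The function name macro_f1 and the toy labels are my own illustration, not part of the competition code: it computes an F1 score for each class separately and then takes the unweighted mean, so a rare class counts as much as a common one.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an example of the true class
    f1_per_class = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        f1_per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_per_class) / len(f1_per_class)

# Toy labels on the 1-4 poverty scale (made up for illustration)
y_true = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4, 4, 4]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

On inputs like these, the result should agree with f1_score(y_true, y_pred, average='macro') from Scikit-Learn.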

Data Exploration and Data Cleaning

Data exploration, also called Exploratory Data Analysis (EDA), is an open-ended process where we figure out what our data can tell us. We start broad and gradually narrow our analysis as we discover interesting trends / patterns that can be used for feature engineering or find anomalies. Data cleaning goes hand in hand with exploration because we need to address missing values or anomalies as we find them before we can do modeling.

For an easy first step of data exploration, we can visualize the distribution of the labels for the training data (we are not given the testing labels).

Distribution of training labels.

Right away this tells us we have an imbalanced classification problem, which can make it difficult for machine learning models to learn the underrepresented classes. Many algorithms have ways to try and deal with this, such as setting class_weight = "balanced" in the Scikit-Learn random forest classifier, although they don’t work perfectly. We also want to make sure to use stratified sampling with cross validation when we have an imbalanced classification problem, so that each fold gets the same balance of labels.
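As a rough sketch of what stratified cross validation looks like in Scikit-Learn, the example below uses synthetic data (the features, class sizes, and random seeds are assumptions for illustration, not the competition data). It checks that every fold holds the same number of examples of each label, then scores a class-weighted random forest with the Macro F1 metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training data (not the real dataset)
rng = np.random.default_rng(42)
y = np.repeat([1, 2, 3, 4], [10, 20, 20, 50])
X = rng.normal(size=(len(y), 5)) + y[:, None] * 0.5

# Stratified folds keep the label proportions the same in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = [np.bincount(y[test_idx], minlength=5)[1:].tolist()
               for _, test_idx in skf.split(X, y)]
print(fold_counts)  # per-fold counts of labels 1, 2, 3, 4

# class_weight="balanced" weights classes inversely to their frequency
model = RandomForestClassifier(n_estimators=20, class_weight="balanced",
                               random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring="f1_macro")
print(scores.mean())
```

Passing the same StratifiedKFold object to every model we evaluate also means the models are compared on identical splits.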

To get familiar with the data, it’s helpful to go through the different column data types (floats, integers, and objects), each of which usually represents a different statistical type of data.

I’m using *statistical type* to mean what the data represents (for example, a Boolean that can only be 1 or 0) and *data type* to mean the actual way the values are stored in Python, such as integers or floats. The statistical type informs how we handle the columns for feature engineering.

(I specified *usually* for each data type / statistical type pairing because you may find that statistical types are saved as the wrong data type.)

If we look at the integer columns for this problem, we can see that most of them represent Booleans because there are only two possible values:

Integer columns in data.

Going through the object columns, we are presented with a puzzle: 2 of the columns are Id variables (stored as strings), but 3 look to be numeric values.

# train is a pandas DataFrame of the training data
train.select_dtypes('object').head()

Object columns in original data.

This is where our earlier data understanding comes into play. For these three columns, some entries are “yes” and some are “no” while the rest are floats. We did our background research and thus know that a “yes” means 1 and a “no” means 0. Using this information, we can correct the values and then visualize the variable distributions colored by the label.

Distribution of corrected variables by the target label.

This is a great example of data exploration and cleaning going hand in hand. We find something incorrect with the data, fix it, and then explore the data to make sure our correction was appropriate.
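A minimal sketch of that correction, assuming a toy column (the column name and values here are made up for illustration): we swap the two strings for their numeric meaning and then convert the whole column to numbers.

```python
import pandas as pd

# Toy stand-in for one of the mixed-type object columns;
# the column name "dependency" and the values are used only for illustration
train = pd.DataFrame({"dependency": ["yes", "no", "0.5", "2", "yes"]})

# "yes" means 1 and "no" means 0; everything else is already a number stored as text
mapping = {"yes": "1", "no": "0"}
train["dependency"] = pd.to_numeric(train["dependency"].replace(mapping))

print(train["dependency"].tolist())
print(train["dependency"].dtype)
```

After the conversion the column is a float and can be plotted or aggregated like any other numeric feature.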

Missing Values

A critical data cleaning operation for this data is handling missing values. Calculating the total and percent of missing values is simple in Pandas:

import pandas as pd
# Number of missing in each column
missing = pd.DataFrame(data.isnull().sum()).rename(columns = {0: 'total'})
# Fraction of values missing in each column (multiply by 100 for a percentage)
missing['percent'] = missing['total'] / len(data)

Missing values in data.

In some cases there are reasons for missing values: the v2a1 column represents monthly rent and many of the missing values are because the household owns the home. To figure this out, we can subset the data to houses missing the rent payment and then plot the tipo_ variables (I’m not sure where these column names come from) which show home ownership.

Home ownership status for those households with no rent payments.

Based on the plot, the solution is to fill in the missing rent payments for households that own their house with 0 and leave the others to be imputed. We also add a boolean column that indicates if the rent payment was missing.

The other missing values are dealt with the same way: using knowledge from other columns or about the problem to fill in the values, or leaving them to be imputed. Adding a boolean column to indicate missing values can also be useful because sometimes the *information that a value was missing* is important.
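Here is a small sketch of the rent-payment fix on toy data. The owns_home column is a hypothetical stand-in for the home ownership indicator, and the rent values are invented; only v2a1 is a column name from the dataset.

```python
import numpy as np
import pandas as pd

# Toy household data; "owns_home" is a hypothetical ownership flag,
# v2a1 is the monthly rent column
house = pd.DataFrame({
    "v2a1": [100000.0, np.nan, np.nan, 80000.0],
    "owns_home": [0, 1, 0, 0],
})

# Record that the value was missing before overwriting anything
house["v2a1-missing"] = house["v2a1"].isnull()

# Owners pay no rent, so a missing payment for them is really 0;
# other missing payments are left as NaN to be imputed later
owner_missing = house["v2a1"].isnull() & (house["owns_home"] == 1)
house.loc[owner_missing, "v2a1"] = 0

print(house)
```

Creating the flag first matters: once the zeros are filled in, we can no longer tell which households originally had no recorded payment.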

Once we’ve handled the missing values, anomalies, and incorrect data types, we can move on to feature engineering. I usually view data exploration as an ongoing process rather than one set chunk. For example, as we get into feature engineering, we might want to explore the new variables we create.

Feature Engineering

If you follow my work, you’ll know I’m convinced automated feature engineering — with domain expertise — will take the place of traditional manual feature engineering. For this problem, I took both approaches, doing mostly manual work in the main notebook, and then writing another notebook with automated feature engineering. Not surprisingly, the automated feature engineering took one tenth the time and achieved better performance! Here I’ll show the manual version, but keep in mind that automated feature engineering (with Featuretools) is a great tool to learn.

In this problem, our primary objective for feature engineering is to aggregate all the individual level data at the household level. That means grouping together the individuals from one house and then calculating statistics such as the maximum age, the average level of education, or the total number of cellphones owned by the family.

Fortunately, once we have separated out the individual data (into the ind dataframe), doing these aggregations is literally one line in Pandas (with idhogar the household identifier used for grouping):

# Aggregate individual data for each household
ind_agg = ind.groupby('idhogar').agg(['min', 'max', 'mean', 'sum'])

After renaming the columns, we have a lot of features that look like:

Features produced by aggregation of individual data.

The benefit of this method is that it quickly creates many features. One of the drawbacks is that many of these features might not be useful or are highly correlated (called collinear) which is why we need to use feature selection.
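The renaming step mentioned above can be sketched as follows. The toy columns age and schooling and the "-" separator are assumptions for illustration; idhogar is the household identifier from the data.

```python
import pandas as pd

# Toy individual-level data: idhogar is the household identifier,
# "age" and "schooling" are illustrative column names
ind = pd.DataFrame({
    "idhogar": ["a", "a", "b"],
    "age": [30, 5, 40],
    "schooling": [11, 0, 6],
})

ind_agg = ind.groupby("idhogar").agg(["min", "max", "mean", "sum"])

# agg with a list of functions yields two-level column names such as
# ("age", "max"); join the levels into single strings
ind_agg.columns = [f"{col}-{stat}" for col, stat in ind_agg.columns]

print(ind_agg)
```

With flat names, the aggregated table can be merged onto the household-level data on idhogar like any other dataframe.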

An alternative method to aggregations is to calculate features one at a time using domain knowledge based on what features might be useful for predicting poverty. For example, in the household data, we create a feature called warning which adds up a number of household “warning signs” ( house is a dataframe of the household variables):

# No toilet, no electricity, no floor, no water service, no ceiling
house['warning'] = 1 * (house['sanitario1'] + 
                         (house['elec'] == 0) + 
                         house['pisonotiene'] + 
                         house['abastaguano'] + 
                         (house['cielorazo'] == 0))

Violinplot of Target by Warning Value.

We can also calculate “per capita” features by dividing one value by another ( tamviv is the number of household members):

# Per capita features for household data
house['phones-per-capita'] = house['qmobilephone'] / house['tamviv']
house['tablets-per-capita'] = house['v18q1'] / house['tamviv']
house['rooms-per-capita'] = house['rooms'] / house['tamviv']
house['rent-per-capita'] = house['v2a1'] / house['tamviv']

When it comes to manual vs automated feature engineering, I think the optimal answer is a blend of both. As humans, we are limited in the features we build both by creativity — there are only so many features we can think to make — and time — there is only so much time for us to write the code. We can make a few informed features like those above by hand, but where automated feature engineering excels is when doing aggregations that can automatically build on top of other features.

(Featuretools is the most advanced open-source Python library for automated feature engineering. Here’s an article to get you started in about 10 minutes.)

Feature Selection

Once we have exhausted our time or patience making features, we apply feature selection to remove some features, trying to keep only those that are useful for the problem. “Useful” has no set definition, but there are some heuristics (rules of thumb) that we use to select features.

One method is by determining correlations between features. Two variables that are highly correlated with one another are called collinear. These are a problem in machine learning because they slow down training, create less interpretable models, and can decrease model performance by causing overfitting on the training data.

The tricky part about removing correlated features is determining the threshold of correlation for saying that two variables are too correlated. I generally try to stay conservative, using a correlation coefficient in the 0.95 or above range. Once we decide on a threshold, we use the below code to remove one out of every pair of variables with a correlation above this value:

import numpy as np
threshold = 0.95
# Create correlation matrix
corr_matrix = data.corr()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(abs(upper[column]) > threshold)]
data = data.drop(columns = to_drop)

We are only removing features that are correlated with one another. We want features that are correlated with the target (although a correlation of greater than 0.95 with the label would be too good to be true)!

There are many methods for feature selection (we’ll see another one in the experimental section near the end of the article). These can be univariate — measuring one variable at a time against the target — or multivariate — assessing the effects of multiple features. I also tend to use model-based feature importances for feature selection, such as those from a random forest.

After feature selection, we can do some exploration of our final set of variables, including making a correlation heatmap and a pairsplot.

Correlation heatmap (left) and pairsplot colored by the value of the label (right).

One point we get from the exploration is the relationship between education and poverty: as the education of a household increases (both the average and the maximum), the severity of poverty tends to decrease (1 is most severe):

Max schooling of the house by target value.

On the other hand, as the level of overcrowding — the number of people per room — increases, the severity of the poverty increases:

Household overcrowding by value of the target.

These are two actionable insights from this competition, even before we get to the machine learning: households with greater levels of education tend to have less severe poverty, and households with more people per room tend to have greater levels of poverty. I like to think about the ramifications and larger picture of a data science project in addition to the technical aspects. It can be easy to get overwhelmed with the details and then forget the overall reason you’re working on this problem.

Model Comparison

The following graph is one of my favorite results in machine learning: it displays the performance of machine learning models on many datasets, with the percentages showing how many times a particular method beat any others. (This is from a highly readable paper by Randal Olson.)

Comparison of many algorithms on 165 datasets.

What this shows is that there are some problems where even a simple Logistic Regression will beat a Random Forest or Gradient Boosting Machine. Although the Gradient Tree Boosting model generally works the best, it’s not a given that it will come out on top. Therefore, when we approach a new problem, the best practice is to try out several different algorithms rather than always relying on the same one. I’ve gotten stuck using the same model (random forest) before, but remember that no one model is always the best.

Fortunately, with Scikit-Learn, it’s easy to evaluate many machine learning models using the same syntax. While we won’t do hyperparameter tuning for each one, we can compare the models with the default hyperparameters in order to select the most promising model for optimization.

In the notebook, we try out six models spanning the range of complexity from simple (Gaussian Naive Bayes) to complex (Random Forest and Gradient Boosting Machine). Although Scikit-Learn does have a GBM implementation, it’s fairly slow and a better option is to use one of the dedicated libraries such as XGBoost or LightGBM. For this notebook, I used LightGBM and chose the hyperparameters based on what has worked well in the past.

To compare models, we calculate the cross validation performance on the training data over 5 or 10 folds. We want to use the training data because the testing data is only meant to be used once as an estimate of the performance of our final model on new data. The following plot shows the model comparison. The height of the bar is the average Macro F1 score over the folds recorded by the model and the black bar is the standard deviation:

Model cross validation comparison results.

(To see an explanation of the names, refer to the notebook. RF stands for Random Forest and GBM is Gradient Boosting Machine with SEL representing the feature set after feature selection). While this isn’t entirely a level comparison — I did not use the default hyperparameters for the Gradient Boosting Machine — the general results hold: the GBM is the best model by a large margin. This reflects the findings of most other data scientists.
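The comparison loop itself is short. The sketch below is not the notebook code: it uses synthetic data and only three Scikit-Learn models (the data, seeds, and model settings are assumptions for illustration), but it shows the pattern of scoring every candidate on the same folds with the Macro F1 metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the household features and labels
rng = np.random.default_rng(0)
y = np.repeat([1, 2, 3, 4], [30, 30, 30, 30])
X = rng.normal(size=(len(y), 4)) + y[:, None]

# Mostly default settings; max_iter is raised only to avoid convergence warnings
models = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# The same folds are reused for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    results[name] = (scores.mean(), scores.std())

for name, (mean, std) in results.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

The mean and standard deviation stored for each model correspond to the bar height and the black error bar in a plot like the one above.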

Notice that we cross-validated the data before and after feature selection to see its effect on performance. Machine learning is still largely an empirical field, and the only way to know if a method is effective is to try it out and then measure performance. It’s important to test out different choices for the steps in the pipeline — such as the correlation threshold for feature selection — to determine if they help. Keep in mind that we also want to avoid placing too much weight on cross-validation results, because even with many folds, we can still be overfitting to the training data. Finally, even though the GBM was best for this dataset, that will not always be the case!

Based on these results, we can choose the gradient boosting machine as our model (remember this is a decision we can go back and revise!). Once we decide on a model, the next step is to get the most out of it, a process known as model hyperparameter optimization.

Recognizing that not everyone has time for a 30-minute article (even on data science) in one sitting, I’ve broken this up into two parts. The second part covers model optimization, interpretation, and an experimental section.

Decision tree visualization from part two.


By this point, we can see how all the different parts of machine learning come together to form a solution: we first had to understand the problem, then we dug into the data, cleaning it as necessary, then we made features for a machine learning model, and finally we evaluated several different models.

We’ve covered many techniques and have a decent model (although the F1 score is relatively low, it places in the top 50 models submitted to the competition). Nonetheless, we still have a few steps left: through optimization, we can improve our model, and then we have to interpret our results because no analysis is complete until we’ve communicated our work.

A “Data Science for Good” Machine Learning Project Walk-Through in Python: Part Two: Getting the most from our model, figuring out what it all means, and experimenting with new techniques

Machine learning is a powerful framework that from the outside may look complex and intimidating. However, once we break down a problem into its component steps, we see that machine learning is really only a sequence of understandable processes, each one simple by itself.

In the first half of this series, we saw how we could implement a solution to a “data science for good” machine learning problem, leaving off after we had selected the Gradient Boosting Machine as our model of choice.

Model evaluation results from part one.

In this article, we’ll continue with our pipeline for predicting poverty in Costa Rica, performing model optimization, interpreting the model, and trying out some experimental techniques.

The full code is available as a Jupyter Notebook both on Kaggle (where it can be run in the browser with no downloads required) and on GitHub. This is an active Kaggle competition and a great project to get started with machine learning or to work on some new skills.

Model Optimization

Model optimization means searching for the model hyperparameters that yield the best performance — measured in cross-validation — for a given dataset. Because the optimal hyperparameters vary depending on the data, we have to optimize — also known as tuning — the model for our data. I like to think of tuning as finding the best settings for a machine learning model.

There are 4 main methods for tuning, ranked from least efficient (manual) to most efficient (automated).

  1. Manual
  2. Grid search
  3. Random search
  4. Automated (Bayesian optimization)

Naturally, we’ll skip the first three methods and move right to the most efficient: automated hyperparameter tuning. For this implementation, we can use the Hyperopt library, which does optimization using a version of Bayesian Optimization with the Tree-structured Parzen Estimator. You don’t need to understand these terms to use the model, although I did write a conceptual explanation here. (I also wrote an article for using Hyperopt for model tuning here.)

The details are a little protracted (see the notebook), but we need 4 parts for implementing Bayesian Optimization in Hyperopt:

  1. Objective function: what we want to minimize (here, a loss based on the cross-validation score)
  2. Domain space: the hyperparameter values to search over
  3. Optimization algorithm: the method for choosing the next values to evaluate
  4. Results history: a record of the hyperparameters tried and the scores they achieved

The basic idea of Bayesian Optimization (BO) is that the algorithm reasons from the past results (how well previous hyperparameters have scored) and then chooses the *next* combination of values it thinks will do best. Grid and random search are *uninformed* methods that don’t use past results; the idea is that by reasoning, BO can find better values in fewer search iterations.

See the notebook for the complete implementation, but below are the optimization scores plotted over 100 search iterations.

Model optimization scores versus iteration.

Unlike in random search, where the scores are, well, random over time, in Bayesian Optimization the scores tend to improve over time as the algorithm learns a probability model of the best hyperparameters. The idea of Bayesian Optimization is that we can optimize our model (or any function) quicker by focusing the search on promising settings. Once the optimization has finished running, we can use the best hyperparameters to cross validate the model.

Optimizing the model will not always improve our test score because we are optimizing for the *training* data. However, sometimes it can deliver a large benefit compared to the default hyperparameters. In this case, the final cross validation results are shown below in dataframe form:

Cross validation results. Models without 10Fold in name were validated with 5 folds. SEL is selected features.

The optimized model (denoted by OPT and using 10 cross validation folds with the features after selection) places right in the middle of the non-optimized variations of the Gradient Boosting Machine (which used hyperparameters I had found worked well for previous problems). This indicates we haven’t found the optimal hyperparameters yet, or there could be multiple sets of hyperparameters that perform roughly the same.

We can continue optimization to try and find even better hyperparameters, but usually the return to hyperparameter tuning is much less than the return to feature engineering. At this point we have a relatively high-performing model and we can use this model to make predictions on the test data. Then, since this is a Kaggle competition, we can submit the predictions to the leaderboard. Doing this gets us into the top 50 (at the moment) which is a nice vindication of all our hard work!

At this point, we have implemented a complete solution to this machine learning problem. Our model can make reasonably accurate predictions of poverty in Costa Rican households (the F1 score is relatively low, but this is a difficult problem). Now, we can move on to interpreting our predictions and see if our model can teach us anything about the problem. Even though we have a solution, we don’t want to lose sight of why our solution matters.

Note about Kaggle Competitions

The very nature of machine learning competitions can encourage bad practices, such as the mistake of optimizing for the leaderboard score at the cost of all other considerations. Generally this leads to using ever more complex models to eke out a tiny performance gain.

A simple model that is put in use is better than a complex model which can never be deployed. Moreover, those at the top of the leaderboard are probably overfitting to the testing data and do not have a robust model.

Interpret Model Results

In the midst of writing all the machine learning code, it can be easy to lose sight of the important questions: what are we making this model for? What will be the impact of our predictions? Thankfully, our answer this time isn’t “increasing ad revenue” but, instead, effectively predicting which households are most at risk for poverty in Costa Rica so they can receive needed help.

To try and get a sense of our model’s output, we can examine the prediction of poverty levels on a household basis for the test data. For the test data, we don’t know the true answers, but we can compare the relative frequency of each predicted class with that in the training labels. The image below shows the training distribution of poverty on the left, and the predicted distribution for the testing data on the right:

Training label distribution (left) and predicted test distribution (right). Both histograms are normalized.

Intriguingly, even though the label “non-vulnerable” is most prevalent in the training data, it is represented less often on a relative basis in the predictions. Our model predicts a higher proportion of the other 3 classes, which means that it thinks there is more severe poverty in the testing data. If we convert these fractions to numbers, we have 3929 households in the “non-vulnerable” category and 771 households in the “extreme” category.
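A comparison like this takes only a few lines of Pandas. The labels and predictions below are invented for illustration (they are not the competition results); the point is the use of normalized value counts to put the two distributions side by side.

```python
import pandas as pd

# Made-up training labels and test predictions, purely for illustration
train_labels = pd.Series([4, 4, 4, 4, 4, 4, 3, 2, 2, 1])
test_predictions = pd.Series([4, 4, 4, 4, 3, 3, 2, 2, 1, 1])

# normalize=True returns relative frequencies instead of raw counts
comparison = pd.DataFrame({
    "train": train_labels.value_counts(normalize=True),
    "predicted": test_predictions.value_counts(normalize=True),
}).sort_index()

print(comparison)
```

Multiplying the predicted fractions by the number of test households converts them back to counts.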

Another way to look at the predictions is by the confidence of the model. For each prediction on the test data, we can see not only the label, but also the probability given to it by the model. Let’s take a look at the confidence by the value of the label in a boxplot.

Boxplot of probability assigned to each label on testing data.

These results are fairly intuitive: our model is most confident in the most extreme predictions and less confident in the moderate ones. Theoretically, there should be more separation between the most extreme labels, while the targets in the middle should be more difficult to tease apart.

Another point to draw from this graph is that overall, our model is not very sure of the predictions. A guess with no data would place 0.25 probability on each class, and we can see that even for the least extreme poverty, our model rarely has more than 40% confidence. What this tells us is this is a tough problem — there is not much to separate the classes in the available data.
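To show what “confidence” means here, the sketch below uses a made-up probability matrix (in practice it would come from something like the model’s predict_proba method, which I am assuming rather than quoting from the notebook). The confidence of a prediction is simply the largest class probability in its row.

```python
import numpy as np

# Made-up class probabilities for three households (each row sums to 1);
# columns correspond to poverty levels 1 to 4
probabilities = np.array([
    [0.40, 0.30, 0.20, 0.10],
    [0.20, 0.25, 0.30, 0.25],
    [0.10, 0.15, 0.25, 0.50],
])

predicted_label = probabilities.argmax(axis=1) + 1  # labels start at 1
confidence = probabilities.max(axis=1)              # probability of the chosen label

# A guess with no information spreads probability evenly over the classes
baseline = 1 / probabilities.shape[1]

print(predicted_label, confidence, baseline)
```

Comparing each confidence value with the 0.25 baseline gives a quick sense of how much the model has actually learned about a household.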

Ideally, these predictions, or those from the winning model in the competition, will be used to determine which families are most likely to need assistance. However, just the predictions alone do not tell us what may lead to the poverty or how our model “thinks”. While we can’t completely solve this problem yet, we can try to peer into the black box of machine learning.

In a tree-based model such as the Gradient Boosting Machine, the feature importances represent the sum total reduction in Gini impurity for nodes split on a feature. I never find the absolute values very helpful, but instead normalize the numbers and look at them on a relative basis. For example, below are the 10 most important features from the optimized GBM model.

Most important features from optimized gradient boosting machine.
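Normalizing the importances is a one-liner. The raw numbers and the feature list below are invented for illustration; with a fitted Scikit-Learn-style tree model, the raw values would typically come from its feature_importances_ attribute.

```python
import pandas as pd

# Made-up raw importances keyed by feature name (illustrative values only)
raw = pd.Series({"meaneduc": 120, "age-max": 90, "tamviv": 60, "rooms": 30})

# Divide by the total so the importances sum to 1, then rank them
normalized = (raw / raw.sum()).sort_values(ascending=False)
top = normalized.head(3)

print(top)
```

On a relative scale it is easier to see, for example, that one feature carries twice the weight of another, regardless of how the library scales the raw values.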

Here we can see education and ages of family members making up the bulk of the most important features. Looking further into the importances, we also see the size of the family. This echoes findings by poverty researchers: family size is correlated with more extreme poverty, and education level is *inversely* correlated with poverty. In both cases, we don’t necessarily know which causes which, but we can use this information to highlight which factors should be further studied. Hopefully, this data can then be used to further reduce poverty (which has been decreasing steadily for the last 25 years).

It’s true: the world is better now than ever and still improving (source).

In addition to potentially helping researchers, we can use the feature importances for further feature engineering by trying to build more features on top of these. An example using the above results would be taking the meaneduc and dividing by the dependency to create a new feature. While this may not be intuitive, it’s hard to tell ahead of time what will work for a model.
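The ratio feature mentioned above could be built along these lines. Only the column names meaneduc and dependency come from the data; the toy rows, the new column name, and the choice to treat a zero dependency as missing are my assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; only the column names come from the dataset
households = pd.DataFrame({
    "meaneduc": [6.0, 11.5, 9.0, 0.0],
    "dependency": [0.5, 2.0, 0.0, 1.0],
})

# Avoid dividing by zero: treat a zero dependency as missing (an assumption)
safe_dependency = households["dependency"].replace(0, np.nan)

# Hypothetical name for the engineered feature
households["meaneduc_per_dependency"] = households["meaneduc"] / safe_dependency

print(households)
```

How the resulting missing values should be handled (left as-is, imputed, or flagged) is a modeling choice that depends on the estimator.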

An alternative method to using the testing data to examine our model is to split the training data into a smaller training set and a validation set. Because we have the labels for all the training data, we can compare our predictions on the holdout validation data to the true values. For example, using 1000 observations for validation, we get the following confusion matrix:

Confusion matrix on validation data.
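A holdout evaluation of this kind might look like the sketch below. It assumes synthetic data and a Scikit-Learn GradientBoostingClassifier in place of the real training set and model; the 1000-observation validation size matches the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic four-class data standing in for the labeled training set (assumption)
X, y = make_classification(n_samples=3000, n_features=20, n_informative=6,
                           n_classes=4, random_state=1)

# Hold out 1000 observations for validation, keeping the class balance
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=1000, stratify=y, random_state=1)

model = GradientBoostingClassifier(n_estimators=50, random_state=1)
model.fit(X_train, y_train)
predictions = model.predict(X_valid)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_valid, predictions)
accuracy = accuracy_score(y_valid, predictions)

print(cm)
print(f"Validation accuracy: {accuracy:.3f}")
```

The accuracy is just the sum of the diagonal divided by the total number of validation observations.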

The values on the diagonal are those the model *predicted correctly* because the predicted label is the same as the true label. Anything off the diagonal the model predicted incorrectly. We can see that our model is the best at identifying the non-vulnerable households, but is not very good at discerning the other labels.

As one example, our model incorrectly classifies 18 households as non-vulnerable which are in fact in extreme poverty. Predictions like these have real-world consequences because those might be families that, as a result of this model, would not receive help. (For more on the consequences of incorrect algorithms, see Weapons of Math Destruction.)

Overall, this mediocre performance — the model accuracy is about 60% which is much better than random guessing but not exceptional — suggests this problem may be difficult. It could be there is not enough information to separate the classes within the available data.

One recommendation for the host organization, the Inter-American Development Bank, is that we need more data to better solve this problem. That could come either as more features (more questions on the survey) or as more observations (a greater number of households surveyed). Either would require significant effort, but the best return on time invested in a data science project generally comes from gathering greater quantities of high-quality labeled data.

There are other methods we can use for model understanding, such as Local Interpretable Model-agnostic Explanations (LIME), which uses a simpler linear model to approximate the model around a prediction. We can also look at individual decision trees in a forest, which are typically straightforward to parse because they essentially mimic a human decision-making process.

Individual Decision Tree in Random Forest.
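One way to read a single tree is to print its rules as text. This is a small sketch using Scikit-Learn's export_text on a shallow forest fit to synthetic data with hypothetical feature names; it is not the code that produced the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic data and hypothetical feature names (assumptions for illustration)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_classes=4, random_state=2)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Keep the trees shallow so the printed rules stay readable
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=2)
forest.fit(X, y)

# Each fitted tree in the forest is available through estimators_
first_tree = forest.estimators_[0]
rules = export_text(first_tree, feature_names=feature_names)
print(rules)
```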

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.

Exploratory Techniques

We’ve already solved the machine learning problem with a standard toolbox, so why go further into exploratory techniques? Well, if you’re like me, then you enjoy learning new things just for the sake of learning. What’s more, the exploratory techniques of today will be the standard tools of tomorrow.

For this project, I decided to try out two new (to me) techniques:

  • Recursive Feature Elimination for feature selection
  • Uniform Manifold Approximation and Projection (UMAP) for dimension reduction and visualization

Recursive Feature Elimination

Recursive feature elimination is a method for feature selection that uses a model’s feature importances (here, those of a random forest) to select features. The process is iterative: at each iteration, the least important features are removed. The optimal number of features to keep is determined by cross validation on the training data.

Recursive feature elimination is simple to use with Scikit-Learn’s RFECV class. It wraps an estimator (a model) and is then fit like any other Scikit-Learn object. The scorer is needed to supply a custom scoring metric, in this case the Macro F1 score.

from sklearn.metrics import f1_score, make_scorer
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
# Custom scorer for cross validation
scorer = make_scorer(f1_score, greater_is_better=True, average = 'macro')
# Create a model for feature selection
estimator = RandomForestClassifier(n_estimators = 100, n_jobs = -1)
# Create the object
selector = RFECV(estimator, step = 1, cv = 3,
                 scoring = scorer, n_jobs = -1)
# Fit on training data
selector.fit(train, train_labels)
# Transform data
train_selected = selector.transform(train)
test_selected = selector.transform(test)

While I’ve used feature importances for selection before, I’d never implemented the Recursive Feature Elimination method, and as usual, was pleasantly surprised at how easy this was to do in Python. The RFECV method selected 58 out of around 190 features based on the cross validation scores:

Recursive Feature Elimination Scores.

The selected set of features was then tried out to compare the cross validation performance with the original set of features. (The final results are presented after the next section.) Given the ease of using this method, I think it’s a good tool to have in your skill set for modeling. Like any other Scikit-Learn operation, it can fit into a [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html "Pipeline"), allowing you to quickly execute a complete series of preprocessing and modeling operations.
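As a sketch of what that could look like, the snippet below chains RFECV with a final classifier in a Pipeline. The synthetic data, the step names, and the small estimator sizes are assumptions chosen to keep the example quick, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import f1_score, make_scorer
from sklearn.pipeline import Pipeline

# Synthetic four-class data (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           n_classes=4, random_state=3)

scorer = make_scorer(f1_score, average="macro")

# Feature selection and modeling run as one object
pipeline = Pipeline([
    ("select", RFECV(RandomForestClassifier(n_estimators=20, random_state=3),
                     step=2, cv=3, scoring=scorer)),
    ("model", GradientBoostingClassifier(n_estimators=30, random_state=3)),
])

pipeline.fit(X, y)

# Number of features kept by the selection step
n_selected = pipeline.named_steps["select"].n_features_
predictions = pipeline.predict(X)
print(f"Selected {n_selected} of {X.shape[1]} features")
```

Because selection happens inside the pipeline, calling `predict` on new data applies the same feature mask automatically.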

Dimension Reduction for Visualization

There are a number of unsupervised methods in machine learning for dimension reduction. These fall into two general categories:

  • Matrix decomposition algorithms: PCA and ICA
  • Embedding techniques that map data onto low-dimensional manifolds: t-SNE

Typically, PCA (Principal Components Analysis) and ICA (Independent Components Analysis) are used both for visualization and as a preprocessing step for machine learning, while manifold methods like t-SNE (t-Distributed Stochastic Neighbor Embedding) are used only for visualization because they are highly dependent on hyperparameters and do not preserve distances within the data. (In Scikit-Learn, the t-SNE implementation does not have a transform method, which means we can’t use it for modeling.)
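The practical difference shows up in the API. The sketch below, on random stand-in data, fits PCA and FastICA (Scikit-Learn's ICA implementation) on a training set and applies them to a test set, then checks that TSNE offers no transform method.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE

# Random non-Gaussian data standing in for the train and test features (assumption)
rng = np.random.RandomState(4)
train = rng.laplace(size=(200, 10))
test = rng.laplace(size=(50, 10))

shapes = {}
reducers = [
    ("PCA", PCA(n_components=3)),
    ("ICA", FastICA(n_components=3, random_state=4, max_iter=1000)),
]
for name, reducer in reducers:
    # Learn the projection on the training data, then reuse it on the test data
    train_reduced = reducer.fit_transform(train)
    test_reduced = reducer.transform(test)
    shapes[name] = (train_reduced.shape, test_reduced.shape)

# TSNE only provides fit and fit_transform, so it cannot embed unseen data
tsne = TSNE(n_components=3)
tsne_can_transform = hasattr(tsne, "transform")

print(shapes)
print(f"TSNE has transform: {tsne_can_transform}")
```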

A new entry on the dimension reduction scene is UMAP: Uniform Manifold Approximation and Projection. It aims to map the data to a low-dimensional manifold, so it’s an embedding technique, while simultaneously preserving global structure in the data. Although the math behind it is rigorous, it can be used like a Scikit-Learn method with a [fit](https://github.com/lmcinnes/umap "fit") and [transform](https://github.com/lmcinnes/umap "transform") call.

I wanted to try these methods both for dimension reduction for visualization and to add the reduced components as *additional features*. While this use case might not be typical, there’s no harm in experimenting! The code below uses UMAP to create embeddings of both the training and testing data.

from umap import UMAP
n_components = 3
# Use default parameters
reducer = UMAP(n_components=n_components)
# Fit on the training data, then apply the same embedding to the testing data
train_reduced = reducer.fit_transform(train)
test_reduced = reducer.transform(test)

The application of the other three methods is exactly the same (except TSNE which cannot be used to transform the testing data). After completing the transformations, we can visualize the reduced training features in 3 dimensions, with the points colored by the value of the target:

Dimension Reduction Visualizations

None of the methods cleanly separates the data based on the label, which is in line with the findings of other data scientists. As we discovered earlier, it may be that this problem is difficult considering the data to which we have access. Although these graphs cannot be used to say whether or not we can solve a problem, if there is a clean separation, then it indicates that there is *something* in the data that would allow a model to easily discern each class.

As a final step, we can add the reduced features to the set of features after applying feature selection to see if they are useful for modeling. (Usually dimension reduction is applied and then the model is trained on just the reduced dimensions). The performance of every single model is shown below:

Final model comparison results.

The model using the dimension reduction features has the suffix DR while the number of folds following the GBM refers to the number of cross validation folds. Overall, we can see that the selected set of features (SEL) does slightly better, and adding in the dimension reduction features hurts the model performance! It’s difficult to conclude too much from these results given the large standard deviations, but we *can* say that the Gradient Boosting Machine significantly outperforms all other models and the feature selection process improves the cross validation performance.
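A comparison of this kind can be sketched as follows. I assume synthetic data, PCA as a stand-in for the dimension reduction methods, and a random forest instead of the GBM; only the SEL and DR labels are borrowed from the results above, and the scores printed here have no bearing on the real ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic data playing the role of the selected features (assumption)
X, y = make_classification(n_samples=400, n_features=12, n_informative=5,
                           n_classes=4, random_state=5)

# PCA stands in for UMAP so the sketch needs only Scikit-Learn
components = PCA(n_components=3).fit_transform(X)

# Append the reduced components as additional columns
X_with_dr = np.hstack([X, components])

scorer = make_scorer(f1_score, average="macro")
model = RandomForestClassifier(n_estimators=50, random_state=5)

results = {}
for name, features in [("SEL", X), ("SEL_DR", X_with_dr)]:
    scores = cross_val_score(model, features, y, cv=5, scoring=scorer)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean is what lets us judge whether a difference between feature sets is larger than the fold-to-fold noise.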

The experimental part of this notebook was probably the most enjoyable for me. It’s not only important to always be learning to stay ahead in the data science field, but it’s also enjoyable for the sake of learning something new.

Next Steps

Despite this exhaustive coverage of machine learning tools, we have not yet reached the end of methods to apply to this problem!

Some additional steps we could take are:

  1. Understand the problem and data descriptions
  2. Data cleaning / exploratory data analysis
  3. Feature engineering / feature selection
  4. Model comparison
  5. Model optimization
  6. Interpretation of results

The great part about a Kaggle competition is you can read about many of these cutting-edge techniques in other data scientists’ notebooks. Moreover, these contests give us realistic datasets in a non-mission-critical setting, which is a perfect environment for experimentation.

As one example of the ability of competitions to better machine learning methods, the ImageNet Large Scale Visual Recognition Challenge led to significant improvements in convolutional neural networks.

ImageNet competitions have led to state-of-the-art convolutional neural networks.


Data science and machine learning are not incomprehensible methods: instead, they are sequences of straightforward steps that combine into a powerful solution. By walking through a problem one step at a time, we can learn how to build the entire framework. How we use this framework is ultimately up to us. We don’t have to dedicate our lives to helping others, but it is rewarding to take on a challenge with a deeper meaning.

In this article, we saw how we could apply a complete machine learning solution to a data-science-for-good problem, building a machine learning model to predict poverty levels in Costa Rica.

Our approach followed a sequence of processes (1–4 were in part one):

  1. Understand the problem and data descriptions
  2. Data cleaning / exploratory data analysis
  3. Feature engineering / feature selection
  4. Model comparison
  5. Model optimization
  6. Interpretation of results

Finally, if after all that you still haven’t got your fill of data science, you can move on to exploratory techniques and learn something new!

As with any process, you’ll only improve as you practice. Competitions are valuable for the opportunities they provide us to employ and develop skills. Moreover, they encourage discussion, innovation, and collaboration, leading both to more capable individual data scientists and a better community. Through this data science project, we not only improve our skills, but also make an effort to improve outcomes for our fellow humans.