# Top 3 Machine Learning Algorithms You Need to Know

Top 3 Machine Learning Algorithms You Need to Know: Decision trees vs. clustering algorithms vs. linear regression - Which machine learning algorithms should you use, why, and when?

Top 3 Machine Learning Algorithms You Need to Know: Decision trees vs. clustering algorithms vs. linear regression - Which machine learning algorithms should you use, why, and when?

Imagine you have some data-related problem that you want to solve. You have heard of all the amazing things that machine learning algorithms can achieve and want to try it for yourself — but you have no prior experience or knowledge in this area. You start googling some terms like “machine learning models” and “machine learning methodologies,” but after some time, you find yourself ready to give up, completely lost somewhere between the different algorithms.

Hold on, you brave guy!

Lucky for you, I’m going to walk you through three major categories of machine learning algorithms that will allow you to solve confidently a large range of data science problems.

In the following post, we are going to talk about decision trees, clustering algorithms, and regression, point out their differences, and figure out how to choose the most suitable model for your case.

Supervised vs. Unsupervised Learning

The cornerstone of your understanding is the categorization of the problems in two major classes, supervised and unsupervised learning, as any machine learning problem can be assigned to one of them.

In the case of supervised learning, we have a dataset that will be given to some algorithm as input. We already know what the format of the correct output should look like (assuming that there is some relationship between the input and the output).

As we will see later on, regression and classification problems belong to this category.

On the other hand, unsupervised learning is applied in cases where we have no clue what the output should look like. In fact, we derive structures from data where the effect of the input variables is not known. Clustering problems are the major representatives of this class.

To make the above categorization more clear I’ll give you some real-world problems and try to classify them accordingly.

### Example 1

Imagine you are running a real estate office. Given a new house’s characteristics, you want to predict at what price it is going to be sold based on previous sales of other houses that you have recorded. Your input dataset consists of the characteristics of multiple houses, like the number of bathrooms and lot size, while the variable you want to predict, often called target variable, is the price. Since you know the prices at which the houses of the dataset were sold, this is a supervised learning problem — and more specifically, a regression one.

### Example 2

Consider the results of a medical experiment that aims to predict whether someone is going to develop myopia based on some physical measurements and heredity. In this case, the input dataset consists of the person’s medical characteristics and the target variable is binary: 1 for those who are likely to develop myopia and 0 for those who aren’t. Since you know in advance the value of the target variable of the experiment’s participants (i.e. you know if they were myopic), this is again a supervised learning problem — and more specifically, a classification one.

### Example 3

Suppose you have a company with many customers. Based on their recent interactions with your company, the products they’ve bought lately, and their demographics, you want to form groups of similar customers in order to handle them differently — i.e. offer exclusive discount coupons to some them. In this case, you will use the characteristics mentioned above as input into some algorithm and the algorithm will decide the number or kind of groups that should be formed. That’s a clear example of unsupervised learning problem since we do not have any a priori clue about how the output should look like.

That being said, it’s time to fulfill my promise and move on to some more specific algorithms…

Regression

To begin with, regression is not a single supervised technique but instead a whole category to which many techniques belong.

The main idea in regression is that given some input variables, we want to predict the value of the target. In the case of regression, the target variable is continuous — meaning that it can take any value within a specified range. Input variables, on the other hand, can be either discrete or continuous.

Among regression techniques, the most popular are linear and logistic. Let’s have a closer look at them.

### Linear Regression

Inlinear regression, we are trying to establish a relationship between the input variables and the target, which is expressed by a straight line, usually called regression line.

For example, assuming that we have two input variables X1 and X2 and one target variable Y, this relationship can be mathematically expressed like:

Y = a * X1 + b*X2 +c

Given the values of X1 and X2, we aim to tune a, b, and c so that Y will be as close to the actual value as possible.

So, example time!

Suppose we have the famous Iris dataset, which contains measurements regarding the size of the sepal and the petal of iris flowers belonging to different types, i.e. Setosa, Versicolor, and Virginica.

Using R software, we are going to implement a linear regression to predict the length of the sepal given the petal’s width and length.

Mathematically, we are going the values a and b in the following relationship:

SepalLength = a * PetalWidth + b PetalLength +c*

The corresponding code is:

``````# Load required packages
library(ggplot2)
data(iris)
# Have a look at the first 10 observations of the dataset
# Fit the regression line
fitted_model <- lm(Sepal.Length ~ Petal.Width + Petal.Length, data = iris)
# Get details about the parameters of the selected model
summary(fitted_model)
# Plot the data points along with the regression line
ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
geom_point(alpha = 6/10) +
stat_smooth(method = "lm", fill="blue", colour="grey50", size=0.5, alpha = 0.1)

``````

The results of the linear regression are shown in the following graph, where the black points represent the initial data points while the blue line the fitted regression line, which is plotted for the estimated values a = -0.31955, b = 0.54178, and c = 4.19058 that result in the best possible predictions for the target variable, i.e. the sepal length.

From now on, we will be able to make predictions regarding sepal length for new points just by applying the values of petal length and petal width into the defined linear relationship.

### Logistic Regression

The main idea here is exactly the same as it was with linear regression. The difference is that the regression line is not straight anymore.

Instead, the mathematical relationship we are trying to establish is of the following form:

Y=g(a*X1+bX2)*

…where g() is the logistic function.

Due to the properties of the logistic function, Y is continuous, ranging in [0,1] and can be interpreted as the probability of an event to happen.

I know you love examples so I’ll show you one more!

This time, we will experiment with the mtcars dataset, which comprises fuel consumption and ten aspects of automobile design and performance for 32 automobiles manufactured in 1973-1974.

Using R, we are going to predict the probability of an automobile of having automatic transmission (am = 0) or manual (am = 1) based on the measurements of V/S and Miles/(US) gallon.

*am = g(a * mpg + b vs +c)**:

``````# Load required packages
library(ggplot2)
data(mtcars)
# Keep a subset of the data features that includes on the measurement we are interested in
cars <- subset(mtcars, select=c(mpg, am, vs))
# Fit the logistic regression line
fitted_model <- glm(am ~ mpg+vs, data=cars, family=binomial(link="logit"))
# Plot the results
ggplot(cars, aes(x=mpg, y=vs, colour = am)) + geom_point(alpha = 6/10) +
stat_smooth(method="glm",fill="blue", colour="grey50", size=0.5, alpha = 0.1, method.args=list(family="binomial"))

``````

The results are shown in the following graph, where the black points represent the initial points of the dataset and the blue line the fitted logistic regression line for estimated a = 0.5359, b = -2.7957, and c = - 9.9183.

We can observe that, as mentioned before, the logistic regression output values will lie only in the range [0,1] due to the form of the regression line.

For any new automobile with given measurements for V/S and Miles/(US) gallon, we can now predict the probability that this car will have automatic transmission. Isn’t that utterly awesome?

Decision Trees

Decision trees are the second type of machine learning algorithms that we are going to investigate. They are divided in regression and classification trees, and thus can be used in supervised learning problems.

Admittedly, decision trees are one of the most intuitive algorithms, as they mimic the way people decide in many cases. What they basically do is draw a “map” of all possible paths along with the corresponding result in each case.

A graphical representation will help understand better what exactly we are talking about.

Based on a tree like this, the algorithm can decide which path to follow at each step depending on the value of the corresponding criterion. The way that the algorithm chooses the splitting criteria along with the respective thresholds at each level depends on how informative the candidate variable is for the target variable and which setup minimizes the error of the prediction produced.

Yet another example is here!

This time, the dataset under examination is readingSkills. It includes information about students who took a certain exam along with their score.

We are going to classify students into native speakers (`nativeSpeaker` = 1) or foreigners (`nativeSpeaker`= 0) based on multiple measures, including their score in the test, their shoe size, and their age.

For this implementation in R, we first need to install the `party` package.

``````# Include required packages
library(party)
library(partykit)
# Have a look at the first ten observations of the dataset
# Grow the decision tree
output.tree <- ctree(
nativeSpeaker ~ age + shoeSize + score,
data = input.dat)
# Plot the results
plot(as.simpleparty(output.tree))

``````

We can see that the first splitting criterion used was the score, due to its high importance in predicting the target variable, whereas shoe size wasn’t taken into consideration at all as it didn’t provide any helpful information regarding the language.

Now, if we have a new student and know their age and score on the test, we can predict whether they are a native speaker!

Clustering Algorithms

Until now, we have talked only a little bit about supervised learning problems. Now, we are moving on to the study of clustering algorithms, a subset of unsupervised learning methods.

So, just a small revision…

With clustering, if we have some initial data at our disposal, we want to form groups so that the data points belonging to some group are similar and are different from data points of the other groups.

The algorithm we are going to study is called “k-means” where k represents the number of produced clusters and is one of the most popular clustering methods.

Remember the Iris dataset we used in the example before? We are going to use it again.

For the purpose of our study, we have plotted all the data points from the dataset using their petal measures, as shown below:

Based solely on petal measures, we are going to cluster the data points into three groups using 3-means clustering.

So how does the 3-means, or more generally, the k-means algorithm, work? The whole process can be summarized in just a few simple steps:

1. Initialization step: For k=3 clusters, the algorithm randomly selects three points as centroids for each cluster.
2. Cluster assignment step: The algorithm goes through the rest of the data points and assigns each one of them to the closest cluster.
3. Centroid move step: After cluster assignment, the centroid of each cluster is moved to the average of all points belonging to the cluster.

Steps 2 and 3 are repeated multiple times until there is no change to be made regarding cluster assignments. The implementation of the k-means algorithm in R is easy and can be done with the following code:

``````# Load required packages
library(ggplot2)
library(datasets)
data(iris)
# Set seed to make results reproducible
set.seed(20)
# Implement k-means with 3 clusters
iris_cl <- kmeans(iris[, 3:4], 3, nstart = 20)
iris_cl\$cluster <- as.factor(iris_cl\$cluster)
# Plot points colored by predicted cluster
ggplot(iris, aes(Petal.Length, Petal.Width, color = iris_cl\$cluster)) + geom_point()

``````

From the result, we can see that the algorithm split the data into three clusters, indicated by three different colors. We can also observe that these clusters were formed based on the size of the petals. More specifically, the red cluster includes flowers with small petals, the green cluster, iris with relatively large petals, while the blue one, iris with medium sized petals.

What is important to note at this point is that in any clustering, the interpretation of the formed groups requires some expert knowledge in the field. In our case, if you are not a botanist, you probably won’t realize that what the k-means managed to do is cluster the iris into their different types, i.e. Setosa, Versicolor, and Virginica without having any knowledge about them!

So if we plot the data again, this time colored by their species, we will see the similarity in the clusters.

Summary

We have come a long way from where we began. We’ve talked about regression (both linear and logistic), decision trees, and finally, k-means clustering. We also set up some simple, yet powerful, implementations of each one of these methods in R.

So, what are the advantages of each algorithm over the others? Which one should you choose for a real-life problem?

Firstly, the presented methods are not some toy algorithms — they are widely used in production systems around the world, and so depending on the task, they can be pretty powerful.

Secondly, in order to answer the above question, you have to be clear what exactly you mean by advantages, as the relative benefits of each method can be expressed in many contexts, i.e. interpretability, robustness, computational time, etc.

Assuming that for the time being, we care only about the method’s appropriateness and predictive performance, a short summary of each method’s pros and cons follows:

Now, we are finally ready to confidently implement this knowledge to some real-world problems!

## Best Python Libraries For Data Science & Machine Learning

Best Python Libraries For Data Science & Machine Learning | Data Science Python Libraries

This video will focus on the top Python libraries that you should know to master Data Science and Machine Learning. Here’s a list of topics that are covered in this session:

• Introduction To Data Science And Machine Learning
• Why Use Python For Data Science And Machine Learning?
• Python Libraries for Data Science And Machine Learning
• Python libraries for Statistics
• Python libraries for Visualization
• Python libraries for Machine Learning
• Python libraries for Deep Learning
• Python libraries for Natural Language Processing

If you liked this post, share it with all of your programming buddies!

## Python Programming for Data Science and Machine Learning

This article provides an overview of Python and its application to Data Science and Machine Learning and why it is important.

Python is a general-purpose, high-level, object-oriented, and easy to learn programming language. It was created by Guido van Rossum who is known as the godfather of Python.

Python is a popular programming language because of its simplicity, ease of use, open source licensing, and accessibility — the foundation of its renowned community, which provides great support and help in creating tons of packages, tutorials, and sample programs.

Python can be used to develop a wide variety of applications — ranging from Web, Desktop GUI based programs/applications to science and mathematics programs, and Machine learning and other big data computing systems.

Let’s explore the use of Python in Machine Learning, Data Science, and Data Engineering.

Machine Learning

Machine learning is a relatively new and evolving system development paradigm that has quickly become a mandatory requirement for companies and programmers to understand and use. See our previous article on Machine Learning for the background. Due to the complex, scientific computing nature of machine learning applications, Python is considered the most suitable programming language. This is because of its extensive and mature collection of mathematics and statistics libraries, extensibility, ease of use and wide adoption within the scientific community. As a result, Python has become the recommended programming language for machine learning systems development.

Data Science

Data science combines cutting edge computer and storage technologies with data representation and transformation algorithms and scientific methodology to develop solutions for a variety of complex data analysis problems encompassing raw and structured data in any format. A Data Scientist possesses knowledge of solutions to various classes of data-oriented problems and expertise in applying the necessary algorithms, statistics, and mathematic models, to create the required solutions. Python is recognized among the most effective and popular tools for solving data science related problems.

Data Engineering

Data Engineers build the foundations for Data Science and Machine Learning systems and solutions. Data Engineers are technology experts who start with the requirements identified by the data scientist. These requirements drive the development of data platforms that leverage complex data extraction, loading, and transformation to deliver structured datasets that allow the Data Scientist to focus on solving the business problem. Again, Python is an essential tool in the Data Engineer’s toolbox — one that is used every day to architect and operate the big data infrastructure that is leveraged by the data scientist.

Use Cases for Python, Data Science, and Machine Learning

Here are some example Data Science and Machine Learning applications that leverage Python.

• Netflix uses data science to understand user viewing pattern and behavioral drivers. This, in turn, helps Netflix to understand user likes/dislikes and predict and suggest relevant items to view.
• Amazon, Walmart, and Target are heavily using data science, data mining and machine learning to understand users preference and shopping behavior. This assists in both predicting demands to drive inventory management and to suggest relevant products to online users or via email marketing.
• Spotify uses data science and machine learning to make music recommendations to its users.
• Spam programs are making use of data science and machine learning algorithm(s) to detect and prevent spam emails.

This article provided an overview of Python and its application to Data Science and Machine Learning and why it is important.

============================================

## A “Data Science for Good“ Machine Learning Project Walk-Through in Python

In this article and the sequel, we’ll walk through a complete machine learning project on a “Data Science for Good” problem: predicting household poverty in Costa Rica. Not only do we get to improve our data science skills in the most effective manner — through practice on real-world data — but we also get the reward of working on a problem with social benefits.

In this article and the sequel, we’ll walk through a complete machine learning project on a “Data Science for Good” problem: predicting household poverty in Costa Rica. Not only do we get to improve our data science skills in the most effective manner — through practice on real-world data — but we also get the reward of working on a problem with social benefits.

A “Data Science for Good“ Machine Learning Project Walk-Through in Python: Part One: Solving a complete machine learning problem for societal benefit

Data science is an immensely powerful tool in our data-driven world. Call me idealistic, but I believe this tool should be used for more than getting people to click on ads or spend more time consumed by social media.

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.
The full code is available as a Jupyter Notebook both on Kaggle (where it can be run in the browser with no downloads required) and on GitHub. This is an active Kaggle competition and a great project to get started with machine learning or to work on some new skills.

### Problem and Approach

The Costa Rican Household Poverty Level Prediction challenge is a data science for good machine learning competition currently running on Kaggle. The objective is to use individual and household socio-economic indicators to predict poverty on a household basis. IDB, the Inter-American Development Bank, developed the problem and provided the data with the goal of improving upon traditional methods for identifying families at need of aid.

The Costa Rican Poverty Prediction contest is currently running on Kaggle.

The poverty labels fall into four levels making this a supervised multi-class classification problem:

• Supervised: given the labels for the training data
• Multi-Class Classification: labels are discrete with more than 2 values

The general approach to a machine learning problem is:

1. Understand the problem and data descriptions
2. Data cleaning / exploratory data analysis
3. Feature engineering / feature selection
4. Model comparison
5. Model optimization
6. Interpretation of results

While these steps may seem to present a rigid structure, the machine learning process is non-linear, with parts repeated multiple times as we get more familiar with the data and see what works. It’s nice to have an outline to provide a general guide, but we’ll often return to earlier parts of the process if things aren’t working out or as we learn more about the problem.

We’ll go through the first four steps at a high-level in this article, taking a look at some examples, with the full details available in the notebooks. This problem is a great one to tackle both for beginners — because the dataset is manageable in size — and for those who already have a firm footing because Kaggle offers an ideal environment for experimenting with new techniques.

### Understanding the Problem and Data

In an ideal situation, we’d all be experts in the problem subject with years of experience to inform our machine learning. In reality, we often work with data from a new field and have to rapidly acquire knowledge both of what the data represents and how it was collected.

Fortunately, on Kaggle, we can use the work shared by other data scientists to get up to speed relatively quickly. Moreover, Kaggle provides a discussion platform where you can ask questions of the competition organizers. While not exactly the same as interacting with customers at a real job, this gives us an opportunity to figure out what the data fields represent and any considerations we should keep in mind as we get into the problem.

Some good questions to ask at this point are:

• Supervised: given the labels for the training data
• Multi-Class Classification: labels are discrete with more than 2 values

For example, after engaging in discussions with the organizers, the community found out the text string “yes” actually maps to the value 1.0 and that the maximum value in one of the columns should be 5 which can be used to correct outliers. We would have been hard-pressed to find out this information without someone who knows the data collection process!

Part of data understanding also means digging into the data definitions. The most effective way is literally to go through the columns one at a time, reading the description and making sure you know what the data represents. I find this a little dull, so I like to mix this process with data exploration, reading the column description and then exploring the column with stats and figures.

For example, we can read that `meaneduc` is the average amount of education in the family, and then we can plot it distributed by the value of the label to see if it has any noticeable differences between the poverty level .

Average schooling in family by target (poverty level).

This shows that families the least at risk for poverty — non-vulnerable — tend to have higher average education levels than those most at risk. Later in feature engineering, we can use this information by building features from the education since it seems to show a different between the target labels.

There are a total of 143 columns (features), and while for a real application, you want to go through each with an expert, I didn’t exhaustively explore all of these in the notebook. Instead, I read the data definitions and looked at the work of other data scientists to understand most of the columns.

Another point to establish from the problem and data understanding stage is how we want to structure our training data. In this problem, we’re given a single table of data where each row represents an individual and the columns are the features. If we read the problem definition, we are told to make predictions for each household which means that our final training dataframe (and also testing) should have one row for each house. This point informs our entire pipeline, so it’s crucial to grasp at the outset.

A snapshot of the data where each row is one individual.

#### Determine the Metric

Finally, we want to make sure we understanding the labels and the metric for the problem. The label is what we want to predict, and the metric is how we’ll evaluate those predictions. For this problem, the label is an integer, from 1 to 4, representing the poverty level of a household. The metric is the Macro F1 Score, a measure between 0 and 1 with a higher value indicating a better model.** **The F1 score is a common metric for binary classification tasks and “Macro” is one of the averaging options for multi-class problems.

Once you know the metric, figure out how to calculate it with whatever tool you are using. For Scikit-Learn and the Macro F1 score, the code is:

``````from sklearn.metrics import f1_score
# Code to compute metric on predictions
score = f1_score(y_true, y_prediction, average = 'macro')

``````

Knowing the metric allows us to assess our predictions in cross validation and using a hold-out testing set, so we know what effect, if any, our choices have on performance. For this competition, we are given the metric to use, but in a real-world situation, we’d have to choose an appropriate measure ourselves.

### Data Exploration and Data Cleaning

Data exploration, also called Exploratory Data Analysis (EDA), is an open-ended process where we figure out what our data can tell us. We start broad and gradually hone in our analysis as we discover interesting trends / patterns that can be used for feature engineering or find anomalies. Data cleaning goes hand in hand with exploration because we need to address missing values or anomalies as we find them before we can do modeling.

For an easy first step of data exploration, we can visualize the distribution of the labels for the training data (we are not given the testing labels).

Distribution of training labels.

Right away this tells us we have an imbalanced classification problem, which can make it difficult for machine learning models to learn the underrepresented classes. Many algorithms have ways to try and deal with this, such as setting `class_weight = "balanced"` in the Scikit-Learn random forest classifier although they don’t work perfectly. We also want to make sure to use stratified sampling with cross validation when we have an imbalanced classification problem to get the same balance of labels in each fold.

To get familiar with the data, it’s helpful to go through the different column data types which represent different statistical types of data:

• Supervised: given the labels for the training data
• Multi-Class Classification: labels are discrete with more than 2 values

I’m using *statistical type *to mean what the data represents — for example a Boolean that can only be 1 or 0 — and *data type *to mean the actual way the values are stored in Python such as integers or floats. The statistical type informs how we handle the columns for feature engineering.

(I specified *usually *for each data type / statistical type pairing because you may find that statistical types are saved as the wrong data type.)

If we look at the integer columns for this problem, we can see that most of them represent Booleans because there are only two possible values:

Integer columns in data.

Going through the object columns, we are presented with a puzzle: 2 of the columns are Id variables (stored as strings), but 3 look to be numeric values.

``````# Train is pandas dataframe of training data

``````

Object columns in original data.

This is where our earlier data understanding comes into play. For these three columns, some entries are “yes” and some are “no” while the rest are floats. We did our background research and thus know that a “yes” means 1 and a “no” means 0. Using this information, we can correct the values and then visualize the variable distributions colored by the label.

Distribution of corrected variables by the target label.

This is a great example of data exploration and cleaning going hand in hand. We find something incorrect with the data, fix it, and then explore the data to make sure our correction was appropriate.

#### Missing Values

A critical data cleaning operation for this data is handling missing values. To calculate the total and percent of missing values is simple in Pandas:

``````import pandas as pd
# Number of missing in each column
missing = pd.DataFrame(data.isnull().sum()).rename(columns = {0: 'total'})
# Create a percentage missing
missing['percent'] = missing['total'] / len(data)

``````

Missing values in data.

In some cases there are reasons for missing values: the `v2a1` column represents monthly rent and many of the missing values are because the household owns the home. To figure this out, we can subset the data to houses missing the rent payment and then plot the `tipo_` variables (I’m not sure where these column names come from) which show home ownership.

Home ownership status for those households with no rent payments.

Based on the plot, the solution is to fill in the missing rent payments for households that own their house with 0 and leave the others to be imputed. We also add a boolean column that indicates if the rent payment was missing.

The other missing values in the columns are dealt with the same way: using knowledge from other columns or about the problem to fill in the values, or leaving them to be imputed. Adding a boolean column to indicate missing values can also be useful because sometimes the *information that a value was missing *is important. Another crucial point to note is that for missing values, we often want to think about using information in other columns to fill in missing values such as we did with the rent payment.

Once we’ve handled the missing values, anomalies, and incorrect data types, we can move on to feature engineering. I usually view data exploration as an ongoing process rather than one set chunk. For example, as we get into feature engineering, we might want to explore the new variables we create.

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.### Feature Engineering

If you follow my work, you’ll know I’m convinced automated feature engineering — with domain expertise — will take the place of traditional manual feature engineering. For this problem, I took both approaches, doing mostly manual work in the main notebook, and then writing another notebook with automated feature engineering. Not surprisingly, the automated feature engineering took one tenth the time and achieved better performance! Here I’ll show the manual version, but keep in mind that automated feature engineering (with Featuretools) is a great tool to learn.

In this problem, our primary objective for feature engineering is to aggregate all the individual level data at the household level. That means grouping together the individuals from one house and then calculating statistics such as the maximum age, the average level of education, or the total number of cellphones owned by the family.

Fortunately, once we have separated out the individual data (into the `ind` dataframe), doing these aggregations is literally one line in Pandas (with `idhogar` the household identifier used for grouping):

``````# Aggregate individual data for each household
ind_agg = ind.groupby('idhogar').agg(['min', 'max', 'mean', 'sum'])

``````

After renaming the columns, we have a lot of features that look like:

Features produced by aggregation of individual data.

The benefit of this method is that it quickly creates many features. One of the drawbacks is that many of these features might not be useful or are highly correlated (called collinear) which is why we need to use feature selection.

An alternative method to aggregations is to calculate features one at a time using domain knowledge based on what features might be useful for predicting poverty. For example, in the household data, we create a feature called `warning` which adds up a number of household “warning signs” ( `house` is a dataframe of the household variables):

``````# No toilet, no electricity, no floor, no water service, no ceiling
house['warning'] = 1 * (house['sanitario1'] +
(house['elec'] == 0) +
house['pisonotiene'] +
house['abastaguano'] +
(house['cielorazo'] == 0))

``````

Violinplot of Target by Warning Value.

We can also calculate “per capita” features by dividing one value by another ( `tamviv` is the number of household members):

``````# Per capita features for household data
house['phones-per-capita'] = house['qmobilephone'] / house['tamviv']
house['tablets-per-capita'] = house['v18q1'] / house['tamviv']
house['rooms-per-capita'] = house['rooms'] / house['tamviv']
house['rent-per-capita'] = house['v2a1'] / house['tamviv']

``````

When it comes to manual vs automated feature engineering, I think the optimal answer is a blend of both. As humans, we are limited in the features we build both by creativity — there are only so many features we can think to make — and time — there is only so much time for us to write the code. We can make a few informed features like those above by hand, but where automated feature engineering excels is when doing aggregations that can automatically build on top of other features.

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.
(Featuretools is the most advanced open-source Python library for automated feature engineering. Here’s an article to get you started in about 10 minutes.)

#### Feature Selection

Once we have exhausted our time or patience making features, we apply feature selection to remove some features, trying to keep only those that are useful for the problem. “Useful” has no set definition, but there are some heuristics (rules of thumb) that we use to select features.

One method is by determining correlations between features. Two variables that are highly correlated with one another are called collinear. These are a problem in machine learning because they slow down training, create less interpretable models, and can decrease model performance by causing overfitting on the training data.

The tricky part about removing correlated features is determining the threshold of correlation for saying that two variables are too correlated. I generally try to stay conservative, using a correlation coefficient in the 0.95 or above range. Once we decide on a threshold, we use the below code to remove one out of every pair of variables with a correlation above this value:

``````import numpy as np
threshold = 0.95
# Create correlation matrix
corr_matrix = data.corr()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(abs(upper[column]) > threshold)]
data = data.drop(columns = to_drop)

``````

We are only removing features that are correlated with one another. We want features that are correlated with the target(although a correlation of greater than 0.95 with the label would be too good to be true)!

There are many methods for feature selection (we’ll see another one in the experimental section near the end of the article). These can be univariate — measuring one variable at a time against the target — or multivariate — assessing the effects of multiple features. I also tend to use model-based feature importances for feature selection, such as those from a random forest.

After feature selection, we can do some exploration of our final set of variables, including making a correlation heatmap and a pairsplot.

Correlation heatmap (left) and pairsplot colored by the value of the label (right).

One point we get from the exploration is the relationship between education and poverty: as the education of a household increases (both the average and the maximum), the severity of poverty tends to decreases (1 is most severe):

Max schooling of the house by target value.

On the other hand, as the level of overcrowding — the number of people per room — increases, the severity of the poverty increases:

Household overcrowding by value of the target.

These are two actionable insights from this competition, even before we get to the machine learning: households with greater levels of education tend to have less severe poverty, and households with more people per room tend to have greater levels of poverty. I like to think about the ramifications and larger picture of a data science project in addition to the technical aspects. It can be easy to get overwhelmed with the details and then forget the overall reason you’re working on this problem.

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.### Model Comparison

The following graph is one of my favorite results in machine learning: it displays the performance of machine learning models on many datasets, with the percentages showing how many times a particular method beat any others. (This is from a highly readable paper by Randal Olson.)

Comparison of many algorithms on 165 datasets.

What this shows is that there are some problems where even a simple Logistic Regression will beat a Random Forest or Gradient Boosting Machine. Although the Gradient Tree Boosting model generally works the best, it’s not a given that it will come out on top. Therefore, when we approach a new problem, the best practice is to try out several different algorithms rather than always relying on the same one. I’ve gotten stuck using the same model (random forest) before, but remember that no one model is always the best.

Fortunately, with Scikit-Learn, it’s easy to evaluate many machine learning models using the same syntax. While we won’t do hyperparameter tuning for each one, we can compare the models with the default hyperparameters in order to select the most promising model for optimization.

In the notebook, we try out six models spanning the range of complexity from simple — Gaussian Naive Bayes — to complex — Random Forest and Gradient Boosting Machine. Although Scikit-Learn does have a GBM implementation, it’s fairly slow and a better option is to use one of the dedicated libraries such as XGBoost or LightGBM. For this notebook, I used Light GBM and choose the hyperparameters based on what have worked well in the past.

To compare models, we calculate the cross validation performance on the training data over 5 or 10 folds. We want to use the training data because the testing data is only meant to be used once as an estimate of the performance of our final model on new data. The following plot shows the model comparison. The height of the bar is the average Macro F1 score over the folds recorded by the model and the black bar is the standard deviation:

Model cross validation comparison results.

(To see an explanation of the names, refer to the notebook. RF stands for Random Forest and GBM is Gradient Boosting Machine with SEL representing the feature set after feature selection). While this isn’t entirely a level comparison — I did not use the default hyperparameters for the Gradient Boosting Machine — the general results hold: the GBM is the best model by a large margin. This reflects the findings of most other data scientists.

Notice that we cross-validated the data before and after feature selection to see its effect on performance. Machine learning is still largely an empirical field, and the only way to know if a method is effective is to try it out and then measure performance. It’s important to test out different choices for the steps in the pipeline — such as the correlation threshold for feature selection — to determine if they help. Keep in mind that we also want to avoid placing too much weight on cross-validation results, because even with many folds, we can still be overfitting to the training data. Finally, even though the GBM was best for this dataset, that will not always be the case!

Based on these results, we can choose the gradient boosting machine as our model (remember this is a decision we can go back and revise!). Once we decide on a model, the next step is to get the most out of it, a process known as model hyperparameter optimization.

Recognizing that not everyone has time for a 30-minute article (even on data science) in one sitting, I’ve broken this up into two parts. The second part covers model optimization, interpretation, and an experimental section.

Decision tree visualization from part two.

### Conclusions

By this point, we can see how all the different parts of machine learning come together to form a solution: we first had to understand the problem, then we dug into the data, cleaning it as necessary, then we made features for a machine learning model, and finally we evaluated several different models.

We’ve covered many techniques and have a decent model (although the F1 score is relatively low, it places in the top 50 models submitted to the competition). Nonetheless, we still have a few steps left: through optimization, we can improve our model, and then we have to interpret our results because no analysis is complete until we’ve communicated our work.

A “Data Science for Good” Machine Learning Project Walk-Through in Python: Part Two: Getting the most from our model, figuring out what it all means, and experimenting with new techniques

Machine learning is a powerful framework that from the outside may look complex and intimidating. However, once we break down a problem into its component steps, we see that machine learning is really only a sequence of understandable processes, each one simple by itself.

In the first half of this series, we saw how we could implement a solution to a “data science for good” machine learning problem, leaving off after we had selected the Gradient Boosting Machine as our model of choice.

Model evaluation results from part one.

In this article, we’ll continue with our pipeline for predicting poverty in Costa Rica, performing model optimizing, interpreting the model, and trying out some experimental techniques.

The full code is available as a Jupyter Notebook both on Kaggle (where it can be run in the browser with no downloads required) and on GitHub. This is an active Kaggle competition and a great project to get started with machine learning or to work on some new skills.

### Model Optimization

Model optimization means searching for the model hyperparameters that yield the best performance — measured in cross-validation — for a given dataset. Because the optimal hyperparameters vary depending on the data, we have to optimize — also known as tuning — the model for our data. I like to think of tuning as finding the best settings for a machine learning model.

There are 4 main methods for tuning, ranked from least efficient (manual) to most efficient (automated).

1. Understand the problem and data descriptions
2. Data cleaning / exploratory data analysis
3. Feature engineering / feature selection
4. Model comparison
5. Model optimization
6. Interpretation of results

Naturally, we’ll skip the first three methods and move right to the most efficient: automated hyperparameter tuning. For this implementation, we can use the Hyperopt library, which does optimization using a version of Bayesian Optimization with the Tree Parzen Estimator. You don’t need to understand these terms to use the model, although I did write a conceptual explanation here. (I also wrote an article for using Hyperopt for model tuning here.)

The details are a little protracted (see the notebook), but we need 4 parts for implementing Bayesian Optimization in Hyperopt

1. Understand the problem and data descriptions
2. Data cleaning / exploratory data analysis
3. Feature engineering / feature selection
4. Model comparison
5. Model optimization
6. Interpretation of results

The basic idea of Bayesian Optimization (BO) is that the algorithm reasons from the past results — how well previous hyperparameters have scored — and then chooses the *next *combination of values it thinks will do best. Grid or random search are *uninformed *methods that don’t use past results and the idea is that by reasoning, BO can find better values in fewer search iterations.

See the notebook for the complete implementation, but below are the optimization scores plotted over 100 search iterations.

Model optimization scores versus iteration.

Unlike in random search where the scores are, well random over time, in Bayesian Optimization, the scores tend to improve over time as the algorithm learns a probability model of the best hyperparameters. The idea of Bayesian Optimization is that we can optimize our model (or any function) quicker by focusing the search on promising settings. Once the optimization has finished running, we can use the best hyperparameters to cross validate the model.

Optimizing the model will not always improve our test score because we are optimizing for the *training *data. However, sometimes it can deliver a large benefit compared to the default hyperparameters. In this case, the final cross validation results are shown below in dataframe form:

Cross validation results. Models without 10Fold in name were validated with 5 folds. SEL is selected features.

The optimized model (denoted by OPT and using 10 cross validation folds with the features after selection) places right in the middle of the non-optimized variations of the Gradient Boosting Machine (which used hyperparameters I had found worked well for previous problems.) This indicates we haven’t found the optimal hyperparameters yet, or there could be multiple sets of hyperparameters that performly roughly the same.

We can continue optimization to try and find even better hyperparameters, but usually the return to hyperparameter tuning is much less than the return to feature engineering. At this point we have a relatively high-performing model and we can use this model to make predictions on the test data. Then, since this is a Kaggle competition, we can submit the predictions to the leaderboard. Doing this gets us into the top 50 (at the moment) which is a nice vindication of all our hard work!

At this point, we have implemented a complete solution to this machine learning problem. Our model can make reasonably accurate predictions of poverty in Costa Rican households (the F1 score is relatively low, but this is a difficult problem). Now, we can move on to interpreting our predictions and see if our model can teach us anything about the problem. Even though we have a solution, we don’t want to lose sight of why our solution matters.

The very nature of machine learning competitions can encourage bad practices, such as the mistake of optimizing for the leaderboard score at the cost of all other considerations. Generally this leads to using ever more complex models to eke out a tiny performance gain.

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.
A simple model that is put in use is better than a complex model which can never be deployed. Moreover, those at the top of the leaderboard are probably overfitting to the testing data and do not have a robust model.
It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.### Interpret Model Results

In the midst of writing all the machine learning code, it can be easy to lose sight of the important questions: what are we making this model for? What will be the impact of our predictions? Thankfully, our answer this time isn’t “increasing ad revenue” but, instead, effectively predicting which households are most at risk for poverty in Costa Rica so they can receive needed help.

To try and get a sense of our model’s output, we can examine the prediction of poverty levels on a household basis for the test data. For the test data, we don’t know the true answers, but we can compare the relative frequency of each predicted class with that in the training labels. The image below shows the training distribution of poverty on the left, and the predicted distribution for the testing data on the right:

Training label distribution (left) and predicted test distribution (right). Both histograms are normalized.

Intriguingly, even though the label “not vulnerable” is most prevalent in the training data, it is represented less often on a relative basis for the predictions. Our model predicts a higher proportion of the other 3 classes, which means that it thinks there is more severe poverty in the testing data. If we convert these fractions to numbers, we have 3929 households in the “non vulnerable” category and 771 households in the “extreme” category.

Another way to look at the predictions is by the confidence of the model. For each prediction on the test data, we can see not only the label, but also the probability given to it by the model. Let’s take a look at the confidence by the value of the label in a boxplot.

Boxplot of probability assigned to each label on testing data.

These results are fairly intuitive — our model is most confident in the most extreme predictions — and less confident in the moderate ones. Theoretically, there should be more separation between the most extreme labels and the targets in the middle should be more difficult to tease apart.

Another point to draw from this graph is that overall, our model is not very sure of the predictions. A guess with no data would place 0.25 probability on each class, and we can see that even for the least extreme poverty, our model rarely has more than 40% confidence. What this tells us is this is a tough problem — there is not much to separate the classes in the available data.

Ideally, these predictions, or those from the winning model in the competition, will be used to determine which families are most likely to need assistance. However, just the predictions alone do not tell us what may lead to the poverty or how our model “thinks”. While we can’t completely solve this problem yet, we can try to peer into the black box of machine learning.

In a tree-based model — such as the Gradient Boosting Machine — the feature importances represent the sum total reduction in gini impurity for nodes split on a feature. I never find the absolute values very helpful, but instead normalize the numbers and look at them on a relative basis. For example, below are the 10 most important features from the optimized GBM model.

Most important features from optimized gradient boosting machine.

Here we can see education and ages of family members making up the bulk of the most important features. Looking further into the importances, we also see the size of the family. This echoes findings by poverty researchers: family size is correlated to more extreme poverty, and education level is *inversely *correlated with poverty. In both cases, we don’t necessarily know which causes which, but we can use this information to highlight which factors should be further studied. Hopefully, this data can then be used to further reduce poverty (which has been decreasing steadily for the last 25 years).

It’s true: the world is better now than ever and still improving (source).

In addition to potentially helping researchers, we can use the feature importances for further feature engineering by trying to build more features on top of these. An example using the above results would be taking the `meaneduc` and dividing by the `dependency` to create a new feature. While this may not be intuitive, it’s hard to tell ahead of time what will work for a model.

An alternative method to using the testing data to examine our model is to split the training data into a smaller training set and a validation set. Because we have the labels for all the training data, we can compare our predictions on the holdout validation data to the true values. For example, using 1000 observations for validation, we get the following confusion matrix:

Confusion matrix on validation data.

The values on the diagonal are those the model *predicted correctly *because the predicted label is the same as the true label. Anything off the diagonal the model predicted incorrectly. We can see that our model is the best at identifying the non-vulnerable households, but is not very good at discerning the other labels.

As one example, our model incorrectly classifies 18 households as non-vulnerable which are in fact in extreme poverty. Predictions like these have real-world consequences because those might be families that as a result of this model, would not receive help. (For more on the consequences of incorrect algorithms, see Weapons of Math Destruction.)

Overall, this mediocre performance — the model accuracy is about 60% which is much better than random guessing but not exceptional — suggests this problem may be difficult. It could be there is not enough information to separate the classes within the available data.

One recommendation for the host organization — the Inter-American Development Bank — is that we need more data to better solve this problem. That could come either in the form of more features — so more questions on the survey — or more observations — a greater number of households surveyed. Either of these would require a significant effort, but the best return to time invested in a data science project is generally by gathering greater quantities of high-quality labeled data.

There are other methods we can use for model understanding, such as Local Interpretable Model-agnostic Explainer (LIME), which uses a simpler linear model to approximate the model around a prediction. We can also look at individual decision trees in a forest which are typically straightforward to parse because they essentially mimic a human decision making process.

Individual Decision Tree in Random Forest.

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.

### Exploratory Techniques

We’ve already solved the machine learning problem with a standard toolbox, so why go further into exploratory techniques? Well, if you’re like me, then you enjoy learning new things just for the sake of learning. What’s more, the exploratory techniques of today will be the standard tools of tomorrow.

For this project, I decided to try out two new (to me) techniques:

• Supervised: given the labels for the training data
• Multi-Class Classification: labels are discrete with more than 2 values

#### Recursive Feature Elimination

Recursive feature elimination is a method for feature selection that uses a model’s feature importances — a random forest for this application — to select features. The process is a repeated method: at each iteration, the least important features are removed. The optimal number of features to keep is determined by cross validation on the training data.

Recursive feature elimination is simple to use with Scikit-Learn’s RFECV method. This method builds on an estimator (a model) and then is `fit` like any other Scikit-Learn method. The `scorer` part is required in order to make a custom scoring metric using the Macro F1 score.

``````from sklearn.metrics import f1_score, make_scorer
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
# Custom scorer for cross validation
scorer = make_scorer(f1_score, greater_is_better=True, average = 'macro')
# Create a model for feature selection
estimator = RandomForestClassifier(n_estimators = 100, n_jobs = -1)
# Create the object
selector = RFECV(estimator, step = 1, cv = 3,
scoring= scorer, n_jobs = -1)
# Fit on training data
selector.fit(train, train_labels)
# Transform data
train_selected = selector.transform(train)
test_selected = selector.transform(test)

``````

While I’ve used feature importances for selection before, I’d never implemented the Recursive Feature Elimination method, and as usual, was pleasantly surprised at how easy this was to do in Python. The RFECV method selected 58 out of around 190 features based on the cross validation scores:

Recursive Feature Elimination Scores.

The selected set of features were then tried out to compare the cross validation performance with the original set of features. (The final results are presented after the next section). Given the ease of using this method, I think it’s a good tool to have in your skill set for modeling. Like any other Scikit-Learn operation, it can fit into a `[Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html "Pipeline")`, allowing you to quickly execute a complete series of preprocessing and modeling operations.

#### Dimension Reduction for Visualization

There are a number of unsupervised methods in machine learning for dimension reduction. These fall into two general categories:

• Supervised: given the labels for the training data
• Multi-Class Classification: labels are discrete with more than 2 values

Typically, PCA (Principal Components Analysis) and ICA (Independent Components Analysis) are used both for visualization and as a preprocessing step for machine learning, while manifold methods like t-SNE (t-Distributed Stochastic Neighbors Embedding) are used only for visualization because they are highly dependent on hyperparameters and do not preserve distances within the data. (In Scikit-Learn, the t-SNE implementation does not have a `transform` method which means we can’t use it for modeling).

A new entry on the dimension reduction scene is UMAP: Uniform Manifold Approximation and Projection. It aims to map the data to a low-dimensional manifold — so it’s an embedding technique, while simultaneously preserving global structure in the data. Although the math behind it is rigorous, it can be used like an Scikit-Learn method with a `[fit](https://github.com/lmcinnes/umap "fit")` and `[transform](https://github.com/lmcinnes/umap "transform")` call.

I wanted to try these methods for both dimension reduction for visualization, and to add the reduced components as *additional features. *While this use case might not be typical, there’s no harm in experimenting! Below shows the code for using UMAP to create embeddings of both the train and testing data.

``````import umap as UMAP
n_components = 3
# Use default parameters
umap = UMAP(n_components=n_components)
# Fit and transform
train_reduced = umap.fit_transform(train)
test_reduced = umap.transform(test)

``````

The application of the other three methods is exactly the same (except TSNE which cannot be used to `transform` the testing data). After completing the transformations, we can visualize the reduced training features in 3 dimensions, with the points colored by the value of the target:

Dimension Reduction Visualizations

None of the methods cleanly separates the data based on the label which follows the findings of other data scientists. As we discovered earlier, it may be that this problem is difficult considering the data to which we have access. Although these graphs cannot be used to say whether or not we can solve a problem, if there is a clean separation, then it indicates that there is *something *in the data that would allow a model to easily discern each class.

As a final step, we can add the reduced features to the set of features after applying feature selection to see if they are useful for modeling. (Usually dimension reduction is applied and then the model is trained on just the reduced dimensions). The performance of every single model is shown below:

FInal model comparison results.

The model using the dimension reduction features has the suffix DR while the number of folds following the GBM refers to the number of cross validation folds. Overall, we can see that the selected set of features (SEL) does slightly better, and adding in the dimension reduction features hurts the model performance! It’s difficult to conclude too much from these results given the large standard deviations, but we *can say *that the Gradient Boosting Machine significantly outperforms all other models and the feature selection process improves the cross validation performance.

The experimental part of this notebook was probably the most enjoyable for me. It’s not only important to always be learning to stay ahead in the data science field, but it’s also enjoyable for the sake of learning something new.

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.### Next Steps

Despite this exhaustive coverage of machine learning tools, we have not yet reached the end of methods to apply to this problem!

Some additional steps we could take are:

1. Understand the problem and data descriptions
2. Data cleaning / exploratory data analysis
3. Feature engineering / feature selection
4. Model comparison
5. Model optimization
6. Interpretation of results

The great part about a Kaggle competition is you can read about many of these cutting-edge techniques in other data scientists’ notebooks. Moreover, these contests give us realistic datasets in a non-mission-critical setting, which is a perfect environment for experimentation.

It turns out the same skills used by companies to maximize ad views can also be used to help relieve human suffering.
As one example of the ability of competitions to better machine learning methods, the ImageNet Large Scale Visual Recognition Challenge led to significant improvements in convolutional neural networks.

Imagenet Competitions have led to state-of-the-art convolutional neural networks.

### Conclusions

Data science and machine learning are not incomprehensible methods: instead, they are sequences of straightforward steps that combine into a powerful solution. By walking through a problem one step at a time, we can learn how to build the entire framework. How we use this framework is ultimately up to us. We don’t have to dedicate our lives to helping others, but it is rewarding to take on a challenge with a deeper meaning.

In this article, we saw how we could apply a complete machine learning solution to a data science for good problem, building a machine learning model to predict poverty levels in Costa Rica.

Our approach followed a sequence of processes (1–4 were in part one):

1. Understand the problem and data descriptions
2. Data cleaning / exploratory data analysis
3. Feature engineering / feature selection
4. Model comparison
5. Model optimization
6. Interpretation of results

Finally, if after all that you still haven’t got your fill of data science, you can move on to exploratory techniques and learn something new!

As with any process, you’ll only improve as you practice. Competitions are valuable for the opportunities they provide us to employ and develop skills. Moveover, they encourage discussion, innovation, and collaboration, leading both to more capable individual data scientists and a better community. Through this data science project, we not only improve our skills, but also make an effort to improve outcomes for our fellow humans.