Predicting Ebola outbreaks in Sierra Leone


This article was written by Nicolas Diaz, MPP ’20 at Harvard Kennedy School, based on a project with Tze Ni Yeoh, Abdulla Saif, Lucas Kitzmüller and Victor Sheng.

A GitHub repository with the Jupyter notebooks from this project is available here.


Supervised learning is one of the most widely used forms of machine learning in the world. This article guides you through some of the most basic steps needed to build a model: importing your data, looking at it, putting it in a consistent format, using a sample from the data to train and test an algorithm and optimizing its parameters.

The code is displayed in Python, a versatile programming language.

The challenge that we have chosen as an example is to predict the spread of Ebola by region during the 2014–2016 outbreak in Sierra Leone.

Problem Motivation and Goals

The Western African Ebola virus epidemic (2014–2016) caused 11,325 deaths and major socioeconomic disruption, with a majority of fatalities taking place in the coastal nation of Sierra Leone. During the outbreak, national authorities had enough resources to isolate and treat all reported cases and stop further transmission of the virus. Unanticipated local variation in the number of cases, however, left certain districts with insufficient response capacity (WHO Situation Report, 2014).

A complementary tool is needed to inform the allocation of resources across districts and to increase response effectiveness in current and future Ebola emergencies.

Data cleaning

Importing the appropriate libraries

First, we import the libraries that we will use throughout this project. Pandas is the main dataframe manipulation library in Python, and scikit-learn (imported as sklearn) offers an intuitive way to run machine learning models. For Pandas, we change the display defaults with set_option so that large tables are shown more comfortably.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

from sklearn.metrics import mean_squared_error, median_absolute_error, r2_score, mean_absolute_error, accuracy_score, roc_curve, roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
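To confirm that the display settings took effect, or to change one only temporarily, something like the following sketch may help. It relies on pd.get_option and pd.option_context; the small dataframe and its column names are made up for illustration and are not the project's data.

```python
import pandas as pd

# Apply the same display settings used in this article.
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Read the settings back to confirm they were applied.
max_rows = pd.get_option('display.max_rows')
max_columns = pd.get_option('display.max_columns')

# Made-up table with 100 rows, purely for illustration.
demo = pd.DataFrame({'week': range(100), 'cases': range(100)})

# option_context changes a setting only inside the with-block,
# which is handy when a single table needs a shorter display.
with pd.option_context('display.max_rows', 10):
    rows_inside = pd.get_option('display.max_rows')
    short_view = repr(demo)  # truncated text representation

# Outside the block the earlier setting is restored.
rows_after = pd.get_option('display.max_rows')
print(short_view)
```

The settings made with set_option last for the whole session, so the context manager is the gentler choice when only one output needs different limits.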

Inspecting and understanding the data

Once our libraries are imported, we load the data with the read_csv() function of Pandas. A few attributes and methods allow for a quick inspection of the dataframe: shape, head(), tail() and info().

df = pd.read_csv('sl_ebola.csv')
df.shape


df.head(14)


df.tail()


df.info()

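Since the inspection steps above depend on a file that may not be at hand, here is a minimal sketch of the same calls on a tiny made-up table. Only the column names Country, District and 'Ebola measure' come from this article; the 'Date' and 'Value' columns and every value in the table are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up CSV text standing in for 'sl_ebola.csv'.
# 'Date', 'Value' and all row values are hypothetical.
csv_text = """Country,District,Ebola measure,Date,Value
Sierra Leone,Bo,Confirmed cases,2014-09-01,12
Sierra Leone,Bo,Probable cases,2014-09-01,3
Sierra Leone,Kenema,Confirmed cases,2014-09-01,20
Sierra Leone,Kenema,Probable cases,2014-09-01,
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# shape is an attribute (no parentheses): (rows, columns).
n_rows, n_cols = toy.shape

# head() and tail() return the first and last rows as new dataframes.
first_rows = toy.head(2)
last_rows = toy.tail(2)

# info() is a method: it prints dtypes and non-null counts per column.
toy.info()

# Count missing values per column to complement the info() summary.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Note the difference between shape, which is read without parentheses, and info(), which has to be called to print its summary.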

Looking at the variables above, we want to dig deeper into some of the columns. Applying value_counts() to a specific column returns how many rows fall into each of its categories.

print(df.Country.value_counts(), '\n')
print(df.District.value_counts(), '\n')
print(df['Ebola measure'].value_counts(), '\n')
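As a rough illustration of what value_counts() returns, the sketch below runs it on a small made-up table. The rows and the measure labels are hypothetical; only the column names District and 'Ebola measure' are taken from this article.

```python
import pandas as pd

# Made-up rows for illustration; one district is deliberately missing.
toy = pd.DataFrame({
    'District': ['Bo', 'Bo', 'Kenema', 'Kenema', 'Kenema', None],
    'Ebola measure': ['Confirmed cases', 'Probable cases',
                      'Confirmed cases', 'Confirmed cases',
                      'Probable cases', 'Confirmed cases'],
})

# Default behaviour: counts sorted in descending order, missing values excluded.
counts = toy['District'].value_counts()

# dropna=False adds a row for missing values, useful when checking data quality.
counts_with_missing = toy['District'].value_counts(dropna=False)

# normalize=True returns shares instead of raw counts.
shares = toy['Ebola measure'].value_counts(normalize=True)

print(counts, '\n')
print(counts_with_missing, '\n')
print(shares, '\n')
```

A column name containing a space, such as 'Ebola measure', has to be accessed with square brackets, because the dot notation used for df.Country only works for names that are valid Python identifiers.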
