You wrote all your queries, gathered all the data, and are all fired up to implement the latest machine learning algorithm you read about on Medium. Wait! You soon realize you need to deal with missing data, imputation, categorical variables, standardization, and more.

Instead of “manually” pre-processing data, you can start writing functions and data pipelines that you can apply to any dataset. Luckily for us, Python’s Scikit-Learn library has several classes that will make all of this a piece of cake!

In this article you will learn how to:

  • Reproduce transformations easily on any dataset.
  • Easily track all transformations you apply to your dataset.
  • Start building your library of transformations you can use later on different projects.

tl;dr: Let’s build a pipeline where we can impute, transform, scale, and encode like this:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data_pipeline = ColumnTransformer([
    ('numerical', num_pipeline, num_vars),
    ('categorical', OneHotEncoder(), cat_vars),
])
airbnb_processed = data_pipeline.fit_transform(airbnb_data)

Without knowing much, you can infer that different transformations are applied to numerical variables and categorical variables. Let’s go into the proverbial weeds and see how we end up with this pipeline.

Data

For all of these examples, I will be using the Airbnb NYC listings dataset from insideairbnb.com. This is a real dataset scraped from Airbnb, containing all the information related to each listing on the site.

Let us imagine we want to predict the price of a listing given some variables like the property type and neighborhood.

raw_data = pd.read_csv('http://data.insideairbnb.com/united-states/ny/new-york-city/2020-07-07/data/listings.csv.gz',
                        compression='gzip')

Let’s start by getting our categorical and numerical variables that we want to work with. We will keep it simple by removing data with missing values in our categorical variables of interest and with no reviews.
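As a sketch of that selection step (the column names here are hypothetical picks from the listings file; swap in whichever variables you care about):

```python
import pandas as pd

# Hypothetical column choices for this walkthrough.
cat_vars = ['property_type', 'neighbourhood_group_cleansed', 'room_type']
num_vars = ['accommodates', 'bedrooms', 'security_deposit',
            'number_of_reviews']

def select_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Keep listings with complete categoricals and at least one review."""
    kept = df.dropna(subset=cat_vars)           # drop rows missing a categorical
    kept = kept[kept['number_of_reviews'] > 0]  # drop listings with no reviews
    return kept[cat_vars + num_vars]

# airbnb_data = select_listings(raw_data)
```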

Imputation

Is a dataset even real if it isn’t missing data? The reality is that we have to deal with missing data all the time, and you will have to decide how to handle it for your specific use case:

  • You can dropna() rows with missing data. Might drop too much data.
  • Drop the variable that has missing data. What if you really want that variable?
  • Replace NAs with zero, the mean, median, or some other calculation.

Scikit-Learn provides us with a nice, simple class to deal with missing values: SimpleImputer.

Let us impute numerical variables such as price or security deposit with the median. For simplicity, we do this for all numerical variables.
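A minimal sketch of median imputation with SimpleImputer, on a small made-up array standing in for our numerical columns:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two made-up numerical columns (e.g. price, security deposit) with gaps.
X = np.array([[100.0, np.nan],
              [np.nan, 500.0],
              [250.0, 300.0],
              [80.0, 100.0]])

imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# The per-column medians learned by fit() are stored on the imputer:
print(imputer.statistics_)  # [100. 300.]
```

Calling fit_transform() learns the median of each column and fills every NaN with it; the same fitted imputer can then transform() new data using those stored medians.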


Pre-Process Data Like a Pro: Intro to Scikit-Learn Pipelines