Pre-processing and model training go hand in hand in machine learning: one cannot do without the other. The thing is, we humans interact with data we can understand, that is, data written in natural language, and we expect our machine learning models to take in the same data and give us insights. But machines only understand binary (0s and 1s), so there must be a way for them to make sense of this data. That’s where pre-processing comes in. Pre-processing is, at its core, transforming data from natural language into a form the machine can understand, a process also referred to as encoding.

So how do we do pre-processing and model training to get insights from data? There are many ways to pre-process data and train machine learning models, and I certainly don’t know all of them. In this post, we’ll look at some of the pre-processing methods and explore three machine learning algorithms we can use to train a model. We’ll cover:

Pre-Processing

  • Data Cleaning
  • Categorical features encoding using OneHotEncoder
  • Numerical features scaling using StandardScaler
  • Dimensionality reduction using PCA, t-SNE and Autoencoders
  • Balancing classes by oversampling
  • Feature Extraction

Model Training

  • Logistic Regression
  • Random Forests
  • Decision Trees

For this post, we will be using bank campaign data, which can be found here. In addition to the data, the source gives a full description of the features. In brief, our data consists of 20 features describing a customer, 10 categorical and 10 numerical, plus the target variable, which indicates whether a customer subscribed to a term deposit or not. The goal of this project is to predict which future customers will subscribe to a term deposit. For a full Exploratory Data Analysis of the dataset, you can check my Tableau Dashboard or go through the same on this GitHub repository. This post focuses only on pre-processing and model training.
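Before any pre-processing, we need the data in memory. Here is a minimal loading sketch, assuming the file is the semicolon-separated bank-additional-full.csv from the UCI source and that the target column is named y (both are assumptions based on the public version of this dataset):

```python
import pandas as pd

# Load the bank campaign data; the file name and the semicolon
# separator are assumptions based on the UCI bank marketing dataset.
dataset = pd.read_csv("bank-additional-full.csv", sep=";")

print(dataset.shape)                 # rows and columns
print(dataset["y"].value_counts())   # target: subscribed ("yes") or not ("no")
```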

Pre-Processing

Data Cleaning

Data cleaning involves checking for missing values in a dataset and either dropping the null rows or imputing the missing values, depending on how many there are and their significance in our data. It also involves looking for duplicates and dropping them, as they can significantly affect the effectiveness of the model, and checking for outliers and replacing them with the median or mean depending on how frequently they occur (see the sketch below). These are just a few of the data cleaning techniques we will explore in this post.
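Here is a short sketch of the duplicate and outlier handling described above. The 1.5 × IQR rule and the age column are illustrative choices on my part, not the only option:

```python
# Drop exact duplicate rows, which can bias the model
dataset = dataset.drop_duplicates()

# One common outlier treatment: replace values falling outside
# 1.5 * IQR with the column median (shown for the "age" column
# as an example; any numerical feature works the same way)
q1, q3 = dataset["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
median = dataset["age"].median()
dataset.loc[(dataset["age"] < lower) | (dataset["age"] > upper), "age"] = median
```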

Missing Values

We can check for missing values in our data by calling dataset.info() and inspecting all the features in our dataset, or by simply running dataset.isnull().values.any(), which returns True if there are any null values in our dataset and False if there are none. We can then decide to drop the affected rows, or impute categorical features with the mode and numerical features with the mean. Our bank dataset has no missing values, so we can proceed to the next step.
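The checks above in code, together with the mode/mean imputation we would fall back on if any values were missing (a sketch only, since our dataset needs neither):

```python
# Quick check for missing values
print(dataset.isnull().values.any())   # False for this dataset

# If there were any, we could impute instead of dropping rows:
for col in dataset.columns:
    if dataset[col].dtype == "object":    # categorical feature -> mode
        dataset[col] = dataset[col].fillna(dataset[col].mode()[0])
    else:                                 # numerical feature -> mean
        dataset[col] = dataset[col].fillna(dataset[col].mean())
```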
