Churn prediction of bank customers from EDA to model evaluation.

Churn prediction is a common machine learning use case. If you are not familiar with the term, churn means “leaving the company”. It is critical for a business to understand why and when customers are likely to churn. A robust and accurate churn prediction model helps a business take action before customers leave.

In this post, we aim to train a supervised learning model for a classification task. The goal is to predict whether a customer will churn (i.e. Exited = 1) using the provided features. The dataset is available here on Kaggle.

The first step is to read the dataset into a pandas dataframe.

import pandas as pd
import numpy as np

df_churn = pd.read_csv("/content/Churn_Modelling.csv")
df_churn.shape
(10000, 14)
df_churn.columns
Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

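The dataset contains 10,000 customers (i.e. rows) and 14 columns describing the customers and their products at a bank. One of these columns, “Exited”, is the target we want to predict; the others are candidate features.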


Exploring the Data

There are a few redundant features. The “RowNumber” column is just an index, and the “CustomerId” and “Surname” columns carry no useful signal for a machine learning model: the last name or ID of a customer tells us nothing about customer churn. We should drop them so that they do not add unnecessary computational burden to the model.

df_churn.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

df_churn.head()

Let’s also check whether there are any missing values in the dataset.

df_churn.isna().sum()

This dataset does not have any missing values, which is not typical of real-life datasets. Handling missing values is an important part of the machine learning pipeline. If there are very few missing values compared to the size of the dataset, we may choose to drop the rows that contain them. Otherwise, it is better to replace them with appropriate values. The pandas **fillna** function can be used for this task.
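As a minimal sketch of the two options, the snippet below uses a small, made-up DataFrame (df_toy and its values are hypothetical, since the churn dataset itself has no missing values). It fills with fixed placeholder values rather than column statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real churn dataset has no missing values.
df_toy = pd.DataFrame({
    "CreditScore": [619.0, np.nan, 502.0, 699.0],
    "Geography": ["France", "Spain", None, "France"],
})

# Option 1: drop every row that contains at least one missing value.
df_dropped = df_toy.dropna()

# Option 2: fill missing values column by column with fixed placeholders.
df_filled = df_toy.fillna({"CreditScore": 0.0, "Geography": "Unknown"})

print(df_dropped)
print(df_filled)
```

Which placeholder is appropriate depends on the column; the values 0.0 and "Unknown" here are only for illustration.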

Important note: If you choose to impute missing values based on the non-missing values in a column (e.g. fill missing values with the mean value of a column), you should do it after splitting your dataset into train and test subsets. Otherwise, you leak information to the machine learning model from the test set, which is supposed to represent new, previously unseen data.
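The order of operations can be sketched as follows. This is an illustration on made-up data, not the split used later in the post: df_toy, the 80/20 ratio, and the random seed are all assumptions, and the split is done with plain pandas to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing balances.
df_toy = pd.DataFrame({
    "Balance": [0.0, 1000.0, np.nan, 3000.0, 2000.0,
                np.nan, 4000.0, 500.0, 1500.0, 2500.0],
    "Exited": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Step 1: split first (80% train, 20% test).
train = df_toy.sample(frac=0.8, random_state=42)
test = df_toy.drop(train.index)

# Step 2: compute the fill value from the training rows only.
train_mean = train["Balance"].mean()

# Step 3: apply the same training-derived value to both subsets.
train = train.assign(Balance=train["Balance"].fillna(train_mean))
test = test.assign(Balance=test["Balance"].fillna(train_mean))
```

Because the mean is computed from the training rows alone, no statistic of the test rows influences the values the model is trained on.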

We should also make sure the data is stored with appropriate data types. For instance, numerical values should not be stored as “object”. The **dtypes** attribute returns the data type of each column.

df_churn.dtypes
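If a numerical column did turn out to be stored as text, one way to fix it is pandas’ to_numeric function. The sketch below uses a made-up DataFrame (df_toy is hypothetical) in which ages were read in as strings.

```python
import pandas as pd

# Hypothetical toy data: "Age" holds numbers that were read in as strings.
df_toy = pd.DataFrame({
    "Age": ["42", "41", "39"],
    "Gender": ["Female", "Male", "Female"],
})

print(df_toy.dtypes)  # "Age" is not numeric yet

# Convert the string column to a numeric dtype.
df_toy["Age"] = pd.to_numeric(df_toy["Age"])

print(df_toy.dtypes)  # "Age" is now an integer column
```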
