Classification Framework for Imbalanced Data.

Classification Framework for Imbalanced Data.

Understanding and utilizing imbalanced data.This blog covers the steps involved in tackling a classification problem in imbalanced dataset. The Github repository containing all the code is available here.

Introduction

Classification is a type of supervised learning in Machine Learning that deals with categorizing data into classes. Supervised learning implies models that take input with their matched output to train a model which can later make useful predictions on a new set of data with no output. Some examples of classification problems include: _spam detection in mails, subscription analysis, hand written digit recognition, survival predictions _et.c. They all involve the use of classifiers that utilize the train data to understand how the input variables relate to the output (target) variable.

Image for post

Photo by Mitchell Schwartz on Unsplash

*Class imbalance *refers to a problem in classification where the distribution of the classes are skewed. This can range from a slight to an extreme imbalance.

This is a problem because most classification algorithms will have a low prediction accuracy towards the minority class because they run with the assumption that there’s balance between the classes.

An example of class imbalance is in credit card fraud detection. The classes, in this case, are fraud and not fraud. Most of the transactions are not fraud, thus the fraud classes are the minority class. Having a low accuracy in the minority class predictions is problematic because it is the most important class.

This blog covers the steps involved in tackling a classification problem in imbalanced dataset. The Github repository containing all the code is available here.

Dataset

Data used is from the UCI Machine Learning Repository. The data is related with marketing campaigns of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

An effective model can help increase campaign efficiency as more efforts can be directed towards customers with high subscription chances.

Sample of the data:

Image for post

Visualizing the target variable(y) in order to observe the class imbalance:

Image for post

The circles’ sizes represents the value counts of each class. Clearly there’s extreme class imbalance. This will be taken care of in the preprocessing section.

Preprocessing

Preprocessing involves the following steps:

Imputing null values

Missing values need to be handled because they can lead to wrong predictions and can also cause a high bias for any given model being used.

The categorical features will be imputed with the column mode, the discrete numerical features with the column median and the continuous numerical features with the column mean.

Treating outliers

Outliers are a problem for many machine learning algorithms because they can cause missing of important insights or contort real results which eventually results to less accurate models.

The outliers will be clipped with the 10th and 90th percentiles.

Feature Generation

Generating new features from already existing features adds new information to be accessible during the model training and therefore increasing model accuracy.

Scaling numerical variables

Standardizing numerical features with a StandardScaler to remove the differences brought about by different units of measurement.

class-imbalance classification modeling data-preprocessing resampling data analytic

Bootstrap 5 Complete Course with Examples

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Building a simple Applications with Vue 3

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Big Data Analytics: Unrefined Data to Smarter Business Insights - TopDevelopers.co

For Big Data Analytics, the challenges faced by businesses are unique and so will be the solution required to help access the full potential of Big Data.

IOT Data Analytics: Data Model

IOT Data Analytics: Data Model. Best Practices of DW Modelling applied on IoT data for most flexible and efficient analytics

Data Analytics For Beginners

🔥Intellipaat Data Analytics training course: https://intellipaat.com/data-analytics-master-training-course/ In this data analytics for beginners video you wi...

How to Define Data Analytics Capabilities | Hacker Noon

Disclaimer: Many points made in this post have been derived from discussions with various parties, but do not represent any individuals or organisations.

Data Cleaning and Preprocessing — Modelling Subscription for Bank Deposits

The exploration of data has always fascinated me. The kind of insights and information that can be hidden in raw data is invigorating to discover and communicate.