The problem of imbalanced class distribution is prevalent in data science, and ML engineers come across it frequently. I am a chatbot developer at IMImobile Pvt Ltd and faced this scenario recently while training an intent classification module. Any live business chatbot accessible to real-world users is bound to attract a significant number of out-of-scope queries alongside messages pertaining to the task it is designed to perform. Even among the relevant task-oriented messages, imbalances are to be expected, since not all topics covered by the bot can be equally popular. For example, in a banking use case, balance inquiries will far outnumber home loan applications.

Bot building is unlike traditional application development. While the latter is relatively stable and updated less often, the former needs frequent updates to improve the user experience and the intelligence of the bot. An imbalanced dataset is one in which the number of samples belonging to one class is significantly higher or lower than the number belonging to the other classes. Most ML/DL classification algorithms aren't equipped to handle imbalanced classes and tend to become biased towards the majority classes.
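As a quick illustration, a simple class-count check is usually the first place the problem shows up. The intent labels and counts below are hypothetical, not the actual production data:

```python
from collections import Counter

# Hypothetical intent counts for a banking bot (illustrative numbers only)
labels = (
    ["balance_inquiry"] * 950
    + ["home_loan_application"] * 30
    + ["out_of_scope"] * 20
)

counts = Counter(labels)
total = sum(counts.values())
for intent, count in counts.most_common():
    # Print each intent with its absolute count and share of the dataset
    print(f"{intent:>22}: {count:4d} ({count / total:.1%})")
```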

Why accuracy is a sham in the case of an imbalanced dataset

Aiming only for high accuracy on an imbalanced dataset can be counter-productive because standard classifier algorithms like Decision Trees and Logistic Regression have no built-in mechanism for handling imbalanced classes. This leads to a heavy bias towards the larger classes, while classes with fewer data points are treated as noise and often ignored. The result is a higher misclassification rate for the minority classes than for the majority classes. Therefore, accuracy is not a meaningful metric for evaluating a model trained on imbalanced data.

Consider the following case: you have two classes, A and B. Class A makes up 95% of your dataset and class B the remaining 5%. You can reach an accuracy of 95% by simply predicting class A every time, but this yields a useless classifier for your intended use case. Instead, a properly calibrated method may achieve a lower accuracy but a substantially higher true positive rate (recall) on class B, which is really the metric you should have been optimizing for.
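A minimal sketch of that effect, using scikit-learn metrics and a degenerate "always predict the majority class" baseline (the 0/1 labels here are stand-ins for classes A and B):

```python
from sklearn.metrics import accuracy_score, recall_score

# 95 samples of class A (label 0) and 5 of class B (label 1), the minority class we care about
y_true = [0] * 95 + [1] * 5

# A "classifier" that predicts the majority class A every time
y_pred = [0] * 100

print("Accuracy:", accuracy_score(y_true, y_pred))                     # 0.95
print("Recall (class B):", recall_score(y_true, y_pred, pos_label=1))  # 0.0
```

The accuracy looks excellent, but the recall on the minority class is zero, which is exactly the failure mode described above.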

There are several well-known methods for handling imbalanced datasets, but most of them don't work well for text data. In this article, I am sharing all the tricks and techniques I used to balance my dataset, along with the code, which boosted the f1-score by 30%.

Strategies for handling Imbalanced Datasets:

Can you gather more data?

You might think this is not the solution you're looking for, but gathering more meaningful and diverse data is always better than resampling the original data or generating artificial data from existing data points.

