Deep Learning with Class Imbalanced Data

Class Imbalance:

In machine learning we are sometimes dealt a very good hand, like the Fashion-MNIST or CIFAR-10 data sets, where the examples of each class are well balanced. But what happens in a classification problem when the distribution of examples across the known classes is biased or skewed? Problems with slight to severe bias in the data set are common, and today we will discuss an approach to handling such class-imbalanced data. Let's consider an extreme case: an imbalanced data set of mails, for which we build a classifier to detect spam. Since spam mails are relatively rare, let's say 5% of all mails are spam. If we just write a simple one-line classifier —

def detect_spam(mail_data):
    return 'not spam'

This gives us the right answer 95% of the time. It is an extreme, hyperbolic example, but you get the problem. Most importantly, a model trained on this data will make high-confidence predictions for the ordinary mails, and because of the extremely low number of spam mails in the training data, it will likely never learn to predict the spam mails correctly. This is why precision, recall, the F1 score, and ROC/AUC curves are the metrics that truly tell us the story. As you may have guessed, one way to reduce this issue is to resample the data so that the classes are balanced. There are several other ways to address the class imbalance problem in machine learning, and an excellent comprehensive review has been put together by Jason Brownlee, check it here.
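To see how misleading raw accuracy is here, we can score the trivial "always not spam" classifier on a hypothetical test set with the same 5% spam rate. This is a minimal sketch in plain Python; the 100-mail split is made up for illustration:

```python
# Hypothetical test set: 5 spam mails (1) and 95 ordinary mails (0).
y_true = [1] * 5 + [0] * 95
# The trivial classifier above predicts "not spam" (0) for everything.
y_pred = [0] * 100

# Build the confusion-matrix counts by hand.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)              # 0.95: looks great
precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0: no spam ever flagged
recall = tp / (tp + fn) if tp + fn else 0.0     # 0.0: every spam mail missed
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall else 0.0)           # 0.0

print(accuracy, precision, recall, f1)  # 0.95 0.0 0.0 0.0
```

Accuracy says the classifier is excellent; recall says it catches exactly zero spam, which is the story that matters here.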
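One simple form of the resampling mentioned above is random oversampling: duplicate minority-class examples until the classes are balanced. A minimal sketch on made-up toy data (the feature vectors here are placeholders, not real mail features):

```python
import random

random.seed(0)  # reproducible duplication

# Toy imbalanced data set matching the 5% spam rate: (features, label) pairs.
majority = [([float(i), 0.0], 0) for i in range(95)]  # "not spam"
minority = [([float(i), 1.0], 1) for i in range(5)]   # "spam"

# Duplicate randomly chosen minority examples until the class sizes match.
oversampled = minority + [random.choice(minority)
                          for _ in range(len(majority) - len(minority))]
balanced = majority + oversampled
random.shuffle(balanced)

labels = [y for _, y in balanced]
print(labels.count(0), labels.count(1))  # 95 95: classes are now balanced
```

Oversampling risks overfitting to the duplicated examples; undersampling the majority class or generating synthetic minority examples (e.g. SMOTE) are common alternatives covered in the review linked above.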


A Loss Function Suitable for Class Imbalanced Data: “Focal Loss”
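Focal loss, introduced by Lin et al. for dense object detection, reshapes cross-entropy so that easy, well-classified examples are down-weighted and training focuses on the hard (often minority-class) examples: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). A minimal plain-Python sketch of the binary case (the function name and the example probabilities are mine; gamma=2.0 and alpha=0.25 are the defaults from the paper):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, y: true label (0 or 1).
    gamma down-weights easy examples; alpha balances the two classes.
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction contributes almost nothing to the loss...
easy = focal_loss(0.95, 1)
# ...while a badly misclassified positive keeps a loss orders of magnitude larger.
hard = focal_loss(0.05, 1)
print(easy < hard)  # True
```

With gamma=0 and alpha=0.5 this reduces (up to a constant factor) to ordinary binary cross-entropy; increasing gamma shrinks the contribution of the easy majority examples.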