Sooner or later in your data science career you will come across a problem where one event, usually the one that you are trying to predict, is less frequent than the other or others.

After all, reality is like that: car crashes and people with diseases are scarcer (thankfully!) than completed car trips and healthy people.

This type of problem is known as imbalanced data. And while no single threshold defines it, you know your data is imbalanced when its class distribution is skewed.

At this point you might be thinking, if my data represents reality then this is a good thing. Well, your machine learning (ML) algorithms beg to differ.

I am not going to dive into the details of the problems associated with imbalanced classification (if you wish to know more about this particular topic you can read about it here), but bear with me for a second:

Imagine that your ML algorithm needs to “see” healthy patients 1,000 times to recognize what a healthy patient is, and the ratio of unhealthy to healthy patients is 1:1,000. But you want it to recognize patients with diseases too, so you will need to “feed” it 1,000 unhealthy patients as well. At that ratio, collecting 1,000 unhealthy patients means collecting about 1,000,000 healthy ones, so you actually need a database of roughly a million patients for your ML algorithm to have enough information to recognize both types.

Sidenote: this is merely an illustration, under the hood things don’t happen exactly like that.
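To make the arithmetic concrete, here is a minimal sketch of the calculation above. The function name and its parameters are hypothetical, chosen only for this illustration, and the "1,000 examples per class" requirement is the same simplifying assumption used in the text.

```python
def samples_needed(minority_examples: int, majority_per_minority: int) -> dict:
    """Rows required so the rare class appears `minority_examples` times,
    assuming 1 rare row for every `majority_per_minority` common rows."""
    majority_rows = minority_examples * majority_per_minority
    return {
        "minority": minority_examples,
        "majority": majority_rows,
        "total": minority_examples + majority_rows,
    }


# 1,000 unhealthy patients at a 1:1,000 unhealthy-to-healthy ratio
needs = samples_needed(minority_examples=1000, majority_per_minority=1000)
print(needs)
```

Make the event ten times rarer and the required total grows roughly tenfold as well.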

I am betting by now you are starting to grasp how this problem can scale up pretty quickly.

Thankfully, our dear statistician friends have developed methods to help us solve this issue.

In fact, several resampling methods have been around since the 1930s, each with its own use case. From permutation tests to the bootstrap, there are a lot of options.
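As a rough illustration of what one of these methods looks like in practice, here is a minimal bootstrap sketch using only the Python standard library. The data is made up for the example (a hypothetical sample with 5% rare events), and the helper name `bootstrap_means` is an assumption of this sketch, not a library function.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Resample `data` with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


# Hypothetical labels: 1 = rare event, 0 = common event
labels = [1] * 5 + [0] * 95

means = sorted(bootstrap_means(labels))
# Percentile interval: cut off the lowest and highest 2.5% of resampled means
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means))]

print(f"observed rate: {statistics.fmean(labels):.3f}")
print(f"approx. 95% bootstrap interval: [{lower:.3f}, {upper:.3f}]")
```

The spread of the resampled means gives a sense of how uncertain the estimated event rate is when only a handful of rare events were observed.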

And if you are not new to data science, chances are you have already applied some resampling techniques in your model training process, like cross-validation.
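For readers who have not seen it spelled out, the sketch below shows the idea behind a stratified k-fold split, a cross-validation variant often used with skewed classes. It is a simplified, standard-library illustration with made-up labels and a hypothetical helper name (`stratified_folds`); in practice you would more likely reach for a library implementation such as scikit-learn's `StratifiedKFold`.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Return k lists of row indices, each with roughly the same class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the rows of each class round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical labels: 10 rare events among 100 rows
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)

for i, test_idx in enumerate(folds):
    # Every fold takes one turn as the test set; the rest form the training set
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    rare_in_test = sum(labels[j] for j in test_idx)
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)} rare-in-test={rare_in_test}")
```

Splitting per class keeps the rare event present in every fold, which a plain random split cannot guarantee when positives are scarce.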
