Enrich your train fold with a custom sampler inside an imblearn pipeline

When it comes to small data sets, life can get complicated. In medicine, a data set can easily consist of fewer than 100 patients (rows). Yet in the other dimension it can become pretty large, easily over 3000 features.

However, sometimes you will find a way to augment your data. In my case, that means duplicating the rows of the data set with slightly different feature values, which multiplies the amount of training data. Of course this is a simplified version of what I really did, but that’s a different story. There are different ways to augment your data, but this article is not intended to cover the wide field of data augmentation.
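To make the idea of "copies with slightly different feature values" concrete, here is a minimal sketch. It is not what I really did: the helper name `augment_with_jitter`, its parameters, and the choice of Gaussian noise scaled by each feature's standard deviation are all assumptions made purely for illustration.

```python
import numpy as np


def augment_with_jitter(X, y, n_copies=2, noise_scale=0.01, seed=0):
    """Return the original rows plus n_copies noisy copies of each row.

    Hypothetical helper: Gaussian jitter is an assumed stand-in for
    "slightly different feature values", not the method used in the article.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Scale the noise per feature so that columns with a wide value range
    # are perturbed proportionally to their spread.
    feature_std = X.std(axis=0)

    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        noise = rng.normal(0.0, 1.0, size=X.shape) * noise_scale * feature_std
        X_parts.append(X + noise)
        y_parts.append(y)  # labels are copied unchanged

    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy training fold: 3 patients, 2 features (made-up numbers).
X_train = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0]])
y_train = np.array([0, 1, 0])

X_aug, y_aug = augment_with_jitter(X_train, y_train, n_copies=2)
print(X_aug.shape, y_aug.shape)  # (9, 2) (9,)
```

A function of this shape, taking `(X, y)` and returning the enlarged `(X, y)`, is the kind of thing a resampling step in a pipeline can call while fitting. Because it is applied only to the rows it receives, the validation fold stays free of augmented rows as long as only the train fold is passed in.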

But you have to be careful: data augmentation is a powerful weapon that has to be used with caution. And even when used correctly, it is not guaranteed to boost the performance of your estimator.

By the way, I wouldn’t have been able to write this article without the help of my colleagues and the people on Stack Overflow!

#python #machine-learning #crossvalidation #data-augmentation #sklearn

towardsdatascience.com
