When it comes to small data sets, life can get complicated. In medicine, a data set can easily consist of fewer than 100 patients (rows). In the other dimension, however, it can become pretty large: easily over 3,000 features.

However, sometimes you will find a way to augment your data, which in my case meant creating copies of existing rows with slightly different feature values. That way you can multiply your training data. Of course this is a simplified version of what I really did, but that's a different story. There are many ways to augment data, but this article is not intended to cover the wide field of data augmentation.

But be careful: data augmentation is a powerful tool that has to be used with caution. And even when used correctly, it is not guaranteed to boost the performance of your estimator.

By the way, I wouldn't have been able to write this article without the help of my colleagues and people from Stack Overflow!


Where to use augmented data in your process

Once you have a set of augmented data to enrich your original data set, you will ask yourself how and at which point to merge them. Typically you use sklearn and its modules to evaluate your estimator or to search for optimal hyper-parameters. Popular modules such as RandomizedSearchCV or cross_validate accept a cross-validation method like KFold. When a cross-validation method measures the performance of your estimator, your data is split into a train and a test set. This happens dynamically under the hood of the sklearn methods.
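As a minimal sketch of that standard setup (the data here is synthetic, a stand-in for a small, wide data set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in: 100 rows, 20 features (assumed shapes for illustration)
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Pass a cross-validation method to cross_validate; the train/test
# splitting happens under the hood for every fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
result = cross_validate(LogisticRegression(), X, y, cv=cv)
print(result["test_score"])  # one score per fold
```

The same `cv` object can be passed to RandomizedSearchCV in exactly the same way.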

This is usually fine and means that you don't have to bother with it more than necessary. There is just one problem when you want to use augmented data with a cross-validation method: you don't want augmented data in your test fold. Why is that? You want to know how your estimator performs in reality, and your augmented data does not reflect reality. Additionally, you want to augment only the data in your train fold; augmenting the whole data set before splitting would let information derived from test samples leak into training.
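To make the problem concrete, here is a manual cross-validation loop that does it right: each train fold is augmented (with a toy noise-based augmentation I made up for illustration), while the test fold stays untouched real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def augment(X, y, rng):
    """Toy augmentation: duplicate every sample with small feature noise."""
    X_new = X + rng.normal(scale=0.05, size=X.shape)
    return np.vstack([X, X_new]), np.concatenate([y, y])

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] > 0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Augment ONLY the train fold ...
    X_tr, y_tr = augment(X[train_idx], y[train_idx], rng)
    model = LogisticRegression().fit(X_tr, y_tr)
    # ... and score on the untouched, real test fold
    scores.append(model.score(X[test_idx], y[test_idx]))
```

The drawback of this manual loop is that you lose the convenience of cross_validate and RandomizedSearchCV, which is exactly what the sampler approach below solves.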


Enrich your train fold with a custom sampler inside an imblearn pipeline