In machine learning (ML), the situation when a model does not generalize well from the training data to unseen data is called overfitting. As you might know, it is one of the trickiest obstacles in applied machine learning.

The first step in tackling this problem is to actually know that your model is **overfitting**. That is where proper cross-validation comes in.

After identifying the problem, you can prevent it by applying regularization or training with more data. Still, sometimes you might not have additional data to add to your initial dataset, and acquiring and labeling new data points may be the wrong path. In many cases it will deliver better results, but it is often time-consuming and expensive.

That is where Data Augmentation (DA) comes in.

In this article we will cover:

  • What is Data Augmentation – definition, the purpose of use, and techniques
  • Built-in augmentation methods in DL frameworks – TensorFlow, Keras, PyTorch, MXNet
  • Image DA libraries – Augmentor, Albumentations, ImgAug, AutoAugment, Transforms
  • Speed comparison of these libraries
  • Best practices, tips, and tricks

What is Data Augmentation

Data Augmentation is a technique that can be used to artificially expand the size of a training set by creating modified data from the existing one. It is a good practice to use DA if you want to prevent overfitting, or the initial dataset is too small to train on, or even if you want to squeeze better performance from your model.

Let’s make this clear: Data Augmentation is not only used to prevent overfitting. In general, having a large dataset is crucial for the performance of both ML and Deep Learning (DL) models. However, we can improve the performance of a model by augmenting the data we already have. That means Data Augmentation is also good for enhancing a model’s performance.

In general, DA is frequently used when building a DL model. That is why throughout this article we will mostly talk about performing Data Augmentation with various DL frameworks. Still, you should keep in mind that you can augment data for classical ML problems as well.

You can augment:

  1. Audio
  2. Text
  3. Images
  4. Any other types of data

We will focus on image augmentations as those are the most popular ones. Nevertheless, augmenting other types of data is just as efficient and easy. That is why it’s good to remember some common techniques that can be used to augment data.

Data Augmentation techniques

We can apply various changes to the initial data. For example, for images we can use:

  1. Geometric transformations – you can randomly flip, crop, rotate or translate images, and that is just the tip of the iceberg
  2. Color space transformations – change RGB color channels, intensify any color
  3. Kernel filters – sharpen or blur an image
  4. Random Erasing – delete a part of the initial image
  5. Mixing images – basically, mix images with one another. Might be counterintuitive but it works
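To make these techniques concrete, here is a minimal NumPy sketch of a few of them – a geometric flip, a brightness change, Random Erasing, and a simple image mix. The function names and parameters are illustrative (in practice you would use a library such as Albumentations or torchvision), but the operations themselves match the list above:

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(image):
    # Geometric transformation: mirror the image along its vertical axis.
    return np.fliplr(image)

def adjust_brightness(image, factor):
    # Color space transformation: scale all pixel intensities.
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(image.dtype)

def random_erase(image, patch_size=8):
    # Random Erasing: zero out a randomly placed square patch.
    out = image.copy()
    h, w = out.shape[:2]
    y = rng.integers(0, h - patch_size)
    x = rng.integers(0, w - patch_size)
    out[y:y + patch_size, x:x + patch_size] = 0
    return out

def mix_images(image_a, image_b, alpha=0.4):
    # Mixing images: blend two images with a random weight (mixup-style).
    lam = rng.beta(alpha, alpha)
    return (lam * image_a + (1 - lam) * image_b).astype(image_a.dtype)

# Toy 32x32 RGB image standing in for a real training sample.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = [horizontal_flip(image),
             adjust_brightness(image, 1.2),
             random_erase(image)]
```

Each function returns a new array of the same shape as the input, so the augmented samples can be appended directly to the training set.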

For text, there are:

  1. Word/sentence shuffling
  2. Word replacement – replace words with synonyms
  3. Syntax-tree manipulation – paraphrase the sentence to be grammatically correct using the same words
  4. Other techniques described in the article about Data Augmentation in NLP
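The first two text techniques can be sketched in a few lines of pure Python. The synonym table below is a hypothetical stand-in for illustration; a real pipeline would pull synonyms from a thesaurus such as WordNet:

```python
import random

random.seed(0)

# Hypothetical synonym table used only for this example.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}

def shuffle_words(sentence):
    # Word shuffling: randomly reorder the tokens of a sentence.
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)

def replace_synonyms(sentence, p=0.5):
    # Word replacement: swap each known word for a synonym with probability p.
    out = []
    for word in sentence.split():
        if word in SYNONYMS and random.random() < p:
            out.append(random.choice(SYNONYMS[word]))
        else:
            out.append(word)
    return " ".join(out)

sentence = "the quick fox looked happy"
augmented = [shuffle_words(sentence), replace_synonyms(sentence, p=1.0)]
```

Note that word shuffling keeps the vocabulary of the sentence intact while changing its order, which is why it is cheap but can distort meaning; synonym replacement preserves word order but changes surface forms.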


Data Augmentation in Python: Everything You Need to Know - neptune.ai