Data Science isn’t only about developing models. A lot of the work involves cleaning data and selecting features. Plugging features that have similar distributions but significantly different means, or that are on vastly different scales, into a model can lead to erroneous predictions. A common solution to these problems is to first “normalize” the features to eliminate significant differences in mean and variance.

The term “normalization” can be misleading (and shouldn’t be confused with database normalization), because it has come to mean many things in statistics. There is, however, a common theme among normalization techniques: bringing separate datasets into alignment for easier comparison. The two techniques we’ll focus on are Residual Extraction, which shifts a dataset’s mean, and Re-scaling, which stretches and squeezes the values in a dataset to fit on a scale from 0 to 1. Needless to say, both of these techniques eliminate the units attached to the datasets. Thankfully, both the shifting and the scaling can be accomplished easily in Python and computed efficiently with the NumPy package.
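Before diving into each technique, here is a minimal sketch of both transformations in NumPy, using a made-up dataset (the values are illustrative, not from the article):

```python
import numpy as np

# A toy dataset with arbitrary units (illustrative values only).
x = np.array([3.0, 7.0, 12.0, 5.0])

# Residual Extraction: shift the values so the dataset's mean becomes 0.
residuals = x - x.mean()

# Re-scaling (min-max): squeeze the values onto the [0, 1] interval.
rescaled = (x - x.min()) / (x.max() - x.min())

print(residuals)  # centered around 0
print(rescaled)   # smallest value maps to 0.0, largest to 1.0
```

Note that both results are unitless: subtracting the mean or dividing by the range removes whatever units the original measurements carried.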

Extracting Residuals

Let’s first explore the Residual Extraction technique. A residual is the difference between a value in a dataset and the dataset’s mean. This technique is useful when you have datasets with similar distributions but significantly different means, making comparisons between the datasets difficult. For example, let’s say we have an exam that’s taken by two different classes of equal size. The questions are the same, in the same order, and have the same answers. However, the average scores of the two classes differ: Class 1 averaged an 82 on the test and Class 2 averaged a 77. How can we combine the scores from the two classes?
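One answer is to extract the residuals from each class separately, so both sets of scores are centered at zero and can be pooled. A minimal sketch, using made-up score lists chosen so the means come out to 82 and 77 (the article doesn’t give individual scores):

```python
import numpy as np

# Hypothetical exam scores for the two classes (illustrative values;
# chosen so the means are 82 and 77, matching the example).
class1 = np.array([78, 85, 80, 88, 79])  # mean = 82
class2 = np.array([74, 80, 72, 79, 80])  # mean = 77

# Extract residuals: subtract each class's own mean from its scores.
res1 = class1 - class1.mean()
res2 = class2 - class2.mean()

# Both residual sets are now centered at 0, so they can be combined
# and compared on equal footing.
combined = np.concatenate([res1, res2])
print(combined.mean())  # 0.0
```

A positive residual now means “above that class’s average” regardless of which class the score came from, which is what makes the pooled comparison fair.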


Normalization Techniques in Python Using NumPy