Living in this modern age, we deal with huge amounts of data every day. For most of us, the first thing we do after waking up is check our smartphone: “Is there an important email that I haven’t read?”, “Is a friend having a birthday today?”, or “What’s hot in the news today?” Those are all examples of data being put to use. Now imagine having to face raw data that has never been processed.

WHAT IS DATA PREPROCESSING

“A simple definition could be that data preprocessing is a data mining technique to turn the raw data gathered from diverse sources into cleaner information that’s more suitable for work. In other words, it’s a preliminary step that takes all of the available information to organize it, sort it, and merge it.”

Raw data can have missing or inconsistent values and can contain a lot of redundant information. The most common problems with raw data fall into three groups:

· Missing data: you can also see this as inaccurate data, since the gaps left by absent information might be relevant to the final analysis. Missing data often appears when there’s a problem in the collection phase, such as a glitch that caused system downtime, mistakes in data entry, or issues with biometrics use, among others.

· Noisy data: this group covers erroneous data and outliers that appear in the data set but carry no meaningful information. The noise comes from human mistakes, rare exceptions, mislabels, and other issues during data gathering.

· Inconsistent data: inconsistencies arise when you keep similar data in different formats and files. Duplicates in different formats, mistakes in codes of names, or the absence of data constraints often lead to inconsistent data, which introduces deviations that you have to deal with before analysis.

If you don’t take care of those issues, the final output will be plagued with faulty insights. That’s especially true for more sensitive analyses, which are more affected by small mistakes, such as work in new fields where minimal variations in raw data can lead to wrong assumptions.
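To make the three problem groups above more concrete, here is a minimal sketch in plain Python that flags each of them in a tiny, made-up set of purchase records. Everything in it is an assumption for illustration: the records, the field names, the helper functions, and the choice of a median-based outlier rule (the modified z-score with the commonly used 3.5 cutoff). It is one possible way to spot these issues, not a prescribed method.

```python
from statistics import median

# Hypothetical raw records; every name and value here is invented for the example.
records = [
    {"customer": "Alice", "country": "US", "amount": 25.0},
    {"customer": "Bob", "country": "usa", "amount": None},    # missing amount
    {"customer": "Carol", "country": "US", "amount": 30.0},
    {"customer": "Dave", "country": "U.S.", "amount": 27.5},  # same country, other format
    {"customer": "Eve", "country": "US", "amount": 9999.0},   # likely a data entry error
    {"customer": "Frank", "country": "US", "amount": 22.0},
]


def find_missing(rows, field):
    """Return the indexes of rows where the field is absent or None."""
    return [i for i, row in enumerate(rows) if row.get(field) is None]


def find_outliers(rows, field, threshold=3.5):
    """Return the indexes of rows whose value lies far from the median.

    Uses the modified z-score, which is based on the median absolute
    deviation (MAD) and is less distorted by the outliers themselves
    than a mean-based score would be.
    """
    values = [(i, row[field]) for i, row in enumerate(rows) if row.get(field) is not None]
    med = median(v for _, v in values)
    mad = median(abs(v - med) for _, v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in values if 0.6745 * abs(v - med) / mad > threshold]


def find_inconsistent(rows, field, allowed):
    """Return the indexes of rows whose value is not in the allowed set."""
    return [i for i, row in enumerate(rows) if row.get(field) not in allowed]


missing = find_missing(records, "amount")
outliers = find_outliers(records, "amount")
inconsistent = find_inconsistent(records, "country", {"US"})
```

With these sample records, `missing` picks out Bob’s row, `outliers` picks out Eve’s row, and `inconsistent` picks out the two rows that spell the country differently.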

WHY WE NEED DATA PREPROCESSING

By now, you’ve surely realized why data preprocessing is so important. Since mistakes, redundancies, missing values, and inconsistencies all compromise the integrity of the data set, you need to fix all those issues to get a more accurate outcome. Imagine you are training a machine learning algorithm on your customers’ purchases with a faulty data set. Chances are that the system will develop biases and deviations that will produce a poor user experience.
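As a rough illustration of what fixing those issues can look like for a customer purchase data set, the sketch below fills missing amounts, brings country names into one format, and drops duplicate rows. The records, the alias table, the `clean` helper, and the decision to fill gaps with the median are all assumptions made for this example. Other choices, such as dropping incomplete rows, can be just as reasonable depending on the data.

```python
from statistics import median

# Hypothetical purchase records; names and values are invented for the example.
raw = [
    {"customer": "Alice", "country": "US", "amount": 25.0},
    {"customer": "Bob", "country": "usa", "amount": None},    # missing amount
    {"customer": "Carol", "country": "U.S.", "amount": 30.0},
    {"customer": "Alice", "country": "US", "amount": 25.0},   # exact duplicate
]

# Assumed mapping from the spellings seen in the data to one canonical code.
COUNTRY_ALIASES = {"us": "US", "usa": "US", "u.s.": "US"}


def clean(rows):
    """Fill missing amounts, normalize country names, and drop duplicates."""
    known = [row["amount"] for row in rows if row["amount"] is not None]
    fill_value = median(known)  # one simple way to fill a gap

    cleaned = []
    seen = set()
    for row in rows:
        raw_country = row["country"]
        country = COUNTRY_ALIASES.get(raw_country.strip().lower(), raw_country)
        amount = fill_value if row["amount"] is None else row["amount"]

        # Compare rows only after normalizing, so that duplicates stored
        # in different formats are recognized as the same record.
        key = (row["customer"], country, amount)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"customer": row["customer"], "country": country, "amount": amount})
    return cleaned


cleaned = clean(raw)
```

Normalizing before checking for duplicates matters here: two rows that differ only in how the country is written would otherwise both survive.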

Thus, before using that data for the purpose you want, you need it to be as organized and “clean” as possible. There are several ways to do so, depending on what kind of problem you’re tackling. Ideally, you’d use all of the following techniques to get a better data set. The picture below will help you understand the steps involved in data preprocessing.

