Learn how to correctly prepare Data for Machine Learning

Data preparation plays an important role in your workflow. You need to transform the data in a way that a computer will be able to work with it.


Steps in Data Preprocessing

Any database is a collection of data objects. You can also call them data samples, events, observations, or records. However, each of them is described with the help of different characteristics. In data science lingo, they are called attributes or features.

Data preprocessing is a necessary step before building a model with these features.

Graphic showing the hierarchical relationship among the main headings and subheadings of this article

Image source: Author

It usually happens in stages. Let’s have a closer look at each of them.


Data Quality Assessment

First of all, you need to have a good look at your database and perform a data quality assessment. A random collection of data often has irrelevant bits. Here are some examples.

Mismatching in data types

Quite often, you might mix together datasets that use different data formats; hence, the mismatching: integer vs. float or UTF8 vs ASCII.

Different dimensions of data arrays

When you aggregate data from different datasets, for example, from five different arrays of data for voice recognition, three fields that are present in one of them can be missing in four other arrays.

Mixture of data values

Let’s imagine that you have data collected from two independent sources. As a result, the gender field has two different values for women: woman and female.

To clean this dataset, you have to make sure that the same name is used as the descriptor within the dataset (it can be woman in our case).

Outliers in the dataset

Within 200 years of daily temperature observations for New York, there were several days with very low temperatures in summer.

Outliers are very dangerous. They can strongly influence the output of a machine learning model. Usually, the researchers evaluate the outliers to identify whether each particular record is the result of an error in the data collection or a unique phenomenon which should be taken into consideration for data processing.

Missing data

You may also notice that some important values are missing. These problems arise due to the human factor, program errors, or other reasons. They will affect the accuracy of the predictions, so before going any further with your database, you need to do data cleaning.

#data-science #artificial-intelligence #machine-learning #programming #developer

How to Correctly Prepare Data for Machine Learning
1.80 GEEK