Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a recordset, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. [Wikipedia]

Why Data Cleansing/Cleaning?

If you talk to a Data Scientist or Data Analyst with a lot of experience in building machine learning models, they will tell you that preparing data takes a very long time and is very important.

Machine learning models are designed to ingest huge amounts of data and find patterns in that data, so they can provide intelligent decisions.

Let’s assume that you are building an ML model to classify images of apples and oranges. If you feed it only orange images, the model will not be able to predict apples because it does not have enough data to learn and define patterns for them.

This example tells us: “Garbage in, garbage out.”

GIGO

If the data fed into an ML model is of poor quality, the model will be of poor quality.

Problems with Data

Data Problems

How to tackle each one of the above problems?


Insufficient data

[Sherlock Holmes:] I had come to an entirely erroneous conclusion which shows, my dear Watson, how dangerous it always is to reason from insufficient data.

-The Adventure of the Speckled Band


Models trained with insufficient data perform poorly in prediction. If you have only a few records for your ML model, you will run into one of the two well-known issues in ML modeling below.

Overfitting: reading too much into too little data.

Underfitting: building an overly simplistic model from the available data.
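Both failure modes can be illustrated with a quick curve-fitting sketch (purely illustrative, using numpy polynomial fits on a handful of noisy points):

```python
# Illustrative sketch: fitting the same 8 noisy points with models of
# different complexity. A degree-7 polynomial overfits (it interpolates
# the points, driving training error to ~0), while a constant underfits
# (high training error).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=8)

def train_error(degree):
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print("underfit (degree 0):", train_error(0))  # high training error
print("good fit (degree 3):", train_error(3))
print("overfit (degree 7):", train_error(7))   # near-zero training error
```

The overfit model looks perfect on the training points but would generalize worst to new data.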

In the real world, insufficient data is a common struggle for a project: the relevant data may not be available, and even when it is, the actual process of collecting it can be very difficult and time-consuming.

The truth is there is no great solution for dealing with insufficient data; you simply need to find more data sources and wait until you have collected enough relevant data.

**But**, there are some things you can do to work around this problem. Note that the techniques we will discuss are not applicable to every use case.

Now, what can we do if we have a small dataset?

Model Complexity: if you have little data, you can choose to work with a simpler model; a simpler model works better with less data.

Transfer Learning: if you are working with neural networks and deep learning techniques, you can use transfer learning.

Data Augmentation: you can increase the amount of data by using data augmentation techniques, which are usually applied to image data.

Synthetic Data: understand the kind of data that you need to build your model, then use the statistical properties of that data to generate synthetic (artificial) data.
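As a toy illustration of the synthetic-data idea (all numbers below are made up for the example), you can estimate the mean and covariance of a small real dataset and sample new records with the same statistical properties:

```python
# Illustrative sketch: generate synthetic records that share the
# statistical properties (mean and covariance) of a small real dataset.
import numpy as np

rng = np.random.default_rng(0)

# Pretend this is the small real dataset we have (50 rows, 3 features).
real = rng.normal(loc=[10.0, 5.0, 0.0], scale=[2.0, 1.0, 0.5], size=(50, 3))

# Estimate its statistical properties...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and draw as many synthetic records as we need from them.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print(synthetic.shape)  # (500, 3)
```

This only preserves first- and second-order statistics; real synthetic-data pipelines need more care with distributions and dependencies between features.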

Model Complexity

Every machine learning algorithm has its own set of parameters; compare, for example, simple linear regression with decision tree regression.

If you have less data, choose a simpler model with fewer model parameters. A simpler model is less susceptible to overfitting your data, that is, memorizing the patterns in your data.

Some machine learning models, such as the Naïve Bayes classifier or the logistic regression model, are simple and have few parameters. Decision trees have many more parameters and are considered complex models.
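A minimal sketch of this trade-off, assuming scikit-learn is available (the dataset is synthetic and the sizes are illustrative):

```python
# Hypothetical sketch: a simple model (logistic regression) vs. a complex
# one (an unpruned decision tree) on a deliberately small dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Only 40 samples: a "small data" situation.
X, y = make_classification(n_samples=40, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

simple = LogisticRegression().fit(X_train, y_train)
complex_ = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The unpruned tree memorizes the training set (train accuracy 1.0),
# a symptom of overfitting on this little data.
print("tree  train/test:", complex_.score(X_train, y_train),
      complex_.score(X_test, y_test))
print("logit train/test:", simple.score(X_train, y_train),
      simple.score(X_test, y_test))
```

The gap between a model's training and test accuracy is the practical signal of overfitting here.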

Another option is to train your model using ensemble techniques.

Ensemble Learning: a machine learning technique in which several learners are combined to obtain better performance than any of the individual learners.


Ensemble Learning
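A minimal ensemble sketch, assuming scikit-learn is available (the data and estimator choices are illustrative): three different learners are combined with majority voting.

```python
# Hypothetical sketch: a simple hard-voting ensemble of three learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each learner votes on the label; the majority vote wins.
ensemble = VotingClassifier(estimators=[
    ("logit", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
ensemble.fit(X, y)
print("training accuracy:", ensemble.score(X, y))
```

Because the learners make different kinds of errors, their combined vote is often more robust than any single model.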

Transfer Learning

If you are working with neural networks and you don’t have enough data to train your model, transfer learning may solve this problem.

Transfer Learning: the practice of re-using a trained neural network that solves a problem similar to yours, usually leaving the network architecture unchanged and re-using some or all of the model weights.

Transfer Learning

Transferred knowledge is especially useful when the new dataset is small and not sufficient to train a model from scratch.
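The core idea can be sketched in plain numpy, without a deep learning framework: keep a "pretrained" feature extractor frozen and train only a new output layer on the small dataset (all weights and data below are synthetic stand-ins):

```python
# Illustrative numpy sketch of transfer learning: a frozen feature
# extractor plus a small, newly trained head.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights came from a network trained on a large, related task.
pretrained_W = rng.normal(size=(4, 8))  # frozen feature extractor

def extract_features(X):
    # Frozen layer: pretrained_W is never updated.
    return np.tanh(X @ pretrained_W)

# A tiny new dataset (too small to train a full network from scratch).
X_small = rng.normal(size=(20, 4))
y_small = (X_small[:, 0] > 0).astype(float)

# Train only the new head (logistic regression via gradient descent).
head_w = np.zeros(8)
feats = extract_features(X_small)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ head_w)))
    head_w -= 0.5 * feats.T @ (p - y_small) / len(y_small)

preds = (1.0 / (1.0 + np.exp(-(feats @ head_w))) > 0.5).astype(float)
print("training accuracy of the new head:", (preds == y_small).mean())
```

In a real framework the frozen part would be a pretrained backbone (for example, a network trained on a large image corpus) and only the last layer or two would be fine-tuned on your data.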

Data Augmentation


Data augmentation techniques allow you to increase the number of training samples. They are typically used with image data: you take the images you are working with and perturb them in some way to generate new images.

You can perturb these images by applying scaling, rotation, and affine transforms. These image processing operations are often used as preprocessing techniques to make image classification models built with CNNs (convolutional neural networks) more robust, and they can also be used to generate additional samples to work with.
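A minimal sketch of the idea on a toy "image" (a small numpy array standing in for real pixel data); flips and rotations alone already multiply the number of training samples:

```python
# Illustrative sketch: simple data augmentation with numpy flips/rotations.
import numpy as np

image = np.arange(9).reshape(3, 3)  # stand-in for one training image

augmented = [
    image,
    np.fliplr(image),      # horizontal flip
    np.flipud(image),      # vertical flip
    np.rot90(image),       # 90-degree rotation
    np.rot90(image, k=2),  # 180-degree rotation
]

# One original image became five training samples.
print(len(augmented))  # 5
```

Real augmentation pipelines add random scaling, cropping, and affine transforms, applied on the fly during training.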

#data-cleansing #data-science #data-cleaning #data-analysis

The Imperative of Data Cleansing — Part 1