Quality Assurance Testing is one of the key areas in Big Data

Data quality issues can derail Big Data, data lake, and ETL projects. Whether the data is big or small, the need for data quality doesn't change: high-quality data is what makes insights possible, and quality is ultimately judged by whether the data lets the business derive the insights it needs.


In this blog, we will walk through the steps to verify that data remains correct when it is migrated from a source to a destination.

Steps involved

  1. Row and Column count
  2. Column Names Check
  3. Subset Data Check without Hashing
  4. Stats Comparison: Min, Max, Mean, Median, Stddev, 25th, 50th, 75th percentile
  5. SHA-256 Hash Validation on the Whole Data (see the sketch after this list)
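
Below is a minimal PySpark sketch of all five checks. It assumes two DataFrames, `source_df` (the MySQL extract) and `target_df` (the data lake copy), are already loaded (the Scenario section shows one illustrative way to do that), and it uses a hypothetical `customer_id` key for the ordered subset check; those names are placeholders of mine, not part of the original framework.

```python
from pyspark.sql import DataFrame, functions as F

def check_counts(source_df: DataFrame, target_df: DataFrame) -> bool:
    # Step 1: row and column counts must match.
    return (source_df.count() == target_df.count()
            and len(source_df.columns) == len(target_df.columns))

def check_column_names(source_df: DataFrame, target_df: DataFrame) -> bool:
    # Step 2: both sides expose exactly the same column names.
    return sorted(source_df.columns) == sorted(target_df.columns)

def check_subset(source_df: DataFrame, target_df: DataFrame,
                 key: str = "customer_id", n: int = 100) -> bool:
    # Step 3: compare the first n rows (ordered by a key) without hashing,
    # so any mismatch is easy to inspect by eye.
    cols = sorted(source_df.columns)
    src = source_df.orderBy(key).select(cols).limit(n)
    tgt = target_df.orderBy(key).select(cols).limit(n)
    return src.exceptAll(tgt).count() == 0

def check_stats(source_df: DataFrame, target_df: DataFrame) -> bool:
    # Step 4: min, max, mean, stddev and the 25th/50th/75th percentiles
    # (the 50% row is the median). Percentiles from summary() are approximate,
    # so in production compare the numeric stats with a small tolerance.
    metrics = ["min", "max", "mean", "stddev", "25%", "50%", "75%"]
    return (source_df.summary(*metrics).collect()
            == target_df.summary(*metrics).collect())

def row_hashes(df: DataFrame) -> DataFrame:
    # Step 5: one SHA-256 fingerprint per row over every column.
    # Nulls are replaced with a sentinel because concat_ws silently drops them.
    cols = sorted(df.columns)
    return df.select(
        F.sha2(F.concat_ws("||",
               *[F.coalesce(F.col(c).cast("string"), F.lit("<NULL>")) for c in cols]),
               256).alias("row_hash"))

def check_full_hash(source_df: DataFrame, target_df: DataFrame) -> bool:
    # The datasets match only if the multisets of row hashes are identical.
    src, tgt = row_hashes(source_df), row_hashes(target_df)
    return src.exceptAll(tgt).count() == 0 and tgt.exceptAll(src).count() == 0
```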

Debugging

When there is a mismatch between source and sink, how do you pinpoint the specific corrupt records in a dataset that may have 3,000+ columns and millions of rows?
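
One possible approach, sketched below under the assumption that both sides share a primary key (a hypothetical `customer_id`): hash every row alongside that key, keep only the keys whose hashes disagree, and then compare just those rows column by column to name the corrupt fields.

```python
from pyspark.sql import DataFrame, functions as F

def keyed_row_hashes(df: DataFrame, key: str) -> DataFrame:
    # One SHA-256 fingerprint per row, carried alongside the primary key.
    cols = sorted(df.columns)
    return df.select(
        F.col(key),
        F.sha2(F.concat_ws("||",
               *[F.coalesce(F.col(c).cast("string"), F.lit("<NULL>")) for c in cols]),
               256).alias("row_hash"))

def find_corrupt_keys(source_df: DataFrame, target_df: DataFrame,
                      key: str = "customer_id") -> DataFrame:
    # Keys present on both sides whose row contents differ.
    # (Rows missing from one side entirely can be found with a left_anti join.)
    src = keyed_row_hashes(source_df, key)
    tgt = keyed_row_hashes(target_df, key)
    return (src.join(tgt, on=key, how="inner")
               .where(src["row_hash"] != tgt["row_hash"])
               .select(key))

def flag_corrupt_columns(source_df: DataFrame, target_df: DataFrame,
                         bad_keys: DataFrame, key: str = "customer_id") -> DataFrame:
    # Restrict both sides to the offending keys, then emit one boolean per
    # column marking exactly which fields differ for each corrupt row.
    src = source_df.join(bad_keys, key).alias("s")
    tgt = target_df.join(bad_keys, key).alias("t")
    joined = src.join(tgt, key)
    return joined.select(
        F.col(key),
        *[(~F.col(f"s.{c}").eqNullSafe(F.col(f"t.{c}"))).alias(c)
          for c in source_df.columns if c != key])
```

Because the column-by-column comparison runs only on the handful of offending keys, it stays cheap even with 3,000+ columns.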

Let’s see each step in action…


Scenario

We have migrated data from MySQL to a data lake. The quality of the data needs to be verified before it is consumed by downstream applications.

For demo purposes, I have read sample customer data (1,000 records) into a Spark DataFrame. Though the demo uses a small volume of data, the same solution scales to a humongous volume of data.
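
For this scenario, loading the two sides might look like the sketch below. All connection details (host, database, table, credentials, driver version) and the lake path are placeholders of my own, not values from the post.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("data-validation")
         # Ship the MySQL JDBC driver with the job (version is illustrative).
         .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33")
         .getOrCreate())

# Source side: read the customer table straight from MySQL over JDBC.
source_df = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://mysql-host:3306/sales")
             .option("dbtable", "customers")
             .option("user", "validator")
             .option("password", "********")
             .option("driver", "com.mysql.cj.jdbc.Driver")
             .load())

# Sink side: read the migrated copy from the data lake (Parquet here).
target_df = spark.read.parquet("s3a://data-lake/raw/customers/")
```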
