Data Validation Framework in Apache Spark

Quality assurance testing is one of the key areas in Big Data. Data quality issues can ruin the success of many Big Data, data lake, and ETL projects.

Whether the data is big or small, the need for data quality doesn’t change. High-quality data is a prerequisite for getting reliable insights from it. Data quality is ultimately measured by whether the data lets the business derive the insights it needs.

In this post, we walk through the steps to verify data quality when you migrate data from a source to a destination.

Steps involved

  1. Row and Column count
  2. Column Names Check
  3. Subset Data Check without Hashing
  4. Stats Comparison — Min, Max, Mean, Median, Stddev, 25th, 50th, 75th percentile
  5. SHA256 Hash Validation on Whole data
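
To make the five checks above concrete, here is a minimal sketch in plain Python rather than Spark code, so it runs without a cluster. The sample rows, the column names, and every function name are assumptions made for illustration. On a real migration the same checks would typically be expressed as Spark DataFrame operations (for example `count()`, `columns`, and the `sha2` function).

```python
import hashlib
import statistics

# Hypothetical sample rows standing in for the source and destination tables.
source = [
    {"id": 1, "name": "Asha", "balance": 120.5},
    {"id": 2, "name": "Ben", "balance": 80.0},
    {"id": 3, "name": "Chen", "balance": 42.25},
]
sink = [dict(row) for row in source]  # a faithful copy, so every check should pass


def shape(rows):
    """Step 1: (row count, column count)."""
    return len(rows), (len(rows[0]) if rows else 0)


def column_names(rows):
    """Step 2: column names in their original order."""
    return list(rows[0].keys()) if rows else []


def subset_matches(src, dst, key, n):
    """Step 3: compare the first n rows (ordered by key) value by value, no hashing."""
    def first(rows):
        return sorted(rows, key=lambda r: r[key])[:n]
    return first(src) == first(dst)


def numeric_stats(rows, column):
    """Step 4: a few summary statistics for one numeric column."""
    values = [row[column] for row in rows]
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stddev": statistics.stdev(values),  # needs at least two rows
    }


def dataset_hash(rows):
    """Step 5: one SHA-256 digest over all rows, independent of row order."""
    row_hashes = sorted(
        hashlib.sha256(
            "|".join(repr(row[col]) for col in sorted(row)).encode("utf-8")
        ).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()


checks = {
    "shape": shape(source) == shape(sink),
    "column_names": column_names(source) == column_names(sink),
    "subset": subset_matches(source, sink, key="id", n=2),
    "stats": numeric_stats(source, "balance") == numeric_stats(sink, "balance"),
    "hash": dataset_hash(source) == dataset_hash(sink),
}
print(checks)
```

Sorting the per-row hashes before combining them is a design choice in this sketch: it keeps the final digest stable even if the destination returns rows in a different order than the source.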

Debugging

When there is a mismatch between source and sink, how do you pinpoint the specific corrupt data in a dataset that may have 3000+ columns and millions of records?
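
One possible way to narrow a mismatch down, sketched here in plain Python under stated assumptions: hash every row, compare hashes by primary key, and run a column-by-column comparison only on the rows whose hashes differ. The `id` key column, the sample rows, and the function names are hypothetical, and the sketch assumes both sides already passed the column-names check.

```python
import hashlib


def row_hash(row, columns):
    """SHA-256 of one row, with columns taken in a fixed order."""
    payload = "|".join(repr(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_mismatches(source, sink, key):
    """Return {key value: [differing column names]} for rows that do not match.

    Rows present on only one side are reported as ['<missing row>'].
    Assumes both sides share the same column names.
    """
    columns = sorted(source[0].keys()) if source else []
    src = {row[key]: row for row in source}
    dst = {row[key]: row for row in sink}
    report = {}
    for k in sorted(src.keys() | dst.keys()):
        if k not in src or k not in dst:
            report[k] = ["<missing row>"]
        elif row_hash(src[k], columns) != row_hash(dst[k], columns):
            # Only mismatched rows pay for the column-by-column comparison.
            report[k] = [c for c in columns if src[k][c] != dst[k][c]]
    return report


# Hypothetical data: row 2 has a corrupted city, row 3 never arrived.
source_rows = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "Ben", "city": "Leeds"},
    {"id": 3, "name": "Chen", "city": "Taipei"},
]
sink_rows = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "Ben", "city": "Leads"},
]

print(find_mismatches(source_rows, sink_rows, key="id"))
# {2: ['city'], 3: ['<missing row>']}
```

The point of the two-stage approach is cost: comparing one hash per row is cheap even with thousands of columns, and the expensive per-column comparison runs only on the small set of rows that actually differ.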

Let’s see each step in action…


Scenario

We have migrated data from MySQL to a data lake. The quality of the data needs to be verified before it is consumed by downstream applications.

For demo purposes, I have read sample customer data (1000 records) into a Spark DataFrame. Although the demo uses a small volume of data, the solution can be scaled to very large volumes.
