Quality Assurance Testing is one of the key areas in Big Data
Data quality issues can ruin the success of Big Data, data lake, and ETL projects. Whether the data is big or small, the need for quality doesn’t change: high-quality data is what makes reliable insights possible, and data quality is ultimately measured by whether the data satisfies the business by supporting the insights it needs.
In this blog, we will walk through the steps to verify that data quality is preserved when you migrate data from a source to a destination.
When there is a mismatch between source and sink, how do you pinpoint the specific corrupt records in a dataset that may have 3,000+ columns and millions of rows?
Let’s see each step in action…
We have migrated data from MySQL to a data lake, and its quality needs to be verified before downstream applications consume it.
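Before any checks can run, both sides of the migration need to be available side by side. Here is a minimal sketch, assuming the MySQL source is still reachable over JDBC and the migrated copy is stored as Parquet in the lake; the hostnames, credentials, table name, and lake path are placeholders, not the actual environment:

```python
# Minimal sketch: load both sides of the migration into Spark DataFrames
# for comparison. All connection details below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("migration-data-quality").getOrCreate()

# Source: the original MySQL table, read over JDBC
# (assumes the MySQL Connector/J JAR is on the Spark classpath).
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://source-host:3306/sales")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "customers")
    .option("user", "reader")
    .option("password", "<password>")
    .load()
)

# Sink: the migrated copy in the data lake (assumed to be stored as Parquet).
sink_df = spark.read.parquet("s3://data-lake/curated/customers/")
```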
For demo purposes, I have loaded sample customer data (1,000 records) into a Spark DataFrame. Although the demo uses a small volume of data, the same approach scales to very large datasets.
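One way to answer the earlier question about pinpointing corrupt records is a row-level diff. The sketch below makes a few assumptions of mine (the sample CSV path, the lake path, and the cast-to-string alignment): Spark's exceptAll, run in both directions, returns only the rows that differ between source and sink, and because it is an ordinary distributed DataFrame operation it scales to millions of rows and thousands of columns.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical demo data: the 1,000-record customer sample as extracted
# from the source (CSV) and as migrated into the lake (Parquet).
src = spark.read.csv("data/customers_sample.csv", header=True, inferSchema=True)
snk = spark.read.parquet("s3://data-lake/curated/customers_sample/")

# Compare only the columns both sides share, cast to string so that
# exceptAll sees compatible schemas even if physical types differ.
common_cols = sorted(set(src.columns) & set(snk.columns))
src_cmp = src.select([F.col(c).cast("string").alias(c) for c in common_cols])
snk_cmp = snk.select([F.col(c).cast("string").alias(c) for c in common_cols])

# Rows present on one side but not the other are exactly the mismatches:
# missing, extra, or value-corrupted records.
missing_or_changed = src_cmp.exceptAll(snk_cmp)
extra_or_changed = snk_cmp.exceptAll(src_cmp)

print("Source rows with no exact match in sink:", missing_or_changed.count())
print("Sink rows with no exact match in source:", extra_or_changed.count())
missing_or_changed.show(5, truncate=False)
```

Any record that shows up in either output is a candidate for closer inspection, which narrows millions of rows down to just the mismatched ones.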
#data-engineering #data-quality #validation #big-data #migration