Data Reconciliation is the process of verifying data during a data migration. The target data is compared against the source data to ensure that the migration happened as expected.
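
To make the comparison concrete, here is a minimal sketch in plain Python, not the tool's actual implementation. It assumes every record carries a unique key column; the function name `reconcile_by_key`, the column names, and the sample rows are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def reconcile_by_key(source: List[Row], target: List[Row], key: str) -> Dict[str, list]:
    """Compare target rows against source rows, matched on a unique key column."""
    # Index both sides by key (assumes the key is unique within each dataset).
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}
    return {
        # Keys that existed in the source but never arrived in the target.
        "missing_in_target": sorted(k for k in src if k not in tgt),
        # Keys that appear in the target without a source counterpart.
        "unexpected_in_target": sorted(k for k in tgt if k not in src),
        # Keys present on both sides whose field values differ.
        "mismatched": sorted(k for k in src if k in tgt and src[k] != tgt[k]),
    }


source_rows = [
    {"id": 1, "name": "alice", "amount": 10},
    {"id": 2, "name": "bob", "amount": 20},
    {"id": 3, "name": "carol", "amount": 30},
]
target_rows = [
    {"id": 1, "name": "alice", "amount": 10},
    {"id": 2, "name": "bob", "amount": 25},   # value changed during migration
    {"id": 4, "name": "dave", "amount": 40},  # not present in the source
]

report = reconcile_by_key(source_rows, target_rows, key="id")
print(report)
```

The report separates three failure modes (dropped rows, extra rows, altered values), which is usually more actionable than a single pass/fail flag.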

Need for Data Reconciliation

  • You cannot trust your data without data verification.
  • Comparing record counts and fill rates does not always work: both can match even when individual values differ.
  • Untrustworthy data leads to flawed insights.
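
The point about record counts and fill rates can be illustrated with a small sketch in plain Python. The helper names `fill_rate` and `row_diff` and the sample rows are hypothetical; the row-level check uses a multiset difference, which is one possible approach and not necessarily what the tool does.

```python
from collections import Counter
from typing import List, Optional, Tuple

Row = Tuple[Optional[str], Optional[int]]


def fill_rate(rows: List[Row], col: int) -> float:
    """Fraction of rows whose value at column index `col` is not None."""
    return sum(r[col] is not None for r in rows) / len(rows)


def row_diff(source: List[Row], target: List[Row]) -> Counter:
    """Multiset difference: source rows (with multiplicity) absent from target."""
    return Counter(source) - Counter(target)


source = [("alice", 10), ("bob", None), ("carol", 30)]
target = [("alice", 10), ("bob", None), ("carol", 35)]  # one value corrupted

# Aggregate checks: both pass even though the data differs.
same_count = len(source) == len(target)
same_fill = all(fill_rate(source, c) == fill_rate(target, c) for c in range(2))

# Row-level check: exposes the corrupted record.
lost = row_diff(source, target)

print(same_count, same_fill, dict(lost))
```

Here the counts and per-column fill rates are identical, yet the row-level difference still reports the source row that did not survive intact. For Spark DataFrames, `DataFrame.exceptAll` (available since Spark 2.4) offers comparable multiset semantics.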

Data Reconciler is a data reconciliation tool that checks the accuracy of your data. Before walking through the technical implementation, I would like to show you the tool's output. You can run the code yourself by following the instructions in the next section.

The input dataset has 4 fields and 50 million records, about 1 GB in Parquet format. After performing reconciliation on this dataset, we get the following output.


Data Reconciliation in Spark