Measuring data quality is nothing new. Many data profiling tools on the market help data analysts find gaps in their data and dig into root causes.

With the high importance of data lakes and warehouses and a growing number of activities around data, data quality is something that not only expert users should be aware of. Modern BI and self-service analytics have brought in roles such as data analyst, data scientist, and data engineer. People in these roles are not immersed in data quality details and could use simple metrics to get a quality overview of the datasets they want to use.

How to design a good data quality score?

A score has to look at the data from different angles and cover different dimensions, so the formula is not obvious. Let’s look at the requirements it should fulfill:

1. Simple to understand. A user browsing a catalog with a large number of datasets should quickly get an initial idea of how trustworthy a dataset is, without drilling down into details.

2. Scaling proof. If the score is computed on a smaller but representative sample, it should come out more or less the same.

3. Comparable with other data quality scores. Metrics can differ between datasets, but the score should still give users a high-level comparison, even when the datasets differ greatly in size.

4. Normalized. The highest and lowest possible scores are clearly defined, along with a benchmark, so users can see what to expect and how far a dataset is from perfect.
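As a rough illustration of requirements 2 and 4, the sketch below computes a simple completeness ratio, which is bounded between 0 and 1 by construction, on a full column and on a systematic sample of it. The function name `completeness_score`, the synthetic column, and the every-7th-row sampling are assumptions made for this example, not a prescribed method.

```python
def completeness_score(values):
    """Share of non-missing values; always falls in the range [0.0, 1.0]."""
    if not values:
        raise ValueError("cannot score an empty column")
    filled = sum(1 for v in values if v is not None)
    return filled / len(values)

# Hypothetical column of 10,000 values where every 10th value is missing.
full = [None if i % 10 == 0 else i for i in range(10_000)]

# Deterministic systematic sample: keep every 7th row.
sample = full[::7]

full_score = completeness_score(full)
sample_score = completeness_score(sample)

print(f"full: {full_score:.3f}, sample: {sample_score:.3f}")
```

Because the score is a ratio rather than a count, the two values stay close to each other even though the sample is about seven times smaller, and both can be compared directly with the score of a dataset of any other size.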

Expectations for columns/attributes:

  • marked as mandatory, so a value must be present

  • completed in line with the definition of a valid value

  • completed with values defined in a reference data source

  • relations between data sets that set dependencies or correlations between columns
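The four kinds of expectation listed above could be expressed as small predicate functions. This is only a minimal sketch: the field names (`customer_id`, `email`, `country`, `order_date`, `ship_date`), the `COUNTRY_CODES` reference set, and the email pattern are hypothetical and stand in for whatever definitions a real catalog would hold.

```python
import re
from datetime import date

# Hypothetical reference data and validity pattern, for illustration only.
COUNTRY_CODES = {"DE", "FR", "PL"}
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_completed(value):
    """Mandatory check: the value is present and not blank."""
    return value is not None and str(value).strip() != ""

def is_valid_email(value):
    """Validity check: the value matches the assumed definition of a valid value."""
    return is_completed(value) and EMAIL_PATTERN.match(str(value)) is not None

def is_in_reference(value, reference):
    """Reference check: the value appears in the reference data source."""
    return value in reference

def dates_consistent(record):
    """Relation check: an order cannot ship before it was placed."""
    return record["order_date"] <= record["ship_date"]

def check_record(record):
    """Return one True/False result per expectation type."""
    return {
        "mandatory": is_completed(record["customer_id"]),
        "valid": is_valid_email(record["email"]),
        "reference": is_in_reference(record["country"], COUNTRY_CODES),
        "relation": dates_consistent(record),
    }

good = {"customer_id": "C-1", "email": "ann@example.com", "country": "DE",
        "order_date": date(2024, 3, 1), "ship_date": date(2024, 3, 5)}
bad = {"customer_id": "", "email": "not-an-email", "country": "XX",
       "order_date": date(2024, 3, 5), "ship_date": date(2024, 3, 1)}

print(check_record(good))
print(check_record(bad))
```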

How to define data quality issues: an issue is a report of a data quality problem type on an attribute, a record, or a group of elements. For example, if 15 out of 100 mandatory values are missing, we can say data quality is 85%. Confidence represents the probability that a data quality issue is a real business problem.
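One way to picture such an issue report is as a small record that carries the problem type, the counts, and the confidence. The class name `QualityIssue` and its fields are assumptions made for this sketch, and the confidence value of 0.9 is an arbitrary example; only the 15-out-of-100 arithmetic comes from the text.

```python
from dataclasses import dataclass

@dataclass
class QualityIssue:
    """Hypothetical shape of a data quality issue report."""
    attribute: str       # attribute, record, or group the issue refers to
    problem_type: str    # e.g. "missing mandatory value"
    affected: int        # number of elements that fail the rule
    checked: int         # number of elements the rule was applied to
    confidence: float    # probability that this is a real business problem

    def __post_init__(self):
        if self.checked <= 0 or not 0 <= self.affected <= self.checked:
            raise ValueError("need 0 <= affected <= checked and checked > 0")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be a probability in [0, 1]")

    def quality(self):
        """Share of checked elements that are not affected by the issue."""
        return (self.checked - self.affected) / self.checked

issue = QualityIssue("email", "missing mandatory value",
                     affected=15, checked=100, confidence=0.9)
print(f"{issue.quality():.0%}")  # 15 of 100 missing gives 85%
```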

Data quality for a single attribute in a record (for one cell)

  • it is a true or false value: the cell either fulfills the standard or it does not.

Data quality score for an attribute

  • the score of a given attribute or column, based on the rules set for that attribute.
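The step from cell-level results to an attribute score can be sketched as follows: each cell gets a true or false result from the rules set for its attribute, and the attribute score is the share of cells that pass. The sample rows, the rules, and the plain-average aggregation are assumptions chosen for illustration; other aggregations are possible.

```python
# Hypothetical data set: four records with two attributes.
rows = [
    {"name": "Ann", "age": 34},
    {"name": "",    "age": 29},
    {"name": "Bob", "age": -1},
    {"name": "Eve", "age": None},
]

# Rules per attribute; a cell passes only if every rule holds.
rules = {
    "name": [lambda v: isinstance(v, str) and v.strip() != ""],
    "age": [
        lambda v: v is not None,
        lambda v: isinstance(v, int) and 0 <= v <= 130,
    ],
}

def cell_passes(value, checks):
    """Cell-level quality: True if the value fulfills all rules, else False."""
    return all(check(value) for check in checks)

def attribute_score(rows, attribute, checks):
    """Attribute-level quality: share of cells in the column that pass."""
    results = [cell_passes(row.get(attribute), checks) for row in rows]
    return sum(results) / len(results)

scores = {attr: attribute_score(rows, attr, checks)
          for attr, checks in rules.items()}
print(scores)
```

Here one of four names is blank and two of four ages are invalid, so the name column scores higher than the age column.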

