Data consistency and quality control are among the most troublesome tasks in data engineering. As data engineers, we have all run into data inconsistency issues: matching text strings like "NYC", "new york", and "new york city"; clipping the values of certain columns into a specific range without knowing the business context first; or watching an ML model underperform in production because the input data has drifted in ways we never observed during training. Within an organization, a data catalog can increase data transparency and help us understand these issues. However, that service does not come for free, and it works best with enterprise data sources such as ERP, CRM, or standard RDBMS systems. Unfortunately, those are not the only types of data sources a data engineer has to deal with on a daily basis, especially when we use PySpark for data processing.

What are the alternatives? I found two great projects that can statistically summarize a dataframe: Deequ from AWS and Great Expectations. Both tools can perform rule-based data profiling on a dataframe and generate data validation reports. Deequ is meant to be used mainly in a Scala environment: you define an AnalysisRunner object and add a series of predefined analyzers such as Compliance, Size, Completeness, Uniqueness, and so on. A sample function would look like the sketch below.
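Here is a minimal sketch of that AnalysisRunner pattern, assuming a hypothetical product-review dataset; the column names (review_id, star_rating) and the compliance predicate are illustrative, not tied to any particular source:

```scala
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.{Completeness, Compliance, Size, Uniqueness}
import org.apache.spark.sql.{DataFrame, SparkSession}

def profileReviews(spark: SparkSession, df: DataFrame): DataFrame = {
  // Chain a set of predefined analyzers onto the dataframe and run them in one pass.
  val analysisResult: AnalyzerContext = AnalysisRunner
    .onData(df)
    .addAnalyzer(Size())                              // total row count
    .addAnalyzer(Completeness("review_id"))           // fraction of non-null review_id values
    .addAnalyzer(Uniqueness("review_id"))             // fraction of review_id values that are unique
    .addAnalyzer(Compliance("top star_rating",        // fraction of rows satisfying the predicate
                            "star_rating >= 4.0"))
    .run()

  // Convert the computed metrics into a Spark DataFrame for inspection or persistence.
  successMetricsAsDataFrame(spark, analysisResult)
}
```

The returned metrics dataframe holds one row per analyzer (entity, instance, metric name, and value), which can then be written out or compared against thresholds in a downstream validation step.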

#data-engineering #data-profiling #pyspark #python
