Quality Control Your Next PySpark DataFrame: How to Profile a DataFrame in PySpark
Data consistency and quality control are among the most troublesome tasks in data engineering. As data engineers, we have all run into data inconsistency issues: matching text strings like NYC, new york, and new york city; clipping the values of some columns into a specific range without knowing the business context; or an ML model deployed to production that underperforms because the input data drifted in ways never observed during training. Within an organization, a data catalog can increase data transparency and help us understand these issues. However, such a service does not come for free, and it works best with enterprise data sources like ERP, CRM, or standard RDBMS systems. Unfortunately, those are not the only kinds of data sources a data engineer has to deal with on a daily basis, especially when we use PySpark for data processing.
What are the alternatives? I have found two great projects that can statistically summarize a dataframe: Deequ from AWS and Great Expectations. Both tools can perform rule-based data profiling on dataframes and generate data validation reports. Deequ is meant to be used mainly in a Scala environment. You define an AnalysisRunner object and add a series of predefined analyzers such as Compliance, Size, Completeness, Uniqueness, etc. A sample function would look like this:
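A minimal sketch of such a runner, assuming the Deequ dependency is on the classpath and a hypothetical reviews dataframe with `review_id` and `star_rating` columns (the column names and the `profile` function name are illustrative, not from any particular dataset):

```scala
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.{Size, Completeness, Uniqueness, Compliance}
import org.apache.spark.sql.DataFrame

// df is assumed to be loaded elsewhere, e.g. spark.read.parquet(...)
def profile(df: DataFrame): AnalyzerContext = {
  AnalysisRunner
    .onData(df)
    .addAnalyzer(Size())                    // total row count
    .addAnalyzer(Completeness("review_id")) // fraction of non-null values
    .addAnalyzer(Uniqueness("review_id"))   // fraction of values that are distinct
    .addAnalyzer(Compliance(
      "top rating", "star_rating >= 4.0"))  // fraction of rows satisfying a SQL predicate
    .run()
}
```

The resulting AnalyzerContext can be turned into a metrics dataframe for reporting via `AnalyzerContext.successMetricsAsDataFrame(spark, result)`.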