Quality Control Your Next PySpark Dataframe

How to profile a dataframe in PySpark?

Data consistency and quality control are among the most troublesome tasks in data engineering. As data engineers, we have all experienced data inconsistency issues: matching text strings like "NYC", "new york", and "new york city"; trying to clip the values of some columns into a specific range without knowing the business context; or watching an ML model underperform in production because the input data drifted in ways we never observed during training. In an organization, a data catalog can increase data transparency and help us understand these issues. However, that service does not come for free, and it works best with enterprise data sources like ERP, CRM, or standard RDBMS systems. Unfortunately, those are not the only types of data source a data engineer has to deal with on a daily basis, especially when we use PySpark for data processing.

What are the alternatives? I found two great projects that can statistically summarize a dataframe: Deequ from AWS and Great Expectations. Both tools can perform rule-based data profiling on a dataframe and generate data validation reports. Deequ is meant to be used mainly in a Scala environment. You define an "AnalysisRunner" object and add a series of predefined analyzers such as compliance, size, completeness, uniqueness, etc.
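The original code sample did not survive here, but to illustrate what analyzers such as size, completeness, and uniqueness actually compute, below is a minimal pure-Python sketch (no Spark or Deequ required). The metric definitions are my reading of Deequ's analyzers: completeness as the fraction of non-null values, uniqueness as the fraction of values that occur exactly once.

```python
from collections import Counter

def profile_column(values):
    """Compute Deequ-style summary metrics for one column of data.

    size         -> number of rows
    completeness -> fraction of non-null values
    uniqueness   -> fraction of values that occur exactly once
    """
    size = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "size": size,
        "completeness": len(non_null) / size if size else 0.0,
        "uniqueness": sum(1 for c in counts.values() if c == 1) / size
                      if size else 0.0,
    }

# A column with one null and one duplicated value:
metrics = profile_column(["NYC", "new york", "NYC", None])
print(metrics)  # {'size': 4, 'completeness': 0.75, 'uniqueness': 0.25}
```

In Deequ itself, these metrics are registered on an AnalysisRunner and evaluated in a single pass over the Spark dataframe; the sketch above only mirrors the arithmetic each analyzer reports.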


