This tutorial will show how to bring Pandas libraries to Spark and Dask with Fugue. In this example, we’ll use the Pandera data validation library on Spark.

Data Validation

Data validation means putting checks in place to ensure that data arrives in the format and with the specifications we expect. As data pipelines become more interconnected, the chances of changes unintentionally breaking other pipelines also increase. Validations are used to guarantee that upstream changes will not break the integrity of downstream data operations. Common data validation patterns include checking for NULL values or checking DataFrame shape to ensure transformations don’t drop any records. Other frequently used operations are checking for column existence and schema. Using data validation avoids silent failures of data processes, where everything runs successfully but produces inaccurate results.

Data validation can be placed at the start of the data pipeline to make sure that any transformations happen smoothly, and it can also be placed at the end to make sure everything is working well before output gets committed to the database. This is where a tool like Pandera can be used. For this post, we’ll make a small Pandas DataFrame to show examples. There are three columns: State, City, and Price.


Using Pandera on Spark for Data Validation through Fugue