In software engineering, it is best practice to write tests with language-specific frameworks to reduce errors and improve code quality. In data science and data engineering, however, the quality of the pipelines and models you build does not depend on the written code alone; it depends above all on the data you feed into them. In this article, we will see how to write tests for our input data to avoid unpleasant surprises, so that we can safeguard the correct behavior of the machine learning models or ETL pipelines built on top of this data.

We will write our tests in Python, using the great-expectations package. We will concentrate on Pandas DataFrames, but great-expectations also supports PySpark and other tools.
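To give a first impression of what such a data test looks like, here is a minimal sketch using the classic 0.x great-expectations API (newer "GX" releases expose a different interface); the tiny DataFrame is made up for illustration:

```python
import great_expectations as ge
import pandas as pd

# A toy DataFrame standing in for real input data.
df = pd.DataFrame({
    "confirmed": [10, 25, 0],
    "country": ["DE", "US", None],
})

# Wrap the Pandas DataFrame so expectation methods become available on it.
gdf = ge.from_pandas(df)

# Every expectation returns a result object with a `success` flag.
print(gdf.expect_column_values_to_be_between("confirmed", min_value=0).success)  # True
print(gdf.expect_column_values_to_not_be_null("country").success)               # False
```

Each `expect_*` call both checks the current data and records the expectation, which is what later lets us replay a whole suite of checks against new batches of data.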

Testing an example dataset

For this example, I decided to use the Covid-19 data provided by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
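Before writing any tests, here is a sketch of how one daily batch could be loaded directly from the CSSE GitHub repository; the URL pattern and column names below reflect the repository layout at the time of writing and should be treated as assumptions that may change:

```python
import pandas as pd

# Daily reports live in one CSV per day, named MM-DD-YYYY.csv (assumed layout).
BASE = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/"
)

# One batch == one day of data.
batch = pd.read_csv(BASE + "04-01-2020.csv")
print(batch.columns.tolist())
print(batch[["Country_Region", "Confirmed", "Deaths"]].head())
```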

**Why this dataset?** I chose this dataset because it arrives in batches: there is one .csv file for every day since the beginning of the Covid-19 pandemic, and both the quality of the data and the amount of information it contains have changed over time. We can therefore build our tests on the subset of the data that is available on a specific date, and later apply these tests to new, unseen data to assure that its content is as expected.
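A sketch of that build-then-reapply workflow, under the same assumptions as above (legacy 0.x API, illustrative file dates), might look like this:

```python
import great_expectations as ge

BASE = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/"
)

# Build expectations on a batch we have already inspected ...
known = ge.read_csv(BASE + "04-01-2020.csv")  # returns a ge PandasDataset
known.expect_column_to_exist("Country_Region")
known.expect_column_values_to_be_between("Confirmed", min_value=0)
suite = known.get_expectation_suite()

# ... then replay the very same suite against a new, unseen batch.
unseen = ge.read_csv(BASE + "05-01-2020.csv")
results = unseen.validate(expectation_suite=suite)
print(results.success)  # False if the new day's data violates an expectation
```

Note that `get_expectation_suite()` by default keeps only the expectations that succeeded on the known batch, so the suite reflects properties the data actually had at that point in time.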
