Scraping a table in a PDF reliably, then testing data quality

Introduction

Suppose you need to ingest some data into your data warehouse, and after talking with your stakeholders you learn that the source of this data is a PDF document. Fortunately, this is fairly straightforward with a Python package called tabula-py. In this article I’m going to walk you through how to scrape a table embedded in a PDF file, unit test that data using Great Expectations and then, if it is valid, save the file to S3 on AWS. You can find the full source code for this article on GitHub or a working example on Google Colab.
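
To give a feel for the overall flow, here is a minimal sketch of the three steps: scrape, validate, upload. It is not the code from the full article. The function names, the required-column check, and the bucket and key parameters are all hypothetical, and the validation step uses plain Python checks as a stand-in for Great Expectations, whose API differs between versions. It assumes tabula-py (which needs a Java runtime), pandas and boto3 are installed, and that the table you want is the first one tabula-py detects.

```python
import math


def is_missing(value):
    """True for None or NaN, which is how pandas represents an empty cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def find_problems(records, required_columns):
    """Return a list of human-readable problems; an empty list means valid.

    `records` is a list of dicts, one per table row. This is a hypothetical
    stand-in for a Great Expectations suite, not that library's API.
    """
    if not records:
        return ["table has no rows"]
    problems = []
    for column in required_columns:
        if column not in records[0]:
            problems.append(f"missing column: {column}")
        elif any(is_missing(row.get(column)) for row in records):
            problems.append(f"null values in column: {column}")
    return problems


def scrape_validate_upload(pdf_path, bucket, key, required_columns):
    """Scrape the first table in a PDF, check it, and upload it to S3 as CSV."""
    # Third-party imports live here so the checks above stay dependency-free.
    import boto3
    import tabula

    # read_pdf returns a list of DataFrames, one per detected table.
    tables = tabula.read_pdf(pdf_path, pages="all")
    if not tables:
        raise ValueError(f"no tables found in {pdf_path}")
    df = tables[0]  # assumption: the first table is the one we want

    problems = find_problems(df.to_dict(orient="records"), required_columns)
    if problems:
        # Stop before anything is written to S3.
        raise ValueError("; ".join(problems))

    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=df.to_csv(index=False).encode("utf-8"),
    )
```

Keeping the checks separate from the scraping and upload code means they can be exercised on hand-written rows, without a PDF or AWS credentials. Raising on the first failed validation ensures that bad data never reaches the bucket.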

#data-engineering #aws #python #data-quality

towardsdatascience.com