Scraping a table in a PDF, reliably and then test data quality

Scraping a table in a PDF, reliably and then test data quality

In this article I’m going to walk you through how you can scrape a table embedded in a PDF file, unit test that data using Great Expectations and then if valid, save the file in S3 on AWS.

Introduction

Suppose you need to ingest some data into your data warehouse and after further discussions with your stakeholders the source of this data is a PDF document. Fortunately, this is pretty easy to do using a Python package called tabula-py. In this article I’m going to walk you through how you can scrape a table embedded in a PDF file, unit test that data using Great Expectations and then if valid, save the file in S3 on AWS. You can find the full source code to this article on Github or a working example on Google Colab.

data-engineering aws python data-quality

Bootstrap 5 Complete Course with Examples

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Building a simple Applications with Vue 3

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Basic Data Types in Python | Python Web Development For Beginners

In the programming world, Data types play an important role. Each Variable is stored in different data types and responsible for various functions. Python had two different objects, and They are mutable and immutable objects.

Managing Data as a Data Engineer:  Understanding Data Changes

Understand how data changes in a fast growing company makes working with data challenging. In the last article, we looked at how users view data and the challenges they face while using data.

10 Data Engineering Practices to Ensure Data and Code Quality

10 Data Engineering Practices to Ensure Data and Code Quality. What I learned from working with data at various companies

Managing Data as a Data Engineer — Understanding Users

Understanding how users view data and their pain points when using data. In this article, I would like to share some of the things that I have learnt while managing terabytes of data in a fintech company.

Data Quality Testing Skills Needed For Data Integration Projects

Data Quality Testing Skills Needed For Data Integration Projects. Data integration projects fail for many reasons. Risks can be mitigated when well-trained testers deliver support. Here are some recommended testing skills.