How to extract tables from PDF using Python Pandas and tabula-py

A quick and ready script to extract repetitive tables from PDF

This tutorial improves on my previous post, where I extracted multiple tables without Python pandas. Here I use the same PDF file as in that post, but I manipulate the extracted tables with Python pandas.

The code for this tutorial can be downloaded from my GitHub repository.

Almost all the pages of the analysed PDF file have the following structure:

Image by Author

In the top-right part of the page, there is the name of the Italian region, while in the bottom-right part of the page there is a table.

Image by Author

I want to extract both the region names and the tables from all the pages. To do so, I need a bounding box for each of the two: one for the region name and one for the table. The full procedure to measure margins is illustrated in my previous post, in the section Define margins.

This script implements the following steps:

  • define the bounding box, represented as a list of the form [top, left, bottom, right]. The values are measured in cm and must be converted to PDF points, the unit tabula-py expects. We set the conversion factor fc = 28.28 (the exact value is 72 / 2.54 ≈ 28.35 points per cm).
  • extract data using the read_pdf() function
  • save data to a pandas dataframe.
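
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script from the repository: the helper names box_cm_to_points and extract_tables, and the measurements in table_box_cm, are hypothetical. The area option of tabula-py takes the box as top, left, bottom, right, in points.

```python
# Conversion factor used in this tutorial (72 / 2.54 is about 28.35).
CM_TO_POINTS = 28.28


def box_cm_to_points(box_cm, fc=CM_TO_POINTS):
    """Convert a [top, left, bottom, right] box from centimetres to PDF points."""
    return [round(value * fc, 2) for value in box_cm]


def extract_tables(pdf_path, box_cm, pages="all"):
    """Read the area inside box_cm on each page; returns a list of DataFrames."""
    # Imported lazily because tabula-py needs a Java runtime to be installed.
    import tabula

    area = box_cm_to_points(box_cm)
    return tabula.read_pdf(pdf_path, pages=pages, area=area, multiple_tables=True)


# Hypothetical measurements in cm, not the values measured on the original PDF.
table_box_cm = [10.0, 8.0, 25.0, 19.0]
print(box_cm_to_points(table_box_cm))
```

With multiple_tables=True, read_pdf() returns a list of pandas dataframes, so each extracted table can be manipulated directly with pandas.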

In this example, we scan the PDF twice: first to extract the region names, then to extract the tables. Thus we need to define two bounding boxes.
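
Once both scans are done, the region names and the tables can be combined with pandas. The sketch below assumes that each scan yields one item per page, in the same page order; the function label_tables and the stand-in data are made up for illustration and do not come from the original PDF.

```python
import pandas as pd


def label_tables(region_names, tables):
    """Attach each region name to its table, then stack all tables together."""
    frames = []
    for region, table in zip(region_names, tables):
        labelled = table.copy()       # leave the extracted table untouched
        labelled["region"] = region   # same region name on every row
        frames.append(labelled)
    return pd.concat(frames, ignore_index=True)


# Stand-in data playing the role of the two scans (values are invented).
regions = ["Abruzzo", "Basilicata"]
tables = [pd.DataFrame({"value": [1, 2]}), pd.DataFrame({"value": [3]})]

combined = label_tables(regions, tables)
print(combined)
```

Keeping the two scans separate and joining them afterwards means each bounding box can be tuned on its own.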
