Extracting tabular data from PDFs is hard. An even bigger problem is that a lot of open data is available as PDF files. This open data is crucial for analysis and for gaining vital insights, yet accessing it becomes a challenge. For instance, let’s look at an important report released by the National Agricultural Statistics Service (NASS), which deals with the principal crops planted in the U.S.:

Report Source: https://www.nass.usda.gov/Publications/Todays_Reports/reports/pspl0320.pdf

For any sort of analysis, the starting point would be to get the table with the details and convert it to a format that most of the available tools can ingest. As you can see above, a mere copy-paste doesn’t work in this case. Most of the time, the headers are not in the correct place, some of the numbers are lost in transition, and various other such problems crop up. This makes PDFs somewhat tricky to handle, and there is a reason for that. We’ll go over that, but let’s first try to understand the concept of a PDF file.

