Working With PDFs in Python

Part of Python’s flexibility is that it can work with data in almost any form: JSON, Excel sheets, text files, APIs, and even PDFs.

PDF, or Portable Document Format, is one of the most common document-sharing formats. A single file can contain text, images, tables, and forms. Because so much can be packed into one file, extracting data from a PDF can be tedious.

In this post, I will focus on PyPDF2, a Python library for creating PDFs and extracting text from them.

Extracting text using PyPDF2

We start by importing the PyPDF2 library and opening the PDF file we want to extract text from.

from PyPDF2 import PdfFileReader

# Path to the PDF file to read
pdf_path = 'sample.pdf'
# Create the reader (PyPDF2 3.0 and later call this class PdfReader)
pdf = PdfFileReader(pdf_path)

If you print the “pdf” variable, you will see that it is a PyPDF2 PdfFileReader object.

print(pdf)
[Output]: <PyPDF2.pdf.PdfFileReader at 0x112f3a8d0>

I have imported a sample PDF document with 2 pages. The first page looks like the image below.
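
With a reader in hand, the extraction step itself can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the helper names page_texts and read_pdf_texts are made up for illustration, and it assumes the older PyPDF2 1.x/2.x method names (getNumPages, getPage, extractText) that go with PdfFileReader. On PyPDF2 3.0 and later the equivalents are PdfReader, reader.pages and page.extract_text().

```python
def page_texts(reader):
    """Return the extracted text of every page, in page order.

    `reader` is expected to behave like a PyPDF2 PdfFileReader
    (hypothetical helper, not part of PyPDF2 itself).
    """
    return [reader.getPage(i).extractText() for i in range(reader.getNumPages())]


def read_pdf_texts(pdf_path):
    """Open pdf_path with PyPDF2 and return one string per page."""
    # Imported here so page_texts stays usable without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    # Keep the file open until all pages have been extracted.
    with open(pdf_path, "rb") as handle:
        return page_texts(PdfFileReader(handle))
```

Calling read_pdf_texts('sample.pdf') should give a list with one string per page, so two entries for the sample document described above, and the first entry holds the text of the first page. How clean that text is depends on how the PDF was produced; scanned pages, for instance, may come back empty.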
