PDF Processing with Python

PDF Processing with Python - PDF processing falls under text analytics, and many text analytics libraries and frameworks are written in Python.

Introduction

Being a high-level, interpreted language with a relatively easy syntax, Python is approachable even for those who don’t have prior programming experience. Popular Python libraries integrate well with each other and can handle unstructured data sources such as PDFs, turning them into something more structured and useful.

PDF is one of the most important and widely used digital formats for presenting and exchanging documents. PDFs can contain text, links and buttons, form fields, audio, video, and business logic.

Why Python for PDF processing

PDF processing falls under text analytics, and many text analytics libraries and frameworks are written in Python, which gives it an edge for this kind of work. One more thing: you generally cannot feed a PDF directly into existing machine learning or natural language processing frameworks unless they provide an explicit interface for it. You have to convert the PDF to text first.

Python Libraries for PDF Processing

As a data scientist, you cannot restrict yourself to a single data format. PDFs are a good source of data, and many organizations release their data only as PDFs. As AI grows, we need more data for prediction and classification, so ignoring PDFs as a data source could be a blunder. PDF processing is somewhat difficult, but the libraries below make it easier.

This section covers the most popular Python PDF libraries:

PDFMiner

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.

PyPDF2

PyPDF2 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.

pdfrw

pdfrw is quite similar to the two libraries mentioned above, but it has its own unique selling points (USPs). Which library fits best depends on the use case.

Slate

Slate is a wrapper around PDFMiner. No API is perfect, and PDFMiner has a few shortcomings that Slate addresses.

Setup Environment

Step 1: Select the version of Python to install from [Python.org](https://www.python.org/downloads/ "Python.org").

Step 2: Download the Python executable installer.

Step 3: Run the executable installer.

Step 4: Verify that Python was installed on Windows.

Step 5: Verify that pip was installed.

Step 6: Add the Python path to the environment variables (optional).

Step 7: Install the Python extension for your IDE.

I am working with Python 3.7 in Visual Studio Code. For more information about how to set up your environment and select your Python interpreter in VS Code, check the Getting Started with Python in VS Code documentation. After that, you’ll be able to run Python scripts from your IDE.

Step 8: Install pdfminer.six

pip install pdfminer.six

Step 9: Install PyPDF2

pip install PyPDF2

Done! Now you can start processing PDF documents with Python.
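Before moving on, it can help to confirm that both packages are importable from the interpreter your IDE is using. The following is a minimal sketch using only the standard library. check_packages is a hypothetical helper name, and the sketch assumes the import names are pdfminer (for pdfminer.six) and PyPDF2.

```python
import importlib.util
import sys


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    # Assumption: pdfminer.six installs the module "pdfminer", PyPDF2 installs "PyPDF2"
    for name, found in check_packages(["pdfminer", "PyPDF2"]).items():
        print(name, "OK" if found else "missing (try pip install)")
```

If a package shows up as missing even though pip reported success, the IDE is probably pointing at a different interpreter than the one pip installed into.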

Merging Multiple PDF Documents

If you have a bunch of PDFs, you may want to merge them into a single document. One common use case is a business merging its dailies into a single PDF. I have needed to merge PDFs for work. One project that sticks out in my mind involved scanning documents. Depending on the scanner you have, you might end up scanning a document into multiple PDFs, so being able to join them together again can be wonderful.

When the original pyPdf came out, the only way to merge multiple PDFs was to copy the pages one at a time into a writer object, like this:

    # Merging multiple PDF documents with Python 3.7 and PyPDF2
    # PdfReader/PdfWriter are the current PyPDF2 names; in PyPDF2 1.26 and
    # earlier they were PdfFileReader/PdfFileWriter with getPage/addPage.
    import glob
    from PyPDF2 import PdfReader, PdfWriter


    def merger(output_path, input_paths):
        pdf_writer = PdfWriter()
        for path in input_paths:
            pdf_reader = PdfReader(path)
            # Copy every page of the current input into the writer
            for page in pdf_reader.pages:
                pdf_writer.add_page(page)
        with open(output_path, 'wb') as fh:
            pdf_writer.write(fh)


    if __name__ == '__main__':
        # Retrieve all PDFs whose names start with "I"
        paths = glob.glob('pdf_files_path/I*.pdf')
        paths.sort()
        merger('output_pdf_file_path/pdf_merger.pdf', paths)
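One thing to watch for: paths.sort() orders names as plain strings, so I10.pdf sorts before I2.pdf and the merged pages may come out in an unexpected order. Here is a minimal sketch of a number-aware sort key. natural_key is a hypothetical helper, not part of PyPDF2, and it assumes the intended order is encoded as digits in the file names.

```python
import re


def natural_key(path):
    """Split a path into text and integer chunks so that 'I2' sorts before 'I10'."""
    return [int(chunk) if chunk.isdecimal() else chunk.lower()
            for chunk in re.split(r'(\d+)', path)]


# Hypothetical file names, used only to show the ordering
paths = ['pdf_files_path/I10.pdf', 'pdf_files_path/I2.pdf', 'pdf_files_path/I1.pdf']
paths.sort(key=natural_key)
print(paths)
```

In the merger script, this would mean calling paths.sort(key=natural_key) instead of paths.sort().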

Splitting Merged PDF Document

The PyPDF2 package also lets you split a single PDF into multiple ones. Let’s find out how:

    # Splitting a merged PDF document with Python 3.7 and PyPDF2
    # PdfReader/PdfWriter are the current PyPDF2 names; in PyPDF2 1.26 and
    # earlier they were PdfFileReader/PdfFileWriter with getPage/addPage.
    import os
    from PyPDF2 import PdfReader, PdfWriter


    def pdf_splitter(path):
        fname = os.path.splitext(os.path.basename(path))[0]
        pdf = PdfReader(path)
        # Write each page to its own single-page PDF
        for page_number, page in enumerate(pdf.pages, start=1):
            pdf_writer = PdfWriter()
            pdf_writer.add_page(page)
            output_filename = 'pdf_directory/{}_page_{}.pdf'.format(fname, page_number)
            with open(output_filename, 'wb') as out:
                pdf_writer.write(out)
            print('Created: {}'.format(output_filename))


    if __name__ == '__main__':
        path = "Specify your pdf path here"
        pdf_splitter(path)
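Note that the script writes into pdf_directory/, and open() raises FileNotFoundError if that folder does not exist yet. Here is a small sketch that builds the output path and creates the folder first. page_output_path is a hypothetical helper, and the naming pattern simply mirrors the one used in the script.

```python
import os


def page_output_path(out_dir, pdf_path, page_number):
    """Create out_dir if needed and return '<out_dir>/<name>_page_<n>.pdf'."""
    os.makedirs(out_dir, exist_ok=True)
    fname = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, '{}_page_{}.pdf'.format(fname, page_number))


if __name__ == '__main__':
    # 'reports/merged.pdf' is a made-up input path for illustration
    print(page_output_path('pdf_directory', 'reports/merged.pdf', 1))
```

Inside pdf_splitter, the output file name could then come from page_output_path('pdf_directory', path, n), where n is the one-based page number.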

Extract Text from PDF

The PDFMiner library excels at extracting data and coordinates from a PDF. The package includes the pdf2txt.py command-line script, which you can use in most cases to extract text and images. The command supports many options and is very flexible. Some popular options are shown below.

See the usage information for complete details:

    pdf2txt.py [options] filename.pdf
    Options:
        -o output file name
        -p comma-separated list of page numbers to extract
        -t output format (text/html/xml/tag[for Tagged PDFs])
        -O dirname (triggers extraction of images from PDF into directory)
        -P password

That’s right, you can even use the command to convert PDF to TXT! For example, say you want the TXT version of the first and third pages of your PDF, including images:

    pdf2txt.py -O myoutput -o myoutput/myfile.txt -t text -p 1,3 myfile.pdf
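If you do want to run the same command from a Python script, here is a minimal sketch using the standard subprocess module. It uses only the flags listed above. build_pdf2txt_command and run_pdf2txt are hypothetical helper names, and the sketch assumes pdf2txt.py is on your PATH (on Windows you may need to run it through the Python interpreter instead).

```python
import subprocess


def build_pdf2txt_command(pdf_path, output_path, pages=None,
                          output_type='text', image_dir=None):
    """Build the argument list for pdf2txt.py from the options shown above."""
    cmd = ['pdf2txt.py', '-o', output_path, '-t', output_type]
    if pages:
        # -p expects a comma-separated list of page numbers
        cmd += ['-p', ','.join(str(p) for p in pages)]
    if image_dir:
        # -O triggers extraction of images into the given directory
        cmd += ['-O', image_dir]
    cmd.append(pdf_path)
    return cmd


def run_pdf2txt(*args, **kwargs):
    """Run pdf2txt.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdf2txt_command(*args, **kwargs), check=True)


if __name__ == '__main__':
    print(build_pdf2txt_command('myfile.pdf', 'myoutput/myfile.txt',
                                pages=[1, 3], image_dir='myoutput'))
```

Passing the arguments as a list rather than one shell string avoids quoting problems with file names that contain spaces.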

If you are writing Python code and you don’t want to shell out to the command line with os.system or subprocess, you can use the package as a library.

For example, to extract text from a PDF you need:

    # Extract text from a PDF with Python 3.7 and PDFMiner (pdfminer.six)
    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage


    def pdf_to_text(path):
        manager = PDFResourceManager()
        # StringIO so the function returns str rather than bytes
        retstr = StringIO()
        layout = LAParams(all_texts=True)
        device = TextConverter(manager, retstr, laparams=layout)
        interpreter = PDFPageInterpreter(manager, device)
        # The with block closes the PDF even if a page fails to parse
        with open(path, 'rb') as filepath:
            for page in PDFPage.get_pages(filepath, check_extractable=True):
                interpreter.process_page(page)
        text = retstr.getvalue()
        device.close()
        retstr.close()
        return text


    if __name__ == "__main__":
        text = pdf_to_text("Specify your pdf path here")
        print(text)
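Once you have the text, you will usually want to tidy it before passing it to an NLP or machine learning framework. The sketch below splits the extracted text into pages and collapses runs of whitespace. It assumes the text is a str (decode it first if your extraction function returns bytes) and that pages are separated by a form feed character, which is what pdfminer.six's TextConverter writes at the end of each page as far as I know, so check this against your own output. split_pages and normalize are hypothetical helper names.

```python
import re


def split_pages(text):
    """Split extracted text on form feeds, dropping the empty piece after the last one."""
    pages = text.split('\f')
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def normalize(page):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r'\s+', ' ', page).strip()


if __name__ == '__main__':
    # Made-up sample standing in for the output of a text extraction step
    sample = 'First  page\nline two\n\fSecond page\n\f'
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, normalize(page))
```

Blank pages in the middle of the document are kept on purpose so that the list index still matches the page number.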
