Cesar  Hamill


How to Extract Text from a PDF File using Python

Extract text from PDF File using Python:
Most readers will already be familiar with PDFs, one of the most important and widely used digital document formats. PDF stands for Portable Document Format and uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system.
Extracting Text from PDF File:
The Python package PyPDF2 can be used to achieve what we want (text extraction), although it can do more than we need. The package can also be used to generate, decrypt, and merge PDF files.

To install this package, type the command below in the terminal.
●pip install PyPDF2


Code - https://drive.google.com/drive/folders/1F8mfXKLIQ3dvwqR-54nBV4B-FHDi7wQu?usp=sharing

Let us try to understand the above code in chunks:

●pdfFileObj = open('example.pdf', 'rb')
We open example.pdf in binary mode and save the file object as pdfFileObj.

●pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
Here, we create an object of the PdfFileReader class of the PyPDF2 module, passing it the PDF file object, and get a PDF reader object.

The numPages property gives the number of pages in the PDF file. For example, in our case, it is 2 (see the first line of output).

●pageObj = pdfReader.getPage(0)
Now, we create an object of the PageObject class of the PyPDF2 module. The PDF reader object has a getPage() function, which takes a page number (starting from index 0) as an argument and returns the page object.

The page object has an extractText() function to extract text from the PDF page.

Finally, we close the PDF file object.
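
The linked script is not reproduced here, so the following is only a minimal sketch of how the chunks described in this walkthrough might fit together. The function name extract_first_page_text is hypothetical, and the sketch assumes a PyPDF2 release earlier than 3.0, where PdfFileReader, numPages, getPage() and extractText() are available. Newer releases renamed them to PdfReader, len(reader.pages), reader.pages[0] and extract_text().

```python
def extract_first_page_text(pdf_path="example.pdf"):
    """Print the page count and return the text of the first page.

    Hypothetical helper: it only strings together the calls described
    in the walkthrough and assumes PyPDF2 older than 3.0.
    """
    # Imported inside the function so the sketch can be loaded
    # even when PyPDF2 is not installed.
    import PyPDF2

    # Open the PDF in binary mode; the with-block closes the file object
    # for us once we are done.
    with open(pdf_path, "rb") as pdfFileObj:
        # Build a reader object from the open file object.
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

        # Number of pages in the PDF file.
        print(pdfReader.numPages)

        # Page numbers start from index 0, so this is the first page.
        pageObj = pdfReader.getPage(0)

        # Extract the text of that page while the file is still open.
        text = pageObj.extractText()

    return text
```

Calling print(extract_first_page_text('example.pdf')) would then print the number of pages followed by the text of the first page, provided example.pdf exists in the working directory.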

#python #machine-learning #artificial-intelligence #programming #developer

Ray  Patel


Top 30 Python Tips and Tricks for Beginners

Welcome to my blog. In this article, you are going to learn the top 10 Python tips and tricks.

1) Swap two numbers.

2) Reversing a string in Python.

3) Create a single string from all the elements in a list.

4) Chaining Of Comparison Operators.

5) Print The File Path Of Imported Modules.

6) Return Multiple Values From Functions.

7) Find The Most Frequent Value In A List.

8) Check The Memory Usage Of An Object.
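
The article's own examples are not included in this excerpt, so here is one possible way to write each of the eight tips listed above, using only the standard library. All variable and function names are illustrative.

```python
import os
import sys
from collections import Counter

# 1) Swap two numbers with tuple unpacking.
a, b = 5, 10
a, b = b, a

# 2) Reverse a string with a slice that steps backwards.
reversed_text = "Python"[::-1]

# 3) Create a single string from all the elements in a list.
sentence = " ".join(["tips", "and", "tricks"])

# 4) Chain comparison operators instead of joining them with "and".
n = 10
in_range = 1 < n < 20

# 5) Look up the file path of an imported module.
module_path = os.__file__

# 6) Return multiple values from a function (they come back as a tuple).
def min_max(values):
    return min(values), max(values)

lowest, highest = min_max([3, 1, 4, 1, 5])

# 7) Find the most frequent value in a list.
numbers = [1, 2, 2, 3, 2, 4]
most_common = Counter(numbers).most_common(1)[0][0]

# 8) Check the memory usage of an object, in bytes.
size_in_bytes = sys.getsizeof(numbers)
```

Note that sys.getsizeof() reports the size of the list object itself, not of the elements it refers to.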

#python #python hacks tricks #python learning tips #python programming tricks #python tips #python tips and tricks #python tips and tricks advanced #python tips and tricks for beginners #python tips tricks and techniques #python tutorial #tips and tricks in python #tips to learn python #top 30 python tips and tricks for beginners

Ray  Patel


Lambda, Map, Filter functions in python

Welcome to my blog. In this article, we will learn about Python's lambda, map, and filter functions.

Lambda function in Python: a lambda is a one-line anonymous function. It takes any number of arguments but can have only one expression. The Python lambda syntax is:

Syntax: x = lambda arguments : expression

Now I will show you some Python lambda function examples:
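
The original examples are cut off in this excerpt. As a stand-in, here is a small sketch of lambda used on its own and together with map() and filter(); the names and sample data are illustrative.

```python
# A lambda with one argument and a lambda with two arguments.
square = lambda x: x * x
add = lambda x, y: x + y

numbers = [1, 2, 3, 4, 5, 6]

# map() applies the function to every element of the iterable.
squares = list(map(lambda x: x * x, numbers))

# filter() keeps only the elements for which the function returns True.
evens = list(filter(lambda x: x % 2 == 0, numbers))
```

Both map() and filter() return lazy iterators in Python 3, which is why the results are wrapped in list() here.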

#python #anonymous function python #filter function in python #lambda #lambda python 3 #map python #python filter #python filter lambda #python lambda #python lambda examples #python map

August  Larson


Creating PDF Invoices in Python with pText


The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.

To achieve this, PDF was constructed to be interacted with via something more like a programming language, and relies on a series of instructions and operations to achieve a result. In fact, PDF is based on a scripting language - PostScript, which was the first device-independent Page Description Language.

In this guide, we’ll be using pText - a Python library dedicated to reading, manipulating and generating PDF documents. It offers both a low-level model (giving you access to the exact coordinates and layout if you choose to use them) and a high-level model (where you can delegate the precise calculations of margins, positions, etc. to a layout manager).

We’ll take a look at how to create a PDF invoice in Python using pText.

#python #pdf #creating pdf invoices in python with ptext #creating pdf invoices #pdf invoice #creating pdf invoices in python with ptext

Paula  Hall


How to extract tables from PDF using Python Pandas and tabula-py

A quick and ready script to extract repetitive tables from PDF

This tutorial is an improvement on my previous post, where I extracted multiple tables without Python pandas. In this tutorial, I will use the same PDF file as in my previous post, but this time I manipulate the extracted tables with Python pandas.

The code of this tutorial can be downloaded from my Github repository.

Almost all the pages of the analysed PDF file have the following structure:

Image by Author

In the top-right part of the page, there is the name of the Italian region, while in the bottom-right part of the page there is a table.

Image by Author

I want to extract both the region names and the tables for all the pages, so I need to define a bounding box for each of the two. The full procedure to measure margins is illustrated in my previous post, section Define margins.

This script implements the following steps:

  • define the bounding box, which is represented through a list with the following shape: [top, left, bottom, right]. Values within the bounding box are expressed in cm. They must be converted to PDF points, since tabula-py requires them in this format. We set the conversion factor fc = 28.28.
  • extract data using the read_pdf() function
  • save data to a pandas dataframe.

In this example, we scan the PDF twice: first to extract the region names, then to extract the tables. Thus we need to define two bounding boxes.
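
The steps above could be sketched roughly as follows. This is not the author's script: the helper names cm_to_points and extract_area are hypothetical, and any box values passed in would have to be measured on the actual PDF. The sketch relies on tabula-py's read_pdf() accepting an area argument in the order top, left, bottom, right, and it reuses the conversion factor fc = 28.28 quoted in the text (one centimetre is 72 / 2.54, roughly 28.35 points).

```python
FC = 28.28  # cm-to-points conversion factor quoted in the text


def cm_to_points(box_cm, fc=FC):
    """Convert a bounding box given in cm to PDF points."""
    return [value * fc for value in box_cm]


def extract_area(pdf_path, box_cm, pages="all"):
    """Read the given area of the PDF pages into one pandas DataFrame.

    Hypothetical helper; assumes tabula-py and pandas are installed.
    """
    # Imported here so the conversion helper works without tabula-py.
    import pandas as pd
    import tabula

    # read_pdf() returns a list of DataFrames, one per table found.
    tables = tabula.read_pdf(pdf_path, pages=pages, area=cm_to_points(box_cm))
    if not tables:
        return pd.DataFrame()
    return pd.concat(tables, ignore_index=True)
```

Scanning the PDF twice then amounts to calling extract_area() once with the bounding box of the region name and once with the bounding box of the table.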

#data-collection #tabula-py #data-science #pdf-extraction #python #how to extract tables from pdf using python pandas and tabula-py

Art  Lind


How to Remove all Duplicate Files on your Drive via Python

Today you’re going to learn how to use Python programming in a way that can ultimately save a lot of space on your drive by removing all the duplicates.


In many situations you may find yourself with duplicate files on your disk, but tracking and checking them manually can be tedious.

Here's a solution

Instead of searching through your disk by hand to see if there is a duplicate, you can automate the process by writing a program that recursively walks through the disk and removes all the duplicates it finds. That's what this article is about.

But how do we do it?

If we were to read each whole file and then compare it to the rest of the files recursively through the given directory, it would take a very long time. So how do we do it?

The answer is hashing. With hashing, we can generate a string of letters and numbers that acts as the identity of a given file, and if we find any other file with the same identity, we delete it.

There’s a variety of hashing algorithms out there such as

  • md5
  • sha1
  • sha224, sha256, sha384 and sha512
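
The idea described above can be sketched with the standard hashlib and os modules. This is a minimal illustration, not the article's own program: the function names are hypothetical, SHA-256 is chosen here as one of the listed algorithms, and the first file seen with a given hash is the one that is kept.

```python
import hashlib
import os


def file_hash(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files do not fill memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root):
    """Walk root recursively and return paths whose content was already seen."""
    seen = {}        # digest -> first path found with that digest
    duplicates = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit directories in a predictable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest = file_hash(path)
            if digest in seen:
                duplicates.append(path)
            else:
                seen[digest] = path
    return duplicates


def remove_duplicates(root):
    """Delete every duplicate found under root and return the removed paths."""
    removed = find_duplicates(root)
    for path in removed:
        os.remove(path)
    return removed
```

Because remove_duplicates() deletes files permanently, it is worth calling find_duplicates() first and reviewing the list before removing anything.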

#python-programming #python-tutorials #learn-python #python-project #python3 #python #python-skills #python-tips