Extracting Text from Scanned PDF using Pytesseract & OpenCV

Extracting information from a digital copy of an invoice can be tricky. Various tools on the market can perform this task. However, for a number of reasons, many people prefer to solve the problem with open-source libraries.

I came across a similar problem a few days ago and wanted to share the approach I used to solve it. The libraries I used were pdf2image (for converting PDFs to images), OpenCV (for image pre-processing) and PyTesseract (for OCR), all driven from Python.

Converting PDF to Image

pdf2image is a Python library that converts a PDF into a sequence of PIL Image objects using the pdftoppm utility. It can be installed with pip:

pip install pdf2image

Note: pdf2image relies on Poppler, a PDF rendering library based on the xpdf-3.0 code base, and will not work without it. Refer to the resources below for Poppler download and installation instructions.

https://anaconda.org/conda-forge/poppler

https://stackoverflow.com/questions/18381713/how-to-install-poppler-on-windows
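A quick way to confirm that Poppler is reachable before running pdf2image is to look for its pdftoppm executable on the PATH. The snippet below is a minimal sketch of such a check. The helper name poppler_available is made up for illustration and is not part of pdf2image.

```python
import shutil


def poppler_available(tool="pdftoppm"):
    """Return True if the given Poppler command-line tool is found on PATH.

    `poppler_available` is a hypothetical helper, not a pdf2image function.
    """
    return shutil.which(tool) is not None


if __name__ == "__main__":
    if poppler_available():
        print("pdftoppm found on PATH.")
    else:
        print("pdftoppm not found on PATH.")
```

If Poppler is installed somewhere that is not on the PATH (common on Windows), convert_from_path also accepts a poppler_path argument pointing at the folder that contains the Poppler binaries.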

After installation, any PDF can be converted to images using the code below.

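from pdf2image import convert_from_path

# Path to the input PDF (replace with your own file)
pdf_path = r"provide path to pdf file"

# Render every page as a PIL image at 350 DPI
pages = convert_from_path(pdf_path, dpi=350)

# Save each page as Page_1.jpg, Page_2.jpg, ...
for i, page in enumerate(pages, start=1):
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")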

Convert PDF to Image using Python

After converting the PDF to images, the next step is to mark the regions of each image from which information has to be extracted.

Note: Before marking regions, make sure the image has been preprocessed to improve its quality: the resolution should be at least 300 DPI, skew, sharpness and brightness should be corrected, and thresholding applied where needed.



towardsdatascience.com
