Tesseract OCR is an optical character recognition engine that runs on various operating systems. It is free software released under the Apache License. It was made open source in 2005, and its development has been sponsored by Google since 2006. The engine ships as binaries and a CLI program, and its original API is written in C++. Other programming languages have their own libraries that act as wrappers over either the CLI program or the original Tesseract C++ API. In this article, we discuss the different options for using Tesseract with Python (how to install them and how to call their APIs) and some important problems encountered during their usage.

In Python, the two most popular options are pytesseract and tesserocr. tesserocr is a Python wrapper around the Tesseract C++ API, whereas pytesseract is a wrapper around the tesseract-ocr CLI program. As you might guess, tesserocr gives much more flexibility and control over Tesseract. tesserocr also supports multi-processing, which in practice makes it much faster than pytesseract.
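To make the difference concrete, here is a minimal sketch of what a CLI wrapper in the style of pytesseract does under the hood: it assembles a tesseract command line and runs it as a subprocess. The helper names build_tesseract_command and ocr_via_cli are hypothetical and are not part of pytesseract. The sketch assumes only that the tesseract binary is on the PATH and that the CLI accepts the form "tesseract IMAGE stdout -l LANG".

```python
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=None):
    """Assemble the argument list for one tesseract CLI call.

    Hypothetical helper for illustration; it is not part of pytesseract.
    """
    # Using "stdout" as the output base makes tesseract print the text
    # instead of writing it to a .txt file.
    command = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        # --psm selects the page segmentation mode (Tesseract 4 syntax;
        # Tesseract 3.05 spells this option -psm).
        command += ["--psm", str(psm)]
    return command


def ocr_via_cli(image_path, lang="eng", psm=None):
    """Run tesseract as a subprocess and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode bytes to str (works on Python 3.7)
        check=True,               # raise CalledProcessError on failure
    )
    return result.stdout


if __name__ == "__main__":
    # Only builds the command; no tesseract installation is needed for this.
    print(build_tesseract_command("page.png", psm=6))
```

Each call to ocr_via_cli starts a new process, which is one plausible reason why a wrapper bound directly to the C++ API, such as tesserocr, can be faster when many images are processed.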

Depending on your use case, you might need the legacy 3.05 package or the latest Tesseract version, 4+. For example, Tesseract 4.0 does not provide font information for the recognized text, so you would need the older version to gather that information.

Generic Setup:

To try the examples, you need to install Anaconda for package and environment management.

Once it is installed, create a new environment with the Python version required by the Tesseract version you are using (note: our examples use Python 3.7), and activate the environment every time you run an example:

create environment: conda create -n tesseractEnvironment python=3.7
activate environment: conda activate tesseractEnvironment
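
Because the environment has to be activated before every run, a small guard at the top of an example script can catch a forgotten activation. The following is a minimal sketch: the function name environment_problems is hypothetical, and it assumes that conda exposes the name of the active environment in the CONDA_DEFAULT_ENV environment variable.

```python
import os
import sys


def environment_problems(expected_env="tesseractEnvironment",
                         expected_python=(3, 7),
                         environ=None,
                         version_info=None):
    """Return a list of human-readable problems; an empty list means OK.

    environ and version_info default to the real process values and can
    be overridden, which makes the check easy to try out.
    """
    environ = os.environ if environ is None else environ
    version_info = sys.version_info if version_info is None else version_info
    problems = []

    # Assumption: conda sets CONDA_DEFAULT_ENV for the active environment.
    active_env = environ.get("CONDA_DEFAULT_ENV")
    if active_env != expected_env:
        problems.append(
            "active conda environment is %r, expected %r"
            % (active_env, expected_env)
        )

    # Compare only major.minor, e.g. (3, 7).
    found_python = tuple(version_info[:2])
    if found_python != tuple(expected_python):
        problems.append(
            "Python version is %d.%d, expected %d.%d"
            % (found_python + tuple(expected_python))
        )
    return problems


if __name__ == "__main__":
    for problem in environment_problems():
        print("warning:", problem)
```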

Tesseract

To use any Tesseract Python wrapper, we first need to install tesseract-ocr. To install Tesseract on Debian/Ubuntu:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

Note: the commands above install the latest available version of tesseract-ocr, i.e. Tesseract 4. You have to compile the Tesseract repository manually if you want to use an earlier version such as 3.05, which might be necessary if you want to extract font-related information for the recognized text. If compiling Tesseract 3.05 manually is too complex for you, we will discuss a workaround for completing the setup later on.
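
Since the available features depend on the installed version, it can help to check which Tesseract the system actually provides. The following is a minimal sketch: the function names are hypothetical, and the parser assumes that the first line of the "tesseract --version" output looks like "tesseract 4.1.1", which may differ between builds.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from `tesseract --version` output.

    Returns None if no version number is found.
    """
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        return None
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return (int(major), int(minor), int(patch or 0))


def installed_tesseract_version():
    """Return the version of the tesseract binary on PATH, or None."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams in case the banner goes to stderr
        universal_newlines=True,
    )
    return parse_tesseract_version(result.stdout)


if __name__ == "__main__":
    version = installed_tesseract_version()
    if version is None:
        print("tesseract was not found or its version could not be parsed")
    elif version[0] < 4:
        print("legacy Tesseract detected:", version)
    else:
        print("Tesseract 4+ detected:", version)
```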

