Today, thanks to vast improvements in machine learning, extracting and recognizing characters from images is much simpler than before, helped by well-developed deep learning architectures such as CNNs and LSTMs. Before these methods became available, one had to use template matching, comparing every character image against predefined templates. Template matching requires a cleanly cropped character image, and cropping images to conform was difficult. As a result, finding a good algorithm for cropping characters and preprocessing images to meet those requirements was time-consuming.
Deep learning is one of the most powerful tools for image recognition, and many libraries offer pretrained deep learning models. For instance, YOLO is popular for object detection. But if we want to use YOLO to draw bounding boxes around characters for character recognition, we have to create and train our own model or fine-tune an existing one. In such cases, the most time-consuming parts are collecting the dataset and training the model.
On the other hand, Google has sponsored the development of an open-source OCR (Optical Character Recognition) engine named [Tesseract](https://tesseract-ocr.github.io/), which was originally created at Hewlett-Packard. It has already been trained on more than 400,000 lines of text, spanning about 4,500 fonts for Latin-script languages. It also supports non-Latin scripts such as Japanese and Chinese. Given this extensive training, it is preferable to use Tesseract directly for character recognition rather than train or create a new model. Below, I discuss how to get better results with Tesseract.
Tesseract
Installation instructions are available in the official Tesseract documentation.
Ubuntu users can install it from the terminal with the following commands:
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt update
sudo apt install tesseract-ocr libtesseract-dev
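Once the packages are installed, it can be useful to confirm that the `tesseract` binary is actually reachable before going further. The snippet below is a minimal sketch using only the Python standard library; the helper name `tesseract_version` is my own and not part of any package.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line printed by `tesseract --version`, or None.

    None means the binary was not found on PATH (hypothetical helper,
    written for illustration only).
    """
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Some older releases print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("tesseract was not found on PATH")
    else:
        print(version)
```

If this prints that the binary was not found, the installation step above probably did not complete, or the install location is not on your `PATH`.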
Since Tesseract is a command-line tool, we need to install a Python wrapper package called [PyTesseract](https://github.com/madmaze/pytesseract) so that it can be called from a Python script.
pip install pytesseract
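To show how the two pieces fit together, here is a minimal sketch of calling Tesseract through PyTesseract. It assumes Pillow is installed alongside `pytesseract`, and the file name `sample.png` is a placeholder. The helper names `build_config` and `image_to_text` are my own; `--psm` (page segmentation mode) and `--oem` (OCR engine mode) are Tesseract command-line options that PyTesseract passes through via its `config` argument.

```python
def build_config(psm=6, oem=3):
    """Build the option string handed to the Tesseract command line.

    psm=6 asks Tesseract to assume a single uniform block of text;
    oem=3 selects the default engine. Both are illustrative defaults.
    """
    return f"--oem {oem} --psm {psm}"


def image_to_text(image_path, lang="eng", psm=6):
    """Run OCR on an image file and return the recognized text."""
    # Imported lazily so this module still loads where the packages are missing.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(
            img, lang=lang, config=build_config(psm=psm)
        )


if __name__ == "__main__":
    # "sample.png" is a placeholder path; replace it with your own image.
    print(image_to_text("sample.png"))
```

The page segmentation mode often matters as much as the image itself. For a single line of text, for example, `psm=7` may work better than the block-oriented default used here.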
There are several articles online on how to use Tesseract for OCR. The article “Using Tesseract OCR with Python” covers how to combine Tesseract and PyTesseract with Python. Though it covers the fundamentals of using the package, it does not go into detail on how to preprocess complicated images.
The article “OCR with Python, OpenCV and PyTesseract” gives more detail on how Tesseract works and also provides good examples of different methods for preprocessing images.
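As a rough illustration of what such preprocessing can look like, the sketch below chains three common OpenCV steps: grayscale conversion, a small median blur, and Otsu binarization. This is my own assumed pipeline, not necessarily the one used in the article above. It also assumes `opencv-python` is installed, and `preprocess_for_ocr` is a hypothetical helper name.

```python
def preprocess_for_ocr(image_bgr):
    """Return a black-and-white version of a BGR image for OCR.

    Steps: grayscale -> 3x3 median blur -> Otsu threshold.
    Expects a uint8 array of shape (height, width, 3), as returned by cv2.imread.
    """
    # Imported lazily so this module still loads where OpenCV is missing.
    import cv2

    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # A small median blur removes isolated speckles without smearing strokes much.
    blurred = cv2.medianBlur(gray, 3)
    # With THRESH_OTSU the threshold value (0 here) is ignored and chosen automatically.
    _, binary = cv2.threshold(
        blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    return binary


if __name__ == "__main__":
    import cv2

    # "sample.png" is a placeholder path; cv2.imread returns None if it is missing.
    image = cv2.imread("sample.png")
    if image is None:
        print("could not read sample.png")
    else:
        cv2.imwrite("sample_binary.png", preprocess_for_ocr(image))
```

The resulting array can be passed straight to `pytesseract.image_to_string`, which accepts NumPy arrays as well as Pillow images. Which steps actually help depends heavily on the input, so it is worth comparing OCR output with and without each one.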
Beyond these two, many more detailed articles on the topic are easy to find online. I have listed only these two because, in my view, they cover the essential details.