You visit the nearest Walmart outlet with your family to buy the usual weekly groceries. Bored, you look around and notice a young gentleman walking around with a pen and paper. You watch him jot down item names, prices, and discounts in a well-formatted table. Curious, you walk up to him, only to learn that writing all that stuff down is actually his job!
Most of us might not actually be aware, but yeah, many people note the item name, its price and discount offered on it for a living!
Wouldn’t it be great if we could come up with a single pipeline that could detect all of that with no human intervention?
We might have just the idea! Dig in!
Wait, aren’t we basically trying to detect text? After all, item names, prices, and discounts are all made of digits or letters. Can’t the golden combination of EAST (text detection) and Tesseract (text recognition) work?
With a single image having both the product and its price tag, Tesseract can fail miserably due to too much noise in the image.
Before going through each step in detail, let us briefly understand how we plan to solve this problem. Our input is a complete supermarket image that contains multiple price tags. The task is to extract the price and the text from every price tag.
First, the image is passed through an object detection network to detect all the price tags. The detected price tags then act as input to a second object detection network that locates the price. Once the price is located, we use a pre-trained YOLO-v2 to detect its digits. Alongside this, we can use Tesseract to read the text on the price tag.
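To make the flow concrete, here is a minimal sketch of how the three stages could be chained. It is not our actual code: the names `Box`, `crop`, `digits_to_price` and `read_prices` are made up for illustration, the three detectors are passed in as plain callables standing in for the trained networks, and we assume each stage works on a crop of the previous stage's output. The image is a plain list of pixel rows so the sketch runs without any dependencies.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Placeholder image type: a grayscale image as rows of pixel values.
Image = List[List[int]]


@dataclass
class Box:
    """Axis-aligned detection; label is only used for digit detections."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    label: str = ""


# A detector is any callable that maps an image to a list of boxes.
# In the real pipeline these would wrap the trained networks (assumption).
Detector = Callable[[Image], List[Box]]


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by box out of the image."""
    return [row[box.x_min:box.x_max] for row in image[box.y_min:box.y_max]]


def digits_to_price(digits: Sequence[Box]) -> str:
    """Order digit detections left to right and join their labels."""
    ordered = sorted(digits, key=lambda b: b.x_min)
    return "".join(b.label for b in ordered)


def read_prices(image: Image,
                detect_tags: Detector,
                detect_price: Detector,
                detect_digits: Detector) -> List[str]:
    """Run tag detection, then price detection, then digit detection."""
    prices = []
    for tag_box in detect_tags(image):
        tag = crop(image, tag_box)
        for price_box in detect_price(tag):
            price_region = crop(tag, price_box)
            prices.append(digits_to_price(detect_digits(price_region)))
    return prices
```

Detectors usually return boxes in no particular order, so sorting the digits by their left edge is what turns a set of detections into a readable number. This sketch ignores decimal points and multi-line prices, which would need extra handling.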
You might not find many price tags that are decent enough to be sent to your model. Though we annotated images from this dataset by Yuhang, it was a tiring process and we still didn’t have much variety.
Augmenting data is the way out! Go through this detailed repository to learn how to use imgaug to augment your data.
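One thing to keep in mind when augmenting detection data: whenever a geometric transform moves the pixels, the annotated boxes have to move with them. The sketch below shows the idea for a simple shift, written in plain Python rather than with an augmentation library. The function name `translate_with_boxes` and the `(x_min, y_min, x_max, y_max)` box layout are our own assumptions for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]           # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), assumed layout


def translate_with_boxes(image: Image, boxes: List[Box],
                         dx: int, dy: int,
                         fill: int = 0) -> Tuple[Image, List[Box]]:
    """Shift the image by (dx, dy) pixels and move the boxes along with it."""
    height, width = len(image), len(image[0])

    # Start from a blank canvas and copy every pixel that stays in bounds.
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                shifted[ny][nx] = image[y][x]

    # Apply the same shift to each box, clip it to the image,
    # and drop boxes that end up completely outside.
    new_boxes = []
    for x_min, y_min, x_max, y_max in boxes:
        bx_min, by_min = max(0, x_min + dx), max(0, y_min + dy)
        bx_max, by_max = min(width, x_max + dx), min(height, y_max + dy)
        if bx_min < bx_max and by_min < by_max:
            new_boxes.append((bx_min, by_min, bx_max, by_max))
    return shifted, new_boxes
```

We picked a shift on purpose. A horizontal flip is a popular augmentation, but it would mirror the digits on a price tag, so it is probably not a good fit for this kind of data.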
Sample Input Image
Object detection is so much simpler thanks to the TensorFlow Object Detection API. In a recent blog here, we have discussed how you can go about creating the dataset, setting up the training process, and carrying out predictions.
MobileNet is a lightweight architecture with decent mAP and short prediction time, hence the selection! We chose the Feature Pyramid Network (FPN) variant because it is known to perform well at detecting small objects.