DevOps — Serverless OCR-NLP Pipeline using Amazon EKS, ECS and Docker

How we auto-scaled an Optical Character Recognition pipeline to convert thousands of PDF documents to text per day, using an event-driven microservices architecture built on Docker and Kubernetes

On a recent project we were called in to build a pipeline that converts PDF documents to text. The incoming PDF documents were typically 100 pages long and could contain both typewritten and handwritten text. Users uploaded these documents to an SFTP server. On average there were 30–40 documents per hour, rising to as many as 100 during peak periods. Because their business was growing, the client expressed a need to OCR up to a thousand documents per day. The extracted text was then fed into an NLP pipeline for further analysis.

Let's do a Proof of Concept — Our Findings

Time to convert a 100-page document — 10 minutes

The Python process performing the OCR consumed around 6 GB of RAM and 4 CPUs.

We needed a pipeline that not only keeps up with the regular demand but can also auto-scale during peak periods.
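To make the scaling requirement concrete, here is a rough back-of-the-envelope calculation based on the figures above. It is only a sketch: it assumes each worker processes one document at a time and that arrivals are spread evenly across the hour, neither of which was measured in the proof of concept. The function names are illustrative.

```python
import math

# Figures reported in the proof of concept.
MINUTES_PER_DOCUMENT = 10   # OCR time for a typical 100-page PDF
RAM_GB_PER_WORKER = 6       # memory used by one OCR process
CPUS_PER_WORKER = 4         # CPUs used by one OCR process


def workers_needed(documents_per_hour: int) -> int:
    """Workers required to keep up with a steady arrival rate.

    Assumption (not from the article): one document per worker at a time.
    """
    busy_minutes_per_hour = documents_per_hour * MINUTES_PER_DOCUMENT
    return math.ceil(busy_minutes_per_hour / 60)


def capacity(documents_per_hour: int) -> dict:
    """Total workers, RAM and CPUs needed for a given arrival rate."""
    workers = workers_needed(documents_per_hour)
    return {
        "workers": workers,
        "ram_gb": workers * RAM_GB_PER_WORKER,
        "cpus": workers * CPUS_PER_WORKER,
    }


if __name__ == "__main__":
    for label, rate in [("normal", 40), ("peak", 100)]:
        print(label, capacity(rate))
```

Under these assumptions, 40 documents per hour needs about 7 concurrent workers, while a peak of 100 per hour needs about 17 (roughly 102 GB of RAM and 68 CPUs). That gap is what makes a fixed-size deployment wasteful and auto-scaling attractive.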

Final Implementation

We decided to architect a serverless pipeline using event-driven microservices. The entire process was broken down as follows:

  • Document uploaded in PDF Format — Handled using AWS Transfer for SFTP
  • A new PDF upload triggers an S3 event notification, which invokes a Lambda function
  • The Lambda function adds an OCR event to Kinesis Streams
  • The OCR microservice is triggered and converts the PDF to text using the Tesseract library (one per page). The text output is saved as a JSON document in MongoDB
  • An NLP event is added to Kinesis Streams
  • The NLP microservice reads the JSON from MongoDB. The final NLP results are saved back to MongoDB
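The hand-off from S3 to Kinesis in the steps above might look something like the following Lambda handler. This is a minimal sketch, not the code used on the project: the stream name, the `OCR_STREAM_NAME` environment variable, the event payload fields, and the optional `kinesis` parameter (added here so the handler can be exercised without AWS access) are all assumptions.

```python
import json
import os
from urllib.parse import unquote_plus

# Hypothetical stream name; the article does not name its Kinesis stream.
STREAM_NAME = os.environ.get("OCR_STREAM_NAME", "ocr-events")


def build_ocr_event(s3_record: dict) -> dict:
    """Turn one S3 notification record into a small OCR job description."""
    bucket = s3_record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(s3_record["s3"]["object"]["key"])
    return {"type": "ocr", "bucket": bucket, "key": key}


def handler(event, context, kinesis=None):
    """Lambda entry point: queue one OCR event per uploaded PDF."""
    if kinesis is None:
        import boto3  # imported lazily so the module loads without AWS access
        kinesis = boto3.client("kinesis")

    queued = 0
    for record in event.get("Records", []):
        job = build_ocr_event(record)
        if not job["key"].lower().endswith(".pdf"):
            continue  # ignore non-PDF uploads
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(job).encode("utf-8"),
            PartitionKey=job["key"],  # spreads documents across shards
        )
        queued += 1
    return {"queued": queued}
```

The event carries only the bucket and key rather than the PDF itself, so the OCR microservice would fetch the document from S3 when it picks up the event. The same pattern could plausibly be reused for the NLP event, with the MongoDB document identifier in place of the S3 key.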
