DevOps — Serverless OCR-NLP Pipeline using Amazon EKS, ECS and Docker

How we auto-scaled an Optical Character Recognition (OCR) pipeline to convert thousands of PDF documents into text per day, using an event-driven microservices architecture built on Docker and Kubernetes

On a recent project, we were called in to create a pipeline that converts PDF documents to text. The incoming PDF documents were typically 100 pages long and could contain both typewritten and handwritten text. Users uploaded these documents to an SFTP server. On average there were 30–40 documents per hour, rising to as many as 100 during peak periods. Because their business was growing, the client expressed a need to OCR up to a thousand documents per day. These documents were then fed into an NLP pipeline for further analysis.

Let's do a Proof of Concept — Our Findings

Time to convert a 100-page document — 10 minutes

The Python process performing the OCR consumed around 6 GB of RAM and 4 CPUs.

We needed to come up with a pipeline that not only keeps up with regular demand but can also auto-scale during peak periods.
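To see why auto-scaling matters here, the proof-of-concept figures can be turned into a rough capacity estimate. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes each worker processes one document at a time, that every document takes the full 10 minutes, and that documents arrive evenly across the hour. The function name `workers_needed` is purely illustrative.

```python
import math

# Figures taken from the proof of concept above.
MINUTES_PER_DOC = 10   # time to OCR one 100-page document
RAM_GB_PER_WORKER = 6  # approximate memory used by one OCR process
CPUS_PER_WORKER = 4    # approximate CPUs used by one OCR process


def workers_needed(docs_per_hour, minutes_per_doc=MINUTES_PER_DOC):
    """Concurrent workers required to keep up with a steady arrival rate.

    Assumption: one document per worker at a time, no batching.
    """
    return math.ceil(docs_per_hour * minutes_per_doc / 60)


if __name__ == "__main__":
    for rate in (30, 40, 100):
        n = workers_needed(rate)
        print(
            f"{rate} docs/hour -> {n} workers, "
            f"{n * RAM_GB_PER_WORKER} GB RAM, {n * CPUS_PER_WORKER} CPUs"
        )
```

Under these assumptions, the regular load of 30–40 documents per hour needs 5 to 7 concurrent workers, while a peak of 100 documents per hour needs 17 (roughly 102 GB of RAM and 68 CPUs in total). That gap is the reason a fixed-size deployment would either fall behind at peak or sit mostly idle the rest of the time.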

Final Implementation

We decided to architect a serverless pipeline using event-driven microservices. The entire process was broken down as follows:

  • Document uploaded in PDF format: handled using AWS Transfer for SFTP
  • An S3 event notification fires when a new PDF document is uploaded and triggers a Lambda function
  • The Lambda function adds an OCR event to Kinesis Streams
  • The OCR microservice is triggered: it converts the PDF to text using the Tesseract library (one per page) and saves the text output as a JSON document in MongoDB
  • An NLP event is added to Kinesis Streams
  • The NLP microservice reads the JSON from MongoDB and saves the final NLP results back to MongoDB
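The hand-off from S3 to Kinesis in the second and third steps can be sketched as a small Lambda handler. This is a minimal illustration, not the code used on the project. The stream name `ocr-events`, the event payload fields, and the optional `kinesis_client` parameter (added here so the function can be exercised without AWS access) are all assumptions. The only AWS call used is boto3's Kinesis `put_record`.

```python
import json
from urllib.parse import unquote_plus

STREAM_NAME = "ocr-events"  # hypothetical stream name


def build_ocr_events(s3_event):
    """Turn an S3 event notification into a list of OCR event payloads."""
    events = []
    for record in s3_event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.lower().endswith(".pdf"):
            continue  # ignore anything that is not a PDF
        # The payload shape is an assumption made for illustration.
        events.append({"type": "OCR", "bucket": bucket, "key": key})
    return events


def handler(event, context, kinesis_client=None):
    """Lambda entry point: publish one OCR event per uploaded PDF."""
    if kinesis_client is None:
        import boto3  # available by default in the AWS Lambda Python runtime

        kinesis_client = boto3.client("kinesis")

    ocr_events = build_ocr_events(event)
    for ocr_event in ocr_events:
        kinesis_client.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(ocr_event).encode("utf-8"),
            PartitionKey=ocr_event["key"],
        )
    return {"queued": len(ocr_events)}
```

Using the object key as the partition key is one possible choice: it spreads documents across shards while keeping all events for the same document on the same shard. The OCR microservice would then read these events from the stream and fetch the PDF from the bucket and key they name.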

