Consider a scenario where data objects are continuously ingested into a raw-data bucket. This data is periodically processed and stored in a processed-data bucket. We want a way to track which new objects are still unprocessed and to process only those, so that the same objects are not processed multiple times. Consider the setup of AWS services below as a response to this scenario.

[Figure: Primary setup of services]

The cloud in the figure abstracts the various AWS services (EMR, Lambda, EC2, etc.) that you can use to process data. Squares represent data objects, and their color represents their state, as described in the figure below.

[Figure: legend of data object states by color]
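
To make the scenario concrete, here is a minimal sketch of the most naive way to find unprocessed objects: list the keys in both buckets and take the difference. This is only an illustration of the problem, not necessarily the approach this post settles on. It assumes that a processed object is written to the processed-data bucket under the same key as its raw counterpart, and that `s3_client` is a boto3 S3 client created by the caller. The helper names `list_keys` and `find_unprocessed` are hypothetical.

```python
from typing import Iterable, Iterator, List


def list_keys(s3_client, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key in `bucket`.

    Uses the list_objects_v2 paginator because a single list call
    returns at most 1000 keys.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from the response when nothing matches.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def find_unprocessed(raw_keys: Iterable[str],
                     processed_keys: Iterable[str]) -> List[str]:
    """Return raw keys with no counterpart among the processed keys.

    Assumption: processed objects keep the key of the raw object.
    """
    processed = set(processed_keys)
    return sorted(key for key in set(raw_keys) if key not in processed)


# Usage (requires boto3 and AWS credentials; bucket names as in the text):
#   import boto3
#   s3 = boto3.client("s3")
#   todo = find_unprocessed(list_keys(s3, "raw-data"),
#                           list_keys(s3, "processed-data"))
```

Note that this sketch lists both buckets in full on every run, so its cost grows with the total number of objects rather than with the number of new ones. That is the kind of overhead a dedicated tracking mechanism is meant to avoid.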

#programming #data-engineering #data-science #aws

How to Track Unprocessed Objects in S3