AWS S3 is an industry-leading object storage service. We tend to store lots of data files on S3 and at times need to process them. If the file we are processing is small, we can go with the traditional file-processing flow: fetch the file from S3 and then process it row by row. But the question arises: what if the file is larger, say > 1 GB?
Importing (reading) such a large file in one go leads to an Out of Memory
error, and can even crash the system. There are libraries such as Pandas and Dask that are very good at processing large files, but they still require the file to be present locally, i.e. we would first have to import it from S3 to our local machine. But what if we do not want to fetch and store the whole S3 file locally?
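To make the problem concrete, here is a minimal sketch of that traditional flow using boto3: download the whole object, then iterate over it row by row. The bucket and key names are placeholders, and the row handler is a stand-in that just counts rows.

```python
import csv
import io


def process_rows(lines):
    """Process an iterable of CSV lines row by row; here we only count them."""
    reader = csv.reader(lines)
    return sum(1 for _ in reader)


def process_s3_file(bucket, key):
    # boto3 is imported lazily so the pure helper above works without AWS.
    import boto3

    s3 = boto3.client("s3")
    # The whole object body is pulled down here; once the file grows past
    # available memory, this is the step that fails.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    text = io.TextIOWrapper(body, encoding="utf-8")
    return process_rows(text)
```

The sketch works for small files, but the `get_object` body still represents the entire object, which is exactly the limitation discussed above.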
There are many use-cases where this local processing can impact the overall flow of the system. Also, if we are running these file-processing units in containers, we have limited disk space to work with. Hence, a cloud streaming flow is needed, one which can also parallelize the processing of a file by streaming different chunks of it in parallel threads/processes. This is where I came across the AWS S3 Select feature.
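With S3 Select, the `select_object_content` API runs a SQL expression against a single object server-side and streams only the matching records back. The sketch below assumes a CSV object with a header row; the bucket, key, and default expression are illustrative placeholders.

```python
def build_select_params(bucket, key, expression="SELECT * FROM s3object s"):
    """Assemble keyword arguments for select_object_content (CSV in, CSV out)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # FileHeaderInfo "USE" lets the SQL reference CSV columns by header name.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def select_records(bucket, key, expression="SELECT * FROM s3object s"):
    # boto3 is imported lazily so build_select_params stays testable offline.
    import boto3

    s3 = boto3.client("s3")
    resp = s3.select_object_content(**build_select_params(bucket, key, expression))
    # The response Payload is an event stream; Records events carry raw bytes.
    for event in resp["Payload"]:
        if "Records" in event:
            yield event["Records"]["Payload"]
```

Because the filtering happens inside S3, only the selected bytes travel over the network, and nothing needs to be written to local disk.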
This post focuses on streaming a large file in smaller, manageable chunks (sequentially). The same approach can then be used to parallelize the processing by running the chunks in concurrent threads/processes.
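One way to sketch this chunked streaming is with the `ScanRange` parameter of `select_object_content`, which restricts the scan to a byte window of the object (supported for uncompressed CSV/JSON input). The range arithmetic is a pure helper; the S3 call itself assumes placeholder bucket/key names and a known object size (obtainable via `head_object`).

```python
def chunk_ranges(total_size, chunk_size):
    """Split [0, total_size) into contiguous (start, end) byte windows."""
    return [
        (start, min(start + chunk_size, total_size))
        for start in range(0, total_size, chunk_size)
    ]


def stream_chunks(bucket, key, total_size, chunk_size,
                  expression="SELECT * FROM s3object s"):
    """Yield S3 Select record payloads for one byte window at a time."""
    import boto3  # deferred so chunk_ranges stays testable offline

    s3 = boto3.client("s3")
    for start, end in chunk_ranges(total_size, chunk_size):
        resp = s3.select_object_content(
            Bucket=bucket,
            Key=key,
            ExpressionType="SQL",
            Expression=expression,
            InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
            OutputSerialization={"CSV": {}},
            # S3 handles records that straddle a window boundary, so the
            # windows can simply tile the object.
            ScanRange={"Start": start, "End": end},
        )
        for event in resp["Payload"]:
            if "Records" in event:
                yield event["Records"]["Payload"]
```

Since each `(start, end)` window is independent, the same ranges can be handed to workers in a `ThreadPoolExecutor` to process chunks of the file in parallel instead of sequentially.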
#aws-s3 #s3-select #aws