Stream a large S3 file into manageable chunks without downloading the whole file locally using AWS S3 Select

AWS S3 is an industry-leading object storage service. We tend to store lots of data files on S3 and at times require processing these files. If the size of the file that we are processing is small, we can basically go with traditional file processing flow, wherein we fetch the file from S3 and then process it row by row level. But the question arises, what if the file is size is more viz. > 1GB? πŸ˜“

Importing (reading) a large file leads Out of Memory error. It can also lead to a system crash event. There are libraries viz. PandasDask, etc. which are very good at processing large files but again the file is to be present locally i.e. we will have to import it from S3 to our local machine. But what if we do not want to fetch and store the whole S3 file locally? πŸ€”

πŸ“œ Let’s consider some of the use-cases:

  • We want to process a large CSV S3 file (~2GB) every day. It must be processed within a certain time frame (e.g. in 4 hours)
  • We are required to process large S3 files regularly from the FTP server. New files come in certain time intervals and to be processed sequentially i.e. the old file has to be processed before starting to process the newer files.

These are some very good scenarios where local processing may impact the overall flow of the system. Also, if we are running these file processing units in containers, then we have got limited disk space to work with. Hence, a cloud streaming flow is needed (which can also parallelize the processing of multiple chunks of the same file by streaming different chunks of the same file in parallel threads/processes). This is where I came across the AWS S3 Select feature. 😎

πŸ“ This post focuses on the streaming of a large file into smaller manageable chunks (sequentially). This approach can then be used to parallelize the processing by running in concurrent threads/processes.

#aws-s3 #s3 select #aws

Efficiently Streaming a Large AWS S3 File via S3 Select
1.90 GEEK