Running Serverless Spark Applications with AWS Lambda

Leverage Amazon SageMaker Processing to run serverless Spark applications from AWS Lambda.

Apache Spark, a widely known big data processing framework, needs no introduction. If you are reading this post, you most likely know what you are getting into and, just like me, you are curious whether it is possible to run serverless Spark jobs from an AWS Lambda function.

That also means you are familiar with AWS and serverless services such as AWS Lambda.

That being said, we all know that a little bit of context “never hurt nobody”. So let’s start with Spark!

Spark, as an analytics engine for large-scale data processing, relies on infrastructure and other software dependencies. Lucky for us, we can use cloud services such as AWS to remove the heavy lifting of installing, upgrading, and maintaining Apache Spark and its dependencies. At the same time, managed services such as Amazon EMR and Amazon SageMaker let us avoid configuring and maintaining the underlying infrastructure or operating systems altogether.

The idea for what you are about to learn came while working with Amazon SageMaker Studio. During the development of a recent ML project, I noticed that the ability of SageMaker Processing to run Spark applications was, simply put, a blessing.

I could now easily run preprocessing and post-processing workloads using Spark right from SageMaker Notebooks, without disrupting my ML workflow. Furthermore, everything was serverless and, as mentioned before, I didn’t need to worry about configuring and maintaining the underlying infrastructure or software either.

That last part made me even more curious. With SageMaker Processing available, could I start coupling this feature with other serverless AWS services, say AWS Step Functions or AWS Lambda?

After some research, I found out that AWS had already added official support for SageMaker Processing in Step Functions. But I was not able to find information about using this feature with AWS Lambda, which makes sense given AWS Lambda’s limitations, such as its 15-minute maximum execution time.

Regardless of those limitations, I was too curious to let that stop me, so I went for it and made it work.

In this post, I will present a way to run serverless Spark applications using AWS Lambda and SageMaker Processing.
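
To give a rough idea of what that can look like, here is a minimal sketch of a Lambda handler that starts a Spark job through the SageMaker `CreateProcessingJob` API using boto3. It is an illustration under stated assumptions, not necessarily the approach taken in the rest of this post: the helper `build_processing_request`, the event keys (`image_uri`, `role_arn`, `script_s3_uri`), the instance settings, and the job name prefix are all hypothetical, and the `smspark-submit` entrypoint and `/opt/ml/processing/input/code` path assume the SageMaker Spark processing container.

```python
import time


def build_processing_request(job_name, image_uri, role_arn, script_s3_uri,
                             instance_type="ml.m5.xlarge", instance_count=2):
    """Build keyword arguments for the SageMaker CreateProcessingJob API.

    Hypothetical helper: the defaults here are illustrative assumptions.
    """
    # Assumed location where the container expects the PySpark script.
    code_dir = "/opt/ml/processing/input/code"
    script_name = script_s3_uri.rsplit("/", 1)[-1]
    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        "AppSpecification": {
            "ImageUri": image_uri,
            # Assumes the SageMaker Spark container's submit command.
            "ContainerEntrypoint": ["smspark-submit", f"{code_dir}/{script_name}"],
        },
        "ProcessingInputs": [{
            "InputName": "code",
            "S3Input": {
                "S3Uri": script_s3_uri,
                "LocalPath": code_dir,
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            },
        }],
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": instance_count,
                "InstanceType": instance_type,
                "VolumeSizeInGB": 30,
            }
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def lambda_handler(event, context):
    """Start the processing job and return immediately with its ARN."""
    import boto3  # imported lazily; bundled with the Lambda Python runtime

    request = build_processing_request(
        job_name=f"spark-job-{int(time.time())}",  # hypothetical naming scheme
        image_uri=event["image_uri"],
        role_arn=event["role_arn"],
        script_s3_uri=event["script_s3_uri"],
    )
    response = boto3.client("sagemaker").create_processing_job(**request)
    return {"ProcessingJobArn": response["ProcessingJobArn"]}
```

Because `create_processing_job` only submits the job and does not wait for it to finish, the Lambda function returns within seconds while the Spark application keeps running on SageMaker.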

Get ready to clone repos and look at some code!

#aws #aws-lambda #serverless #spark
