This post gives you a quick walkthrough of AWS Lambda functions and of running Apache Spark on an EMR cluster from a Lambda function. It also explains how to trigger the function from other Amazon services such as S3.

What is AWS Lambda?

AWS Lambda is one of the ingredients in Amazon’s overall serverless computing paradigm, and it allows you to run code without provisioning or managing servers. Serverless computing is a hot trend in the software architecture world. It enables developers to build applications faster by eliminating the need to manage infrastructure. With serverless applications, the cloud service provider automatically provisions, scales, and manages the infrastructure required to run the code.

It abstracts away all the components you would normally have to manage, including servers, platforms, and virtual machines, so that you can focus on writing code.

Since you don’t have to worry about any of those other things, the time to production and deployment is very low. Another great benefit of Lambda functions is that you pay only for the compute time you consume. In other words, you are charged only for the time your code takes to execute.

This is in contrast to traditional models, where you pay for servers, updates, and maintenance.

The AWS Lambda free usage tier includes 1M free requests per month and 400,000 GB-seconds of compute time per month. For pricing details, refer to the AWS documentation: https://aws.amazon.com/lambda/pricing/
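To make the pay-per-use model concrete, here is a small back-of-the-envelope calculation. The 1M-request and 400,000 GB-second figures come from the free tier above; the memory size and duration are made-up example values:

```python
def gb_seconds(memory_mb: float, duration_s: float) -> float:
    """Lambda compute usage for one invocation, in GB-seconds."""
    return (memory_mb / 1024) * duration_s

# Hypothetical function: 512 MB of memory, running 200 ms per invocation.
per_invocation = gb_seconds(512, 0.2)       # 0.5 GB * 0.2 s = 0.1 GB-s

# Invocations covered by the free tier's 400,000 GB-s of compute...
compute_covered = 400_000 / per_invocation  # 4,000,000 invocations

# ...but the free tier also caps requests at 1M per month, so for this
# particular function the request limit is the binding constraint.
free_invocations = min(compute_covered, 1_000_000)
```

For small, short-running functions like this one, you can stay entirely inside the free tier.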

You do need an AWS account to go through the exercise below; if you don’t have one, just head over to https://aws.amazon.com/console/. If you are a student, you can benefit from the no-cost AWS Educate Program. I would suggest you sign up for a new account and get $75 in AWS credits. I won’t walk through every step of the signup process, since it’s pretty self-explanatory.

What is Apache Spark?

Apache Spark is a distributed data processing framework and programming model that helps you do machine learning, stream processing, or graph analytics. It is an open-source, distributed processing system that can quickly perform processing tasks on very large data sets. It is often compared to Apache Hadoop, and specifically to MapReduce, Hadoop’s native data-processing component. The key difference between Spark and MapReduce is that Spark actively caches data in memory and has an optimized execution engine, which results in dramatically faster processing.

To learn more about Apache Spark, you can refer to these links:

https://spark.apache.org/docs/latest/

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html

In this article, I will go through the following:

  1. Creating Lambda functions in Python
  2. Integrating it with other AWS services such as S3
  3. Running a Spark job as a step on an EMR cluster
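As a preview of step 3, a Lambda handler that submits a Spark job to a running EMR cluster can be sketched as below. This is a minimal sketch, not the article’s final code: the cluster ID, bucket name, and script path are hypothetical placeholders, and it assumes the Lambda execution role is allowed to call `elasticmapreduce:AddJobFlowSteps`.

```python
import json

def spark_step(name: str, script_s3_path: str) -> dict:
    """Build an EMR step definition that runs spark-submit on the cluster."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step run an arbitrary command,
            # in this case spark-submit in cluster deploy mode.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_s3_path,
            ],
        },
    }

def lambda_handler(event, context):
    # boto3 is available in the Lambda Python runtime by default.
    import boto3

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
        Steps=[spark_step("my-spark-job",
                          "s3://my-bucket/scripts/job.py")],  # hypothetical path
    )
    return {"statusCode": 200, "body": json.dumps(response["StepIds"])}
```

Keeping the step definition in its own function makes it easy to test the payload without touching AWS at all.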

I assume that you have already set up the AWS CLI on your local system. If not, you can quickly go through this tutorial to set it up: https://cloudacademy.com/blog/how-to-use-aws-cli/

I have tried to run most of the steps through the CLI so that we get to know what’s happening behind the scenes.
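For the S3 integration in step 2 above, the Lambda function receives an event describing the object that triggered it. A minimal sketch of pulling the bucket and key out of that event, following the standard S3 event notification structure (the bucket and key values below are made up):

```python
import urllib.parse

def object_from_s3_event(event: dict) -> tuple:
    """Extract (bucket, key) from an S3 event notification."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded (e.g. spaces become '+').
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    return bucket, key

# Example event shaped like a real S3 notification (values are made up):
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-input-bucket"},
                "object": {"key": "data/input+file.csv"}}}
    ]
}
```

The decoded bucket and key can then be passed straight into the Spark step arguments, so the job processes exactly the file that was uploaded.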


Running Spark Application in the EMR Cluster Through AWS Lambda Function