Migrating Apache Spark workloads from AWS EMR to Kubernetes

Introduction

I will focus on AWS Elastic MapReduce (EMR), since we run our Spark workloads on AWS, and we use Apache Airflow for workflow orchestration.

Data Flow

The data comes from different sources that are spread across different geographic regions and do not necessarily run on the AWS cloud. For example, some of the data sources are web apps running in browsers, others are mobile applications, and some are external data pipelines. Here and here you can see how we implemented our data ingestion steps. All input data is collected in S3 buckets and indexed by creation date in AWS DynamoDB, which allows us to process data batches for any given time interval. We process roughly 2 TB of data per day, with 'special event' days when the volume can be much higher.
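
To make the batching idea concrete, below is a minimal sketch of how such a date index could be queried to assemble the input for one batch. The table name (`ingested_files`) and the attribute names (`ingest_date`, `s3_key`) are hypothetical placeholders, not our actual schema:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical DynamoDB table: partition key "ingest_date" (YYYY-MM-DD),
# one item per ingested S3 object, with its key stored in "s3_key".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingested_files")

def list_batch_keys(ingest_date: str) -> list:
    """Return the S3 keys of all objects ingested on the given date."""
    keys, start_key = [], None
    while True:
        kwargs = {"KeyConditionExpression": Key("ingest_date").eq(ingest_date)}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = table.query(**kwargs)
        keys.extend(item["s3_key"] for item in page["Items"])
        start_key = page.get("LastEvaluatedKey")  # paginate until exhausted
        if not start_key:
            return keys

# The resulting keys can then be handed to a Spark job as its input paths.
```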

Problem Statement

Overall, AWS EMR does a great job. It is a reliable, scalable, and flexible tool for managing Apache Spark clusters. AWS EMR comes with out-of-the-box monitoring in the form of AWS CloudWatch, provides a rich toolbox that includes Zeppelin, Livy, Hue, etc., and has very good security features. But AWS EMR has its downsides as well.

Portability: if you are building a multi-cloud or hybrid (cloud/on-prem) solution, be aware that migrating Spark applications away from AWS EMR can be a big deal. After running on AWS EMR for a while, you can find yourself tightly coupled to AWS-specific features. The coupling can be something simple, like logging and monitoring, or something more complicated, like the auto-scaling mechanism, custom master/worker AMIs, AWS security features, etc.
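
To see what this coupling looks like at the orchestration level, compare how the same Spark job might be submitted from Airflow against EMR and against Kubernetes. This is a hedged sketch, not our production DAG: the cluster ID, S3 path, and manifest file are placeholders, and the imports assume recent Airflow provider packages (`apache-airflow-providers-amazon`, `apache-airflow-providers-cncf-kubernetes`):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)

with DAG("spark_submit_example", start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:

    # EMR: the task is defined in AWS terms (job flow ID, command-runner.jar).
    submit_to_emr = EmrAddStepsOperator(
        task_id="submit_to_emr",
        job_flow_id="j-XXXXXXXXXXXXX",  # placeholder EMR cluster ID
        steps=[{
            "Name": "daily-batch",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-bucket/jobs/daily_batch.py"],  # placeholder
            },
        }],
    )

    # Kubernetes: the same job is described by a SparkApplication manifest
    # with no AWS-specific notions, only an image and resource requests.
    submit_to_k8s = SparkKubernetesOperator(
        task_id="submit_to_k8s",
        namespace="spark",
        application_file="daily_batch_spark_app.yaml",  # placeholder manifest
    )
```

Moving to Kubernetes trades the AWS-specific step definition for a portable manifest, but everything the EMR operator gave you for free (cluster lifecycle, CloudWatch logs) has to be re-provided.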

Cost overhead: the Amazon EMR price is charged in addition to the Amazon EC2 price. Take a look at the pricing example here.
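
As a rough back-of-the-envelope illustration of that overhead, the per-hour prices below are assumptions for a single representative instance type, not quoted AWS rates; check the AWS pricing pages for current numbers:

```python
# Assumed, illustrative per-hour on-demand prices (not quoted AWS rates).
ec2_price_per_hour = 0.192  # raw EC2 rate for one worker instance (assumption)
emr_price_per_hour = 0.048  # EMR surcharge for the same instance (assumption)

workers = 20
hours_per_day = 24

ec2_cost = workers * hours_per_day * ec2_price_per_hour
emr_cost = workers * hours_per_day * emr_price_per_hour

print(f"EC2 only:      ${ec2_cost:,.2f}/day")  # $92.16/day
print(f"EMR surcharge: ${emr_cost:,.2f}/day")  # $23.04/day
```

With these assumed rates, the EMR fee adds 25% on top of the raw compute cost, and it scales with every instance-hour the cluster runs.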
