Getting started with large-scale ETL jobs using Dask and AWS EMR

Dask is an increasingly popular Python library for running large-scale ETL jobs and pipelines across multiple machines. Although somewhat newer than Apache Spark (its best-known competitor), Dask has captured a lot of mindshare in the data science community by virtue of its pandas- and numpy-like API, which makes it familiar and easy to use for Python data practitioners.

In this tutorial, we will walk through setting up a Dask cluster on top of EMR (Elastic MapReduce), AWS’s distributed data platform, which we can interact with and submit jobs to from a JupyterLab notebook running on our local machine. We’ll then run some simple benchmarks on this cluster by performing a basic exploratory data analysis of NYC Open Data’s 2019 Yellow Taxi Trip dataset.

Why EMR?

The Cloud Deployments page in the Dask docs covers your options for deploying Dask on the cloud. At the time of writing, the three options are: Kubernetes, EMR, and an ephemeral option using the “Dask Cloud Provider”.

My personal opinion is that EMR is the easiest way to get up and running with a distributed Dask cluster (if you just want to experiment on a single machine first, you can create a LocalCluster locally, as sketched below). Kubernetes is a complex service with a fairly steep learning curve, so I wouldn’t recommend going that route unless you’re already on a Kubernetes cluster and very familiar with how Kubernetes works.
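If you want a feel for the API before provisioning anything, a single-machine LocalCluster takes only a few lines. This is a minimal sketch; the worker and thread counts are illustrative, not recommendations:

```python
# Minimal single-machine Dask setup for experimentation.
# No AWS resources are involved; everything runs locally.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)  # sizes are illustrative
client = Client(cluster)
print(client.dashboard_link)  # URL of the diagnostic dashboard
```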

Note that it’s also possible to deploy Dask on Google Cloud Dataproc or Azure HDInsight — any service that provides managed YARN will work — but there isn’t any specific documentation on these alternative services at the moment.
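To make the managed-YARN point concrete, here is a hedged sketch using the dask-yarn package, which is what the EMR deployment path relies on as well. The packaged conda environment path and the worker sizing below are placeholders you would substitute with your own values:

```python
# Sketch: connecting Dask to any managed YARN service (EMR, Dataproc,
# HDInsight) via dask-yarn. The environment archive is a conda-pack'd
# environment shipped to the workers; path and sizes are placeholders.
from dask_yarn import YarnCluster
from dask.distributed import Client

cluster = YarnCluster(
    environment="environment.tar.gz",  # placeholder: your packaged environment
    worker_vcores=2,
    worker_memory="4GiB",
)
cluster.scale(4)  # request 4 workers from YARN
client = Client(cluster)
```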

How EMR works

EMR, short for “Elastic MapReduce”, is AWS’s big-data-as-a-service platform. Here’s how it works.

One of AWS’s core offerings is EC2, which provides an API for reserving machines (so-called instances) in the cloud. EC2 offers a wide variety of instance types, ranging from tiny burstable shared-CPU machines (e.g. t2.micro) to beefy (and expensive!) GPU servers (e.g. p3.16xlarge). The first step in launching an EMR cluster is deciding which EC2 instance types to use. For the purposes of this tutorial, I will launch a cluster with one m5.xlarge master node and two m5.xlarge worker nodes (m5.xlarge is AWS’s recommended general-purpose CPU instance type). Note that when running on EMR, one instance is always reserved for the master node; the remainder become the worker pool. A sketch of launching such a cluster programmatically follows.
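As an illustration, the cluster described above could be launched with boto3. This is a sketch, not a complete recipe: the region, key pair name, EMR release label, and the S3 path to a Dask bootstrap script are all placeholder assumptions you would replace with your own values.

```python
# Sketch: launching a 3-node EMR cluster (1 master + 2 workers) with boto3.
# Region, key pair, release label, and bootstrap-script path are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="dask-tutorial",
    ReleaseLabel="emr-5.29.0",  # placeholder: any recent EMR release with YARN
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,  # 1 master + 2 workers
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2KeyName": "my-key-pair",  # placeholder
    },
    BootstrapActions=[{
        "Name": "install-dask",
        # placeholder: a script that installs Dask/dask-yarn on each node
        "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap-dask.sh"},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # cluster ID, e.g. "j-..."
```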
