Apache Spark is arguably the most popular big data processing engine. With more than 25k stars on GitHub, the framework is an excellent starting point to learn parallel computing in distributed systems using Python, Scala and R.

To get started, you can run Apache Spark on your machine by using one of the many great Docker distributions available out there. Jupyter offers an excellent _dockerized_ Apache Spark with a JupyterLab interface, but it misses the framework's distributed core by running everything in a single container. Some GitHub projects offer a distributed cluster experience but lack the JupyterLab interface, undermining the usability provided by the IDE.

I believe a comprehensive environment to learn and practice Apache Spark code must keep its distributed nature while providing an awesome user experience.

This article is all about this belief.

In the next sections, I will show you how to build your own cluster. By the end, you will have a fully functional Apache Spark cluster built with Docker and shipped with a Spark master node, two Spark worker nodes and a JupyterLab interface. It will also include the Apache Spark Python API (PySpark) and a simulated Hadoop distributed file system (HDFS).
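As a preview of the end result, the snippet below sketches what a notebook running inside the JupyterLab container could execute against the cluster. The master URL `spark://spark-master:7077` and the application name are illustrative assumptions; the actual hostnames and ports are defined later when the images are composed.

```python
# A minimal sketch of a PySpark session running inside the JupyterLab container.
# The master URL and application name below are assumptions for illustration;
# the real values are fixed when the cluster is composed later in this tutorial.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-hello-world")       # hypothetical application name
    .master("spark://spark-master:7077")  # assumed standalone master URL
    .getOrCreate()
)

# Distribute a trivial computation across the two worker nodes.
print(spark.sparkContext.parallelize(range(100)).sum())

spark.stop()
```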

TL;DR

This article shows how to build an Apache Spark cluster in standalone mode using Docker as the infrastructure layer. It is shipped with the following:

  • Python 3.7 with PySpark 3.0.0 and Java 8;
  • Apache Spark 3.0.0 with one master and two worker nodes;
  • JupyterLab IDE 2.1.5;
  • Simulated HDFS 2.7.

To make the cluster, we need to create, build and compose the Docker images for JupyterLab and the Spark nodes. You can skip the tutorial by using the **out-of-the-box distribution** hosted on my GitHub.
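For orientation, the compose file for such a cluster could look roughly like the sketch below. The image names, shared volume and port mappings are assumptions here, not the final configuration, which is built step by step in the sections that follow.

```yaml
# A rough sketch of the compose file, not the final version built in this tutorial.
# Image names, the shared volume and port mappings are illustrative assumptions.
version: "3.6"
volumes:
  shared-workspace:            # simulated HDFS shared by all containers
services:
  jupyterlab:
    image: jupyterlab          # assumed image name
    ports:
      - "8888:8888"            # JupyterLab UI
    volumes:
      - shared-workspace:/opt/workspace
  spark-master:
    image: spark-master        # assumed image name
    ports:
      - "8080:8080"            # Spark master UI
    volumes:
      - shared-workspace:/opt/workspace
  spark-worker-1:
    image: spark-worker        # assumed image name
    depends_on:
      - spark-master
    volumes:
      - shared-workspace:/opt/workspace
  spark-worker-2:
    image: spark-worker
    depends_on:
      - spark-master
    volumes:
      - shared-workspace:/opt/workspace
```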

Requirements

  • Docker 1.13.0+;
  • Docker Compose 3.0+.

Table of contents

  1. Cluster overview;
  2. Creating the images;
  3. Building the images;
  4. Composing the cluster;
  5. Creating a PySpark application.
