This is part 4 of a series on data engineering in a big data environment. It reflects my personal journey of lessons learnt and culminates in Flowman, an open source tool I created to take away the burden of reimplementing the same boilerplate code over and over again across projects.

What to expect

This series is about building data pipelines with Apache Spark for batch processing. Last time I presented the core ideas of Flowman, an Apache Spark based application that simplifies the implementation of data pipelines for batch processing. Now it’s time to get Flowman up and running on your local machine.

Prerequisites

In order to follow the instructions to get a working Flowman installation on your machine, you don’t need very much:

  • Required: 64-bit Linux (sorry, no Windows or macOS at this time)
  • Required: Java (OpenJDK is fine)
  • Optional: Maven and npm, if you want to build Flowman from source
  • Recommended: AWS credentials for accessing some test data on S3 (a quick setup sketch follows this list)
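
You can check the Java prerequisite and stage the AWS credentials from a plain shell. How exactly you provide credentials is up to you; exporting the standard AWS environment variables is one common option, since Hadoop's S3A connector picks them up via its default credential provider chain. The key values below are placeholders:

## Verify that a JVM is installed (OpenJDK 8 or 11 both work with Spark 3.0.x)
$ java -version

## Stage AWS credentials as environment variables (placeholder values)
$ export AWS_ACCESS_KEY_ID=<your-access-key>
$ export AWS_SECRET_ACCESS_KEY=<your-secret-key>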

Installing Hadoop & Spark

Although Flowman builds directly upon the power of Apache Spark, it does not provide a working Hadoop or Spark environment, and there is a good reason for that: in many environments (specifically in companies using Hadoop distributions), a Hadoop/Spark environment is already provided by a platform team. Flowman tries its best not to interfere with that setup and instead requires a working Spark installation.

Fortunately, Spark is rather simple to install locally on your machine:

Download & Install Spark

At the time of writing, the latest release of Flowman is 0.14.2, which comes prebuilt for Spark 3.0.1. Spark itself is available for download on its homepage, so we fetch the matching Spark distribution from the Apache archive and unpack it.

## Create a nice playground which doesn't mess up your system
$ mkdir playground
$ cd playground

## Download and unpack Spark & Hadoop
$ curl -L https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz | tar xvzf -
## Create a nice link
$ ln -snf spark-3.0.1-bin-hadoop3.2 spark

The Spark package already bundles Hadoop, so with this single download you have both installed and integrated with each other.
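
As a quick sanity check, you can ask the freshly unpacked Spark for its version banner; it should report version 3.0.1:

## Print the Spark version banner to verify the installation
$ spark/bin/spark-shell --version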
