This is part 4 of a series on data engineering in a big data environment. It reflects my personal journey and the lessons learnt along the way, culminating in Flowman, the open source tool I created to relieve me of the burden of reimplementing the same boilerplate code over and over again across projects.
This series is about building data pipelines with Apache Spark for batch processing. Last time I presented the core ideas of Flowman, an Apache Spark based application that simplifies the implementation of data pipelines for batch processing. Now it’s time to get Flowman up and running on your local machine.
You don't need much to follow these instructions and get a working Flowman installation on your machine.

Although Flowman builds directly on the power of Apache Spark, it does not provide a working Hadoop or Spark environment, and there is a good reason for that: in many environments (specifically at companies using Hadoop distributions), a Hadoop/Spark environment is already provided by a platform team. Flowman tries its best not to interfere with such a setup, and instead relies on an existing Spark installation.
Fortunately, Spark is rather simple to install locally on your machine:
At the time of writing, the latest release of Flowman is 0.14.2, which is built for Spark 3.0.1. Spark itself is available as a prebuilt distribution on the Spark homepage, so we download the appropriate package from the Apache archive and unpack it:
```shell
## Create a nice playground which doesn't mess up your system
$ mkdir playground
$ cd playground

## Download and unpack Spark & Hadoop
$ curl -L https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz | tar xvzf -

## Create a nice symlink
$ ln -snf spark-3.0.1-bin-hadoop3.2 spark
```
The Spark package already contains Hadoop, so with this single download you already have both installed and integrated with each other.
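To check that the unpacked distribution actually works, you can ask Spark for its version. The sketch below also sets the `SPARK_HOME` environment variable pointing at the symlink created above; this is a common way to tell downstream tools where Spark lives (the exact variable Flowman reads is an assumption here, so consult its documentation):

```shell
## Point SPARK_HOME at the unpacked distribution.
## Assumption: you are still inside the "playground" directory created above.
export SPARK_HOME="$(pwd)/spark"

## Verify the installation by printing the Spark version banner.
"$SPARK_HOME/bin/spark-submit" --version
```

If everything is in place, the last command prints the Spark version along with the Scala and Java versions it was built against.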