Big Data Engineering — Flowman up and running

See the open source, Spark based ETL tool called Flowman in action right on your machine.

This is part 4 of a series on data engineering in a big data environment. It reflects my personal journey and the lessons learnt along the way, culminating in Flowman, the open source tool I created to take away the burden of reimplementing the same boilerplate code over and over again across projects.

What to expect

This series is about building data pipelines with Apache Spark for batch processing. Last time I presented the core ideas of Flowman, an Apache Spark based application that simplifies the implementation of data pipelines for batch processing. Now it’s time to get Flowman up and running on your local machine.


You don’t need very much to follow these instructions and get a working Flowman installation on your machine:

  • Required: 64-bit Linux (sorry, no Windows or macOS at this time)
  • Required: Java (OpenJDK is fine)
  • Optional: Maven and npm if you want to build Flowman from sources
  • Recommended: AWS credentials for accessing some test data on S3
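Before continuing, it can be worth double-checking that the required bits are actually in place. A minimal sketch of such a check, assuming a typical Linux shell:

```shell
# Verify we are on 64-bit Linux
uname -sm    # should print something like "Linux x86_64"

# Verify a Java runtime is available (OpenJDK is fine)
if command -v java >/dev/null; then
    java -version
else
    echo "Java not found - please install OpenJDK first"
fi
```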

Installing Hadoop & Spark

Although Flowman directly builds upon the power of Apache Spark, it does not provide a working Hadoop or Spark environment — and there is a good reason for that: in many environments (specifically in companies using Hadoop distributions), a Hadoop/Spark environment is already provided by a platform team. Flowman tries its best not to interfere with that setup and instead relies on an existing Spark installation.

Fortunately, Spark is rather simple to install locally on your machine:

Download & Install Spark

The currently latest release of Flowman is 0.14.2, which is prebuilt for Spark 3.0.1. Spark itself is available prebuilt on the Spark homepage, so we download the appropriate Spark distribution from the Apache archive and unpack it.

## Create a nice playground which doesn't mess up your system
$ mkdir playground
$ cd playground

## Download and unpack Spark & Hadoop
$ curl -L https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz | tar xvzf -
## Create a nice link
$ ln -snf spark-3.0.1-bin-hadoop3.2 spark

The Spark package already contains Hadoop, so with this single download you already have both installed and integrated with each other.
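To make the installation convenient to use, you can point SPARK_HOME at the symlink created above and put Spark’s command line tools on your PATH. This is a sketch assuming the directory layout from the steps above:

```shell
# Point SPARK_HOME at the "spark" symlink created in the playground directory
export SPARK_HOME="$PWD/spark"

# Put the Spark tools (spark-shell, spark-submit, ...) on the PATH
export PATH="$SPARK_HOME/bin:$PATH"

# Sanity check: ask Spark for its version (only works once Spark is unpacked)
if command -v spark-shell >/dev/null; then
    spark-shell --version
fi
```

With SPARK_HOME set, Flowman will later be able to locate this Spark installation.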
