This is part 3 of a series on data engineering in a big data environment. It will reflect my personal journey and the lessons learnt along the way, and will culminate in Flowman, an open source tool I created to take away the burden of reimplementing the same boilerplate code over and over again across projects.

What to expect

This series is about building data pipelines with Apache Spark for batch processing. But some aspects also apply to other frameworks or to stream processing. Eventually I will introduce Flowman, an Apache Spark based application that simplifies the implementation of data pipelines for batch processing.

Functional Requirements

Part 1 of this series already pointed out that there are two types of requirements, which apply to almost all kinds of applications: functional and non-functional requirements. Let us first focus on the former.

Functional requirements describe the actual problem that a solution should solve in the first place. They describe the core functionality, i.e. what should be implemented. That could be a data processing application which integrates multiple data sources and performs an aggregation to provide a simplified data model for data scientists. But the idea of functional requirements equally applies to an Android application for booking a trip to a different city.
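
To make the first example more concrete, here is a minimal sketch of what such a requirement might translate to as a Spark batch job. All names below (the customers and orders sources, their columns, and the output path) are hypothetical placeholders; the point is only the shape of the job: integrate multiple sources, aggregate, and publish a simplified model for downstream consumers.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CustomerRevenueModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("customer-revenue-model")
      .getOrCreate()

    // Integrate two (hypothetical) raw data sources
    val customers = spark.read.parquet("hdfs:///data/raw/customers")
    val orders = spark.read.parquet("hdfs:///data/raw/orders")

    // Join the sources and aggregate into a simplified data model
    val revenuePerCustomer = orders
      .join(customers, "customer_id")
      .groupBy("customer_id", "country")
      .agg(
        sum("amount").as("total_revenue"),
        count("order_id").as("order_count")
      )

    // Publish the result for data scientists to consume
    revenuePerCustomer.write
      .mode("overwrite")
      .parquet("hdfs:///data/model/customer_revenue")

    spark.stop()
  }
}
```

Note how the functional requirement ("provide revenue per customer as a simple table") says nothing about Parquet, HDFS, or Spark; those are solution decisions left to the engineer.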

Functional requirements are always written by business experts acting as proxies of the final end users. They should focus on the problem to be solved, such that the developer or architect can still decide what a viable solution should look like. Of course, the chosen approach should be validated together with the business experts before implementation.

Note that I specifically wrote that functional requirements should focus on the problem (or task) and not on the solution. This small difference is very important: it leaves a larger design space open for the developer to find the best possible solution, as opposed to a specific solution prescribed by the business expert, which may be difficult to implement with the given technology.

