As the industry becomes more data driven, we need solutions that can process the large amounts of data involved. A workflow management system provides an infrastructure for setting up, executing and monitoring a defined sequence of tasks, arranged as a workflow application. Workflow management has become such a common need that most companies have multiple ways of creating and scheduling jobs internally. Apache Airflow is a framework for processing data in a data pipeline. Although Airflow is not a data streaming solution, it works well for data that is fairly stable or slowly changing. It acts as an orchestrator, keeping the processes of a distributed system coordinated. Airflow was originally developed at Airbnb and is written in Python.

Airflow makes it easy to author workflows as Python scripts. In Apache Airflow, a workflow is defined by a Directed Acyclic Graph (DAG) of tasks: a set of tasks to execute, along with the dependencies between them.

For example, to build a sales dashboard for your store, you need to perform the following tasks:

  1. Fetch the sales records
  2. Clean the data / Sort the data according to the profit margins
  3. Push the data to the dashboard

The dependencies between these tasks are straightforward: they must run in a specific order. Task 2 (cleaning the data) cannot start until Task 1 (fetching the data) has completed, and Task 3 (pushing to the dashboard) in turn depends on Task 2.
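In Airflow, this ordering is expressed directly in code. Below is a minimal sketch of such a DAG, assuming Airflow 2.x-style imports; the DAG id, task ids and the bodies of the Python callables are placeholders for the actual fetch, clean and push logic:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_sales_records():
    print("Task 1: fetch the sales records")            # e.g. query a database or API

def clean_and_sort():
    print("Task 2: clean and sort by profit margin")

def push_to_dashboard():
    print("Task 3: push the prepared data to the dashboard")


with DAG(
    dag_id="sales_dashboard",          # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_sales_records", python_callable=fetch_sales_records)
    clean = PythonOperator(task_id="clean_and_sort", python_callable=clean_and_sort)
    push = PythonOperator(task_id="push_to_dashboard", python_callable=push_to_dashboard)

    # Task 2 only runs after Task 1 succeeds; Task 3 only after Task 2
    fetch >> clean >> push
```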

Scheduling of tasks

Apache Airflow allows us to define a schedule interval for each DAG, which determines exactly when Airflow runs your pipeline. This way, you can tell Airflow to execute your DAG using one of the schedule presets:

| Preset  | Meaning                                  | Cron equivalent |
| ------- | ---------------------------------------- | --------------- |
| @hourly | Every hour                               | 0 * * * *       |
| @daily  | Every day                                | 0 0 * * *       |
| @weekly | Every week                               | 0 0 * * 0       |
| None    | Don't schedule; trigger the DAG manually |                 |
| @once   | Schedule once and only once              |                 |

and so on, or even use more complicated schedule intervals based on Cron-like expressions.
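For instance, a DAG can be given a cron expression instead of a preset. A small sketch follows; the DAG id and the chosen schedule are just examples:

```python
from datetime import datetime

from airflow import DAG

# Run at 06:00 on weekdays (Mon-Fri) using a cron expression
# rather than one of the presets listed above.
dag = DAG(
    dag_id="weekday_morning_report",   # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * 1-5",
    catchup=False,
)
```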
