As the industry becomes more data driven, we need solutions that can process the large amounts of data this involves. A workflow management system provides an infrastructure for setting up, running, and monitoring a defined sequence of tasks arranged as a workflow application. Workflow management has become such a common need that most companies have multiple ways of creating and scheduling jobs internally. Apache Airflow is a framework for processing data in a data pipeline. Although Airflow is not a data streaming solution, it is well suited to data that is fairly stable or slowly changing, and it acts as an orchestrator, keeping the processes of a distributed system coordinated. Airflow started as an initiative at Airbnb and is written in Python.
Airflow makes it easy to author workflows as Python scripts. In Apache Airflow, a workflow is defined by a Directed Acyclic Graph (DAG) of tasks: a set of tasks to execute, together with the dependencies between them.
For example, to build a sales dashboard for your store, you need to perform a series of tasks: fetch the sales data, clean it, and then use the cleaned data to populate the dashboard. These tasks depend on one another and must run in a specific order: Task 2 (cleaning the data) won't start until Task 1 (fetching the data) has completed.
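A rough sketch of how this might look in code follows; the DAG id, task ids, and the fetch_sales_data/clean_sales_data callables are illustrative placeholders, not part of the original example:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_sales_data():
    # Placeholder: pull the raw sales data from the store's source system.
    print("fetching sales data")


def clean_sales_data():
    # Placeholder: remove duplicates, fix types, and so on.
    print("cleaning sales data")


with DAG(
    dag_id="sales_dashboard",  # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_sales_data)
    clean = PythonOperator(task_id="clean_data", python_callable=clean_sales_data)

    # The >> operator declares the dependency: Task 2 (cleaning) only runs
    # after Task 1 (fetching) has completed successfully.
    fetch >> clean
```

Airflow parses this file, builds the graph, and makes sure clean_data never starts before fetch_data has finished.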
Apache Airflow allows us to define a schedule interval for each DAG, which determines exactly when Airflow runs your pipeline. This way, you can tell Airflow to execute your DAG on one of several preset intervals:
| Preset  | Meaning                                            | Cron equivalent |
|---------|----------------------------------------------------|-----------------|
| @hourly | Run once every hour                                | 0 * * * *       |
| @daily  | Run once every day                                 | 0 0 * * *       |
| @weekly | Run once every week                                | 0 0 * * 0       |
| None    | Don't schedule; the DAG is only triggered manually |                 |
| @once   | Run once and only once                             |                 |
and so on, or even use more complicated schedule intervals based on Cron-like expressions.
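For illustration (the DAG id and start date below are placeholders), a preset and its cron equivalent can be used interchangeably when instantiating a DAG:

```python
from datetime import datetime

from airflow import DAG

# "0 0 * * 0" is the cron form of the "@weekly" preset:
# run once a week, at midnight on Sunday.
dag = DAG(
    dag_id="sales_dashboard_weekly",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 0 * * 0",  # equivalent to schedule_interval="@weekly"
    catchup=False,
)
```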