Airflow is the de facto ETL orchestration tool in most data engineers' toolboxes. It provides an intuitive web interface over a powerful backend that schedules your ETL workflows and manages the dependencies between them.

In my day-to-day work, I use it to maintain and curate a data lake built on top of AWS S3. Nodes in my Airflow DAGs include multi-node EMR Apache Spark and Fargate clusters that aggregate, prune and produce para-data from the data lake.

Since these workflows run on distributed clusters (20+ nodes) and have heavy dependencies (the output of one ETL is fed in as the input to the next), it made sense to orchestrate them with Airflow. However, it did not make sense to have a central Airflow deployment, since I would be the only one using it.

I therefore chose to Dockerize Airflow so that I could spin up a container and easily run these workflows without having to worry about the Airflow deployment.

In this post I will go over how I achieved this, along with brief explanations of the design decisions I made along the way.

Airflow Components

In Airflow, ETL workflows are defined as directed acyclic graphs (DAGs), where each node is a self-contained ETL step and each downstream node depends on the successful completion of its upstream node.

[Figure: Simple Airflow DAG]
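
A DAG like the one pictured above takes only a few lines of Python to define. The snippet below is a minimal, illustrative sketch (the task names, commands and schedule are placeholders of my own, not the ETLs described in this post); the BashOperator import path shown is the Airflow 2.x one:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 1.x: airflow.operators.bash_operator

# Placeholder tasks -- illustrative only, not the actual ETLs from this post.
with DAG(
    dag_id="simple_etl_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'pull raw data into S3'")
    transform = BashOperator(task_id="transform", bash_command="echo 'run Spark aggregation'")
    load = BashOperator(task_id="load", bash_command="echo 'publish curated output'")

    # Each downstream task runs only after its upstream task succeeds.
    extract >> transform >> load
```

The `>>` operator is how Airflow expresses the edges of the graph: it wires up the dependency chain that the scheduler walks when deciding which task to run next.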

Airflow has three deployment components:

  • Webserver (a Flask backend used to trigger and monitor DAGs)
  • Scheduler (a daemon process that schedules and runs the DAG executors)
  • Database (a persistence layer for DAG definitions and DAG run state)
