Airflow, Airbnb’s brainchild, is an open-source data orchestration tool that lets you programmatically schedule jobs to extract, transform, or load (ETL) data. Because Airflow workflows are written in Python as DAGs (directed acyclic graphs), they allow for complex computation, scalability, and maintainability in a way that cron jobs and other scheduling tools do not. As a data scientist and engineer, data is incredibly important to me. I use Airflow to ensure that all the data I need is processed, cleaned, and available so I can easily run my models.

When I started this journey a year ago, I was scouring the web for resources, but there wasn’t much out there. I wanted a clear path to deploying a scalable version of Airflow, but many of the articles I found were incomplete or described small setups that couldn’t handle the amount of data I wanted to process. My goal was to run upwards of hundreds of thousands of jobs every day, efficiently and reliably, without putting a dent in my wallet. At that point the KubernetesExecutor hadn’t been released to the public yet, but now it has.

I’m writing this article to hopefully spare you the many sleepless nights I spent pondering the existence of Airflow (as well as my own) and cultivating an unhealthy obsession with it. This is how I created a scalable, production-ready Airflow deployment on the latest version (1.10.10) in 10 easy steps.

Prerequisites:

  • Docker. You can use their awesome instructions here, or refer to the PDF I made if you have an Ubuntu system.
  • A Git repository called “dags” to store your workflows; this allows for collaboration.
  • A container repository to store your completed image.
  • A Kubernetes cluster (or minikube) if you want to deploy Airflow in production.
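
A quick way to sanity-check that these prerequisites are in place from your terminal (the last command assumes kubectl is already pointed at your cluster or minikube):

docker --version
git --version
kubectl version --client
kubectl get nodes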

STEP 1: Docker Image

There are a lot of options, including a Helm chart (which uses the puckel image), but I find writing your own Dockerfile to be the most intuitive, direct, and customizable approach. There are a few great images out there (puckel’s is fantastic if you’re starting out), but if you’re planning on using the KubernetesExecutor, I recommend building your own image from a Dockerfile. You can also get the latest version of Airflow (1.10.10) this way.

In your Linux environment, type:

cd ~
git clone https://github.com/spanneerselvam/airflow-image.git
cd airflow-image
ls

Your config directory has all the files you will copy over from your machine to Airflow. Open the Dockerfile.

You should see the following code here (this is just a snippet):

FROM python:3.7
RUN apt-get update && apt-get install -y supervisor
USER root
RUN apt-get update && apt-get install --yes \
sudo \
git \
vim \
cron \
gcc
RUN pip install apache-airflow==1.10.10
RUN cd /usr/local && mkdir airflow && chmod +x airflow && cd airflow
RUN useradd -ms /bin/bash airflow
RUN usermod -a -G sudo airflow
RUN chmod 666 -R /usr/local/airflow
ARG AIRFLOW_USER_HOME=/usr/local/airflow
ENV AIRFLOW_HOME=${AIRFLOW_USER_HOME}
COPY config/airflow.cfg ${AIRFLOW_USER_HOME}/airflow.cfg
EXPOSE 8080
#Python Package Dependencies for Airflow
RUN pip install pyodbc flask-bcrypt pymssql sqlalchemy psycopg2-binary pymysql
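
With the Dockerfile in place, building and pushing the image to your container repository looks something like this; the registry path and tag below are placeholders for your own:

#Build the image from the airflow-image directory (run this next to the Dockerfile)
docker build -t <your-registry>/airflow:1.10.10 .
#Push it to the container repository you set up in the prerequisites
docker push <your-registry>/airflow:1.10.10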

Here’s a fun fact for you (your definition of fun is probably very different from mine): the Docker logo is this really cute whale, literally the cutest logo I’ve ever seen (don’t believe me? Go on, go Google it!), because it overwhelmingly won a logo contest and even beat giraffes! Docker even adopted a whale named Molly Dock. She’s swimming away in the vast Pacific Ocean. Why do I know this? Well, when you’re awake really late deploying Airflow, you end up Googling some strange things…

STEP 2: DAGs

Setting up DAGs in Airflow with the KubernetesExecutor is tricky, and this was the last piece of the puzzle I put together. There are a few options, such as embedding the DAGs in your Docker image, but the issue with that approach is that you have to rebuild your image every time you change your DAG code.

I find that the best solution for collaboration is to use GitHub (or BitBucket) to store your DAGs. Create your own repo (clone mine — more on that later) and then you and your team can push all your work into the repository. Once you’ve done this, you need to mount the DAGs to the pods that run Airflow by using PV and PVC (Persistent Volume and Persistent Volume Claim) with Azure file share (or the EKS equivalent).

PVs and PVCs are Kubernetes resources. Think of them as shared storage that can be attached to every single pod you deploy. To set this up, you need to create an Azure file share (instructions are here). Make sure that you mount the file share to your computer (instructions for mounting can be found here) so the code pulled from your Git repo shows up in the Azure file share; a sample mount command is sketched after the git_sync.sh snippet below. I used a simple cron job that runs a shell script called git_sync.sh every minute to pull code from GitHub.

crontab -e:

* * * * * /home/git_sync.sh

#Mandatory Blank Line

git_sync.sh (Note: my remote name is “DAGs”, not origin). What follows is a minimal sketch; the path and branch are placeholders for wherever the dags repo lives on your mounted file share:

#!/bin/bash
#Change to the dags repo on the mounted file share (placeholder path)
cd /mnt/airflow-dags
#Pull the latest DAG code from the "DAGs" remote (placeholder branch)
git pull DAGs master
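
For reference, mounting the Azure file share itself (the /mnt/airflow-dags path used above) generally looks something like this on a Linux machine; the storage account name, share name, and account key are placeholders, and the exact options are best taken from the Azure docs linked earlier:

#Create a mount point and mount the file share over SMB/CIFS (values in <> are placeholders)
sudo mkdir -p /mnt/airflow-dags
sudo mount -t cifs //<storage-account>.file.core.windows.net/<share-name> /mnt/airflow-dags -o vers=3.0,username=<storage-account>,password=<storage-account-key>,dir_mode=0777,file_mode=0777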

Once you’ve done this (I’d recommend using at least 5 Gi of storage for your Azure File share), you need to deploy the Azure File share in Kubernetes. Follow these steps (they apply for EKS as well):

  1. Create a Kubernetes secret for your Azure file share (see the sample command after this list). Read this guide to securely create your secret here.
  2. Deploy a PVC (see code airflow-pvc.yaml). You only have to do this once.
  3. Deploy a PV (see code airflow-pv.yaml and airflow-pv-k8s.yaml in the repo). You have to do this once for each namespace (in my case, “default” and “k8s-tasks”).
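
Step #1: Creating that secret from your storage account name and key typically looks like the command below; the secret name and values are placeholders, and the name must match the secret referenced in your PV definition:

kubectl create secret generic azure-secret --from-literal=azurestorageaccountname=<storage-account-name> --from-literal=azurestorageaccountkey=<storage-account-key>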

Steps #2–3: To deploy the PV and PVC, run the following commands:

kubectl create -f airflow-pvc.yaml
kubectl get pvc
kubectl create -f airflow-pv.yaml
kubectl create -f airflow-pv-k8s.yaml
kubectl get pv

If the statuses of the PVs and PVC are “Bound”, you’re good to go!

Now that the DAGs are showing up in the Azure file share, you need to adjust the Airflow settings. I store my DAGs in the pod in a folder called “/usr/local/airflow/DAGs”. This folder is mounted from the file share in the master pod, but in order to work correctly it also must be mounted in each worker pod. If you look at airflow.cfg, notice these settings under the [kubernetes] section.

dags_in_image = False #The worker will get the mount location for the dags
#dags_volume_subpath = This line is commented out because the mount folder is the same as the dag folder
dags_volume_claim = airflow-dags #put your claim name here (this must match your airflow-pvc.yaml file)
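
A quick way to confirm the mount is actually working is to list the DAGs folder inside one of the running Airflow pods; the pod name below is a placeholder, so grab the real one from kubectl get pods:

kubectl get pods
kubectl exec -it <airflow-pod-name> -- ls /usr/local/airflow/DAGs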

Writing DAGs: check out this handy-dandy guide I wrote on the art of writing DAGs here. You can also clone this repo and download the two DAGs, template_dag.py and gcp_dag.py, here.

STEP 3: Logging

For every task run, Airflow creates a log that helps the user debug. Here, we are storing the logs in the “/usr/local/airflow/logs” folder. This is such a critical piece. I’ve had so many issues with logging (the dreaded “*** Log file does not exist” comes to mind), and I swear every time I’d get this error, I’d die a little (a lot, actually) inside and get heart palpitations. But not to worry, I’ve got your back! There are two options when it comes to Kubernetes.

  1. Using a PVC (Persistent Volume Claim) on a Kubernetes cluster
  2. Remote logging

Since we’ve already gone through the process of using a PVC, I will show you how to use remote logging with GCP (Google Cloud Platform). Create a bucket using the Google console and then a folder called “logs”. Make sure you open up the permissions on your bucket; you can check them out here. You also need to create a Service Account, for which instructions can be found here as well (thank goodness GCP has AMAZING instructions!). Download the JSON authentication file and copy its contents into the “airflow-image/config/gcp.json” file.
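
If you prefer the command line to the Google console, the same setup can be sketched with gcloud and gsutil; the bucket, service account, and project names below are placeholders, and the console route described above works just as well:

#Create the log bucket and a service account, then download its JSON key
gsutil mb gs://<your-log-bucket>
gcloud iam service-accounts create airflow-logs
gcloud iam service-accounts keys create gcp.json --iam-account=airflow-logs@<your-project>.iam.gserviceaccount.com
#Let the service account read and write objects in the bucket
gsutil iam ch serviceAccount:airflow-logs@<your-project>.iam.gserviceaccount.com:objectAdmin gs://<your-log-bucket>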

You need to add the bucket and folder details to the airflow.cfg and to the Dockerfile at line 72. Once you do that, you’re golden (kind of). I’ve created a log connection ID with the name “AirflowGCPKey”. This ID is associated with the sensitive details of the GCP connection. You can create this ID in the UI, or, what I like to do personally, run gcp_dag.py, which creates the connection automatically (the code for this is here).

RUN pip install apache-airflow[gcp] apache-airflow[gcp-api]
RUN echo "deb http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
RUN apt-get install gnupg -y
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
RUN apt-get update && apt-get install google-cloud-sdk -y
RUN gcloud auth activate-service-account <insert your service account> --key-file=/usr/local/airflow/gcp.json --project=<your project name>
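
One more option for the “AirflowGCPKey” connection: Airflow 1.10’s CLI can create it directly instead of the UI or the gcp_dag.py route. The sketch below assumes the key file path from this article and a placeholder project, so double-check the values against your setup:

airflow connections --add --conn_id AirflowGCPKey --conn_type google_cloud_platform --conn_extra '{"extra__google_cloud_platform__key_path": "/usr/local/airflow/gcp.json", "extra__google_cloud_platform__project": "<your-project>", "extra__google_cloud_platform__scope": "https://www.googleapis.com/auth/cloud-platform"}'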

