In Apache Airflow we can have very complex DAGs with several tasks and dependencies between those tasks.
But what if we have cross-DAG dependencies and want to build a DAG of DAGs? Normally, we would put all tasks that depend on each other in the same DAG. But sometimes you cannot modify the DAGs, and you still want to add dependencies between them.
For that, we can use the ExternalTaskSensor.
This sensor looks up past executions of DAGs and tasks, and matches those runs that share the same execution_date as our DAG. However, the name execution_date can be misleading: it is not a date, but an instant. So DAGs that depend on each other need to run at the same instant, or one after the other separated by a constant interval of time. In summary, we need alignment in the execution dates and times.
Let’s see an example. We have two upstream DAGs, and we want to run another DAG after the first two DAGs have successfully finished.
This is the first DAG. It has just two simple tasks: a PythonOperator that writes a greeting to the log, and a DummyOperator.
"""Simple dag #1."""
from airflow import models
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators import python_operator
from airflow.utils.dates import days_ago

with models.DAG(
        'dag_1',
        schedule_interval='*/1 * * * *',  # Every 1 minute
        start_date=days_ago(0),
        catchup=False) as dag:

    def greeting():
        """Just check that the DAG is started in the log."""
        import logging
        logging.info('Hello World from DAG 1')

    hello_python = python_operator.PythonOperator(
        task_id='hello',
        python_callable=greeting)

    goodbye_dummy = DummyOperator(task_id='goodbye')

    hello_python >> goodbye_dummy
The second upstream DAG is very similar to this one, so I don't show the code here, but you can have a look at it on GitHub.
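For orientation, this is the likely shape of the second DAG, assuming it simply mirrors DAG 1 with a different DAG id and log message (a sketch only; the real code is in the linked repository):

```python
"""Simple dag #2 (sketch mirroring dag_1; see the repository for the actual code)."""
from airflow import models
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators import python_operator
from airflow.utils.dates import days_ago

with models.DAG(
        'dag_2',
        schedule_interval='*/1 * * * *',  # Same schedule as dag_1
        start_date=days_ago(0),           # Same start date as dag_1
        catchup=False) as dag:

    def greeting():
        """Just check that the DAG is started in the log."""
        import logging
        logging.info('Hello World from DAG 2')

    hello_python = python_operator.PythonOperator(
        task_id='hello',
        python_callable=greeting)

    goodbye_dummy = DummyOperator(task_id='goodbye')

    hello_python >> goodbye_dummy
```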
The important aspect is that both DAGs have the same schedule_interval and start_date (see the corresponding lines in DAG 1 and DAG 2). Notice that the DAGs run every minute. That is only for the sake of this demo. In a real setting, that would be a very high frequency, so beware if you copy-paste some code for your own DAGs.
The downstream DAG will be executed when both upstream DAGs succeed. This is the code of the downstream DAG:
"""Trigger Dags #1 and #2 and do something if they succeed."""
from airflow import DAG
from airflow.operators.sensors import ExternalTaskSensor
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

with DAG(
        'master_dag',
        schedule_interval='*/1 * * * *',  # Every 1 minute
        start_date=days_ago(0),
        catchup=False) as dag:

    def greeting():
        """Just check that the DAG is started in the log."""
        import logging
        logging.info('Hello World from DAG MASTER')

    externalsensor1 = ExternalTaskSensor(
        task_id='dag_1_completed_status',
        external_dag_id='dag_1',
        external_task_id=None,  # wait for whole DAG to complete
        check_existence=True,
        timeout=120)

    externalsensor2 = ExternalTaskSensor(
        task_id='dag_2_completed_status',
        external_dag_id='dag_2',
        external_task_id=None,  # wait for whole DAG to complete
        check_existence=True,
        timeout=120)

    goodbye_dummy = DummyOperator(task_id='goodbye_master')

    [externalsensor1, externalsensor2] >> goodbye_dummy
Some important points to notice: the schedule and start date are the same as in the upstream DAGs. This is crucial for this DAG to respond to the upstream DAGs, that is, to create a dependency between the runs of the upstream DAGs and the runs of this DAG.
And what if the execution dates don't match but you still want to add a dependency? If the execution dates differ by a constant amount of time, you can use the execution_delta parameter of ExternalTaskSensor.
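To see what execution_delta does, note that the sensor subtracts it from the downstream run's execution_date to find the upstream run it should wait for. A minimal sketch of that arithmetic, using made-up dates for illustration (an upstream DAG scheduled at minute 0 and a downstream DAG at minute 5 of each hour):

```python
from datetime import datetime, timedelta

# Hypothetical execution_date of a downstream run scheduled at minute 5.
downstream_execution_date = datetime(2021, 1, 1, 10, 5)

# ExternalTaskSensor subtracts execution_delta from the downstream
# execution_date to locate the upstream run to wait for.
execution_delta = timedelta(minutes=5)
upstream_execution_date = downstream_execution_date - execution_delta

print(upstream_execution_date)  # 2021-01-01 10:00:00
```

With that alignment, you would add `execution_delta=timedelta(minutes=5)` to the ExternalTaskSensor arguments shown above so the sensor matches the upstream run at minute 0 instead of looking for one at minute 5.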
#apache-airflow #python