In Apache Airflow we can have very complex DAGs, with several tasks and dependencies between them.

But what if we have cross-DAG dependencies, and we want to make a DAG of DAGs? Normally, we would try to put all tasks that have dependencies in the same DAG. But sometimes you cannot modify the DAGs, and you may still want to add dependencies between them.

For that, we can use the ExternalTaskSensor.

This sensor looks up past executions of DAGs and tasks, and matches those runs that share the same execution_date as our DAG run. However, the name execution_date might be misleading: it is not just a date, but an instant (a date and a time). So cross-dependent DAGs need to run at the same instant, or one after the other offset by a constant amount of time. In summary, we need alignment in the execution dates and times.

Let’s see an example. We have two upstream DAGs, and we want to run a third DAG after both upstream DAGs have finished successfully.

This is the first DAG. It has only two trivial tasks: a PythonOperator that writes a greeting to the log, and a DummyOperator.

"""Simple dag #1."""
	from airflow import models
	from airflow.operators.dummy_operator import DummyOperator
	from airflow.operators import python_operator
	from airflow.utils.dates import days_ago

	with models.DAG(
	        'dag_1',
	        schedule_interval='*/1 * * * *',  # Every 1 minute
	        start_date=days_ago(0),
	        catchup=False) as dag:
	    def greeting():
	        """Just check that the DAG is started in the log."""
	        import logging
	        logging.info('Hello World from DAG 1')

	    hello_python = python_operator.PythonOperator(
	        task_id='hello',
	        python_callable=greeting)

	    goodbye_dummy = DummyOperator(task_id='goodbye')

	    hello_python >> goodbye_dummy

The second upstream DAG is very similar to this one, so I don’t show the full code here, but you can have a look at it on GitHub; a minimal sketch follows below.
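Here is that sketch of the second DAG, assuming it mirrors dag_1 with only the DAG id and the logged message changed (the version on GitHub is the authoritative one):

	"""Simple dag #2."""
	from airflow import models
	from airflow.operators.dummy_operator import DummyOperator
	from airflow.operators import python_operator
	from airflow.utils.dates import days_ago

	with models.DAG(
	        'dag_2',
	        schedule_interval='*/1 * * * *',  # Every 1 minute
	        start_date=days_ago(0),
	        catchup=False) as dag:
	    def greeting():
	        """Just check that the DAG is started in the log."""
	        import logging
	        logging.info('Hello World from DAG 2')

	    hello_python = python_operator.PythonOperator(
	        task_id='hello',
	        python_callable=greeting)

	    goodbye_dummy = DummyOperator(task_id='goodbye')

	    hello_python >> goodbye_dummy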

The important aspect is that both DAGs have the same schedule and start dates (see the corresponding lines in DAG 1 and in DAG 2). Notice that the DAGs run every minute. That’s only for the sake of this demo; in a real setting that would be a very high frequency, so beware if you copy-paste this code for your own DAGs.
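For instance, a more realistic variant would run daily. The key point is to change the schedule in all the DAGs at once, so that their execution dates keep matching (a hypothetical variant, not the code of this demo):

	with models.DAG(
	        'dag_1',
	        schedule_interval='@daily',  # Once a day instead of every minute
	        start_date=days_ago(1),
	        catchup=False) as dag:
	    ...  # tasks unchanged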

The downstream DAG runs on the same schedule, but its final task only executes once both upstream DAGs have succeeded for the same execution date. This is the code of the downstream DAG:

"""Trigger Dags #1 and #2 and do something if they succeed."""
	from airflow import DAG
	from airflow.operators.sensors import ExternalTaskSensor
	from airflow.operators.dummy_operator import DummyOperator
	from airflow.utils.dates import days_ago

	with DAG(
	        'master_dag',
	        schedule_interval='*/1 * * * *',  # Every 1 minute
	        start_date=days_ago(0),
	        catchup=False) as dag:
	    externalsensor1 = ExternalTaskSensor(
	        task_id='dag_1_completed_status',
	        external_dag_id='dag_1',
	        external_task_id=None,  # wait for whole DAG to complete
	        check_existence=True,
	        timeout=120)

	    externalsensor2 = ExternalTaskSensor(
	        task_id='dag_2_completed_status',
	        external_dag_id='dag_2',
	        external_task_id=None,  # wait for whole DAG to complete
	        check_existence=True,
	        timeout=120)

	    goodbye_dummy = DummyOperator(task_id='goodbye_master')

	    [externalsensor1, externalsensor2] >> goodbye_dummy

Some important points to notice. The schedule and start date are the same as in the upstream DAGs. This is crucial for this DAG to respond to the upstream DAGs, that is, to create a dependency between the runs of the upstream DAGs and the runs of this DAG. Also, check_existence=True makes the sensors stop waiting immediately if the upstream DAG does not exist at all, and timeout=120 makes them fail if the upstream runs have not succeeded within two minutes.

And what if the execution dates don’t match but I still want to add a dependency? If the execution dates differ by a constant amount of time, you can use the execution_delta parameter of the ExternalTaskSensor.
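For example, if dag_1 were scheduled five minutes before master_dag (a hypothetical offset, not the setup of this demo), the first sensor would need to look five minutes back:

	from datetime import timedelta

	externalsensor1 = ExternalTaskSensor(
	    task_id='dag_1_completed_status',
	    external_dag_id='dag_1',
	    external_task_id=None,  # wait for whole DAG to complete
	    # Our execution_date minus this delta must equal the
	    # execution_date of the dag_1 run we are waiting for.
	    execution_delta=timedelta(minutes=5),
	    check_existence=True,
	    timeout=120)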
