In Airflow, a State describes the status of a DAG or a task, including one that is waiting to execute its next steps, and shares how far the pipeline has progressed. Without the State, the execution of any DAG or task becomes a black box, and you would need to create external flags or resources to determine whether a job finished or failed. Fortunately, Airflow provides the State mechanism and stores each last recorded state in its backend DB. This makes it easy to check the status of any job in the Airflow UI or the DB, and it also gives you a persistent layer to rely on when you rerun or backfill after a failure.

In this article, we are going to discuss the fundamentals of the Airflow State: what it is, what types of states exist, and how to use them to test and debug. Airflow may also track the states of external services, but those states are out of scope for our discussion.

What does the State do in Airflow?

A good real-life example of states is the traffic light. It has three states: RED, YELLOW, and GREEN. The RED light forbids any traffic from proceeding, whereas the GREEN light allows traffic to proceed.

The most basic use of the Airflow state is to designate the current status and let the Airflow scheduler decide future actions. Airflow has more states than a traffic light, but they share some common characteristics.

  • No Dual States. In Airflow, the State is a single value, and dual states are not permitted. A State that is both “FAILED” and “UP_FOR_RETRY” would not make sense.
  • The State is static, a snapshot at a given moment. Airflow saves the State in its backend DB, and updating the State is not a continuous process. Because of the Airflow scheduler heartbeat interval, you could hit rare cases where the State in the DB lags behind, such as when the scheduler goes down.
  • The State has a defined lifecycle. There is a detailed lifecycle diagram in the Airflow repository. The State has to follow that lifecycle, and it usually cannot go backward, except in retry cases.
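
The three characteristics above can be sketched in a few lines of plain Python. This is only an illustrative model, not Airflow's own code: the state names mirror ones Airflow uses, but `TaskState`, `ALLOWED`, and `TaskRecord` are hypothetical names, and the transition table is a deliberately simplified assumption rather than the full lifecycle from the Airflow repository.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class TaskState(str, Enum):
    """Simplified subset of task states (names mirror Airflow's; the model is an assumption)."""
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    UP_FOR_RETRY = "up_for_retry"


# Hypothetical, simplified lifecycle: which states may follow which.
ALLOWED = {
    TaskState.SCHEDULED: {TaskState.QUEUED},
    TaskState.QUEUED: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.SUCCESS, TaskState.FAILED, TaskState.UP_FOR_RETRY},
    TaskState.UP_FOR_RETRY: {TaskState.SCHEDULED},  # the only backward edge: a retry
    TaskState.SUCCESS: set(),  # terminal in this sketch
    TaskState.FAILED: set(),   # terminal in this sketch
}


@dataclass
class TaskRecord:
    """One stored row: a single state value plus the time it was last recorded."""
    task_id: str
    state: TaskState = TaskState.SCHEDULED  # exactly one value, never two at once
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def transition(self, new_state: TaskState) -> None:
        """Move to new_state only if the simplified lifecycle allows it."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
        # The record is a snapshot: it only changes when someone writes an update.
        self.updated_at = datetime.now(timezone.utc)


record = TaskRecord("extract")
for step in (TaskState.QUEUED, TaskState.RUNNING,
             TaskState.UP_FOR_RETRY, TaskState.SCHEDULED):
    record.transition(step)
print(record.state.value, record.updated_at.isoformat())
```

In this sketch, a record that has reached SUCCESS rejects a move back to RUNNING, while a RUNNING record may go to UP_FOR_RETRY and then back to SCHEDULED, which is the retry exception described in the last bullet.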


Airflow State 101: An Overview Apache Airflow State