Alex Tyler

How to Run Apache Airflow DAG with Docker

In this blog, we are going to run a sample dynamic DAG using Docker.

Before that, let’s get a quick idea about Airflow and some of its terms.

What is Airflow?

Airflow is a workflow engine that manages and schedules running jobs and data pipelines. It ensures that jobs are ordered correctly based on their dependencies, and it also manages the allocation of resources and handles failures.

Before going forward, let’s get familiar with the terms:

Task or Operator: A defined unit of work.

Task instance: An individual run of a single task. The states could be running, success, failed, skipped, and up for retry.

DAG (Directed Acyclic Graph): A set of tasks with an execution order.

DAG Run: An individual run of a DAG.

Web Server: The UI of Airflow. It also allows us to manage users, roles, and various configuration options for the Airflow setup.

Scheduler: Schedules the jobs and orchestrates the tasks. It uses the DAG definitions to decide which tasks need to be run, when, and where.

Executor: Executes the tasks. There are different types of executors (a configuration sketch follows this list):

  • Sequential: Runs one task instance at a time.
  • Local: Runs tasks by spawning processes in a controlled fashion in different modes.
  • Celery: An asynchronous task queue/job queue based on distributed message passing. For the CeleryExecutor, one needs to set up a queue (Redis, RabbitMQ, or any other task broker supported by Celery), which all running Celery workers poll for new tasks to run.
  • Kubernetes: Provides a way to run Airflow tasks on Kubernetes; it launches a new pod for each task.
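
The executor is chosen through Airflow’s configuration. A minimal sketch, assuming the standard configuration keys (the environment variable form overrides airflow.cfg):

#selecting an executor in airflow.cfg
[core]
executor = LocalExecutor

#...or equivalently as an environment variable
AIRFLOW__CORE__EXECUTOR=LocalExecutor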

Metadata Database: Stores the Airflow states. Airflow uses SQLAlchemy, a Python ORM (Object Relational Mapper), to connect to the metadata database.

Now that we are familiar with the terms, let’s get started.
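
The sample we will run is a dynamic DAG, i.e. one whose tasks are generated in code rather than declared one by one. A minimal sketch of the idea (the task names and count here are illustrative assumptions, not the actual sample):

#dynamic DAG sketch, illustrative
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG('dynamic_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    start = DummyOperator(task_id='start')
    # tasks are generated in a loop instead of being declared by hand
    for i in range(3):
        start >> DummyOperator(task_id='task_{}'.format(i))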

#apache airflow #docker #python #scala

Cayla Erdman

Apache/Airflow and PostgreSQL with Docker and Docker Compose

Hello, in this post I will show you how to set up the official Apache/Airflow image with PostgreSQL and LocalExecutor using docker and docker-compose. I won’t be going through what Airflow is and how it is used; please check the official documentation for more information about that.

Before setting up and running Apache Airflow, please install Docker and Docker Compose.

For those in a hurry…

In this chapter, I will show you the files and directories needed to run Airflow, and in the next chapter I will go file by file, line by line, explaining what is going on.

Firstly, in the root directory create three more directories: dags, logs, and scripts. Then create the following files: .env, docker-compose.yml, entrypoint.sh, and dummy_dag.py. Please make sure those files and directories follow the structure below.

#project structure

root/
├── dags/
│   └── dummy_dag.py
├── scripts/
│   └── entrypoint.sh
├── logs/
├── .env
└── docker-compose.yml

Created files should contain the following:

#docker-compose.yml

version: '3.8'
services:
    postgres:
        image: postgres
        environment:
            - POSTGRES_USER=airflow
            - POSTGRES_PASSWORD=airflow
            - POSTGRES_DB=airflow
    scheduler:
        image: apache/airflow
        command: scheduler
        restart: on-failure
        depends_on:
            - postgres
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
    webserver:
        image: apache/airflow
        entrypoint: ./scripts/entrypoint.sh
        restart: on-failure
        depends_on:
            - postgres
            - scheduler
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
            - ./scripts:/opt/airflow/scripts
        ports:
            - "8080:8080"
#entrypoint.sh
#!/usr/bin/env bash
# Initialize the metadata database, then start the webserver.
# (On Airflow 2.x images, "airflow initdb" was replaced by "airflow db init".)
airflow initdb
airflow webserver

#.env
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CORE__EXECUTOR=LocalExecutor

#dummy_dag.py
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

with DAG('example_dag', start_date=datetime(2016, 1, 1)) as dag:
    op = DummyOperator(task_id='op')

From the root directory, executing “docker-compose up” in the terminal should make Airflow accessible on localhost:8080.

If you encounter permission errors, please run “chmod -R 777” on all subdirectories, e.g. “chmod -R 777 logs/”


For the curious ones...

In layman’s terms, docker is used to manage individual containers, while docker-compose manages multi-container applications. It also moves many of the options you would otherwise pass to docker run into the docker-compose.yml file for easier reuse. It works as a front-end "script" on top of the same Docker API used by docker, so you could do everything docker-compose does with plain docker commands and a lot of shell scripting.
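
As an illustration of that equivalence, the postgres service from our docker-compose.yml could be started by hand with plain docker (a rough one-off equivalent; compose additionally wires the services into a shared network so they can reach each other by name):

#rough docker-run equivalent of the postgres service
docker run -d --name postgres \
    -e POSTGRES_USER=airflow \
    -e POSTGRES_PASSWORD=airflow \
    -e POSTGRES_DB=airflow \
    postgres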

Before running our multi-container docker application, docker-compose.yml must be configured. In that file, we define the services that will be run on docker-compose up.

The first attribute of docker-compose.yml is version, which is the Compose file format version. For the most recent file format version and all configuration options, see the official Compose file reference.

The second attribute is services, and all attributes one level below services denote the containers used in our multi-container application: postgres, scheduler, and webserver. Each service has an image attribute, which points to the base image used for that service.

For each service, we define the environment variables used inside its container. For postgres they are set with the environment attribute, but for scheduler and webserver they are defined in the .env file. Because .env is an external file, we must point to it with the env_file attribute.

Opening the .env file, we can see two variables defined: one sets the executor to be used, the other the connection string. Each connection string must be defined in the following manner:

dialect+driver://username:password@host:port/database

The dialect is the identifying name of the SQLAlchemy dialect, such as sqlite, mysql, postgresql, oracle, or mssql. The driver is the name of the DBAPI used to connect to the database, written in all lowercase letters. In our case, the connection string is:

postgresql+psycopg2://airflow:airflow@postgres/airflow

Omitting the port after the host part means we are using the default port exposed by the postgres image, 5432.
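
Written out with the port made explicit, the same connection string would be:

postgresql+psycopg2://airflow:airflow@postgres:5432/airflow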

Every service can define a command which will be run inside its Docker container. If a service needs to execute multiple commands, this can be done by defining an .sh file and pointing to it with the entrypoint attribute. In our case we have entrypoint.sh inside the scripts folder which, once executed, runs airflow initdb and airflow webserver. Both are mandatory for Airflow to run properly.
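
Because the webserver container may come up before PostgreSQL is ready to accept connections, a more defensive entrypoint can poll the database first. A minimal sketch, assuming psycopg2 is available in the image (it is here, since it is our configured driver):

#scripts/entrypoint.sh, defensive variant
#!/usr/bin/env bash
# Block until PostgreSQL accepts connections, then initialize and start.
until python -c "import psycopg2; psycopg2.connect(host='postgres', user='airflow', password='airflow', dbname='airflow')" 2>/dev/null; do
    echo "Waiting for PostgreSQL..."
    sleep 2
done
airflow initdb       # "airflow db init" on Airflow 2.x
airflow webserver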

By defining the depends_on attribute, we can express dependencies between services. In our example, webserver starts only after both scheduler and postgres have started, and scheduler only starts after postgres has started. Note that depends_on only controls start order; it does not wait for a service to actually be ready, which is why polling the database as sketched above can be useful.

In case a container crashes, we can restart it automatically via the restart attribute (here set to on-failure). The related deploy.restart_policy option, used for swarm deployments, offers finer control through condition, delay, max_attempts, and window.

Once a service is running, it is served on the container’s defined port. To access the service, we need to expose the container’s port to a port on the host. That is done with the ports attribute. In our case, we are exposing port 8080 of the container on TCP port 8080 of 127.0.0.1 (localhost) on the host machine. The left side of the : defines the host machine’s port and the right-hand side the container’s port.
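
For example, to reach the webserver on localhost:9090 instead, only the host side of the mapping changes (a hypothetical alternative):

#hypothetical alternative port mapping
ports:
    - "9090:8080"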

Lastly, the volumes attribute defines shared volumes (directories) between the host file system and the docker container. Because Airflow’s default working directory is /opt/airflow/, we need to map our designated directories from the root folder into the Airflow container’s working directory. That is done with the following mappings:

#general case for airflow
- ./<our-root-subdir>:/opt/airflow/<our-root-subdir>
#our case
- ./dags:/opt/airflow/dags
- ./logs:/opt/airflow/logs
- ./scripts:/opt/airflow/scripts
           ...

This way, when the scheduler or webserver writes logs to its logs directory, we can access them from our file system within the logs directory; when we add a new DAG to the dags folder, it is automatically picked up in the container’s DAG bag; and so on.

Originally published by Ivan Rezic at Towards Data Science

#docker #how-to #apache-airflow #docker-compose #postgresql

Gerhard Brink

Apache Airflow - A Workflow Manager

As the industry becomes more data-driven, we need solutions that can process the large amounts of data required. A workflow management system provides an infrastructure for the set-up, execution, and monitoring of a defined sequence of tasks, arranged as a workflow application. Workflow management has become such a common need that most companies have multiple ways of creating and scheduling jobs internally. Apache Airflow is a framework for processing data in a data pipeline. Although Airflow is not a data streaming solution, it deals well with data that is quite stable or slowly changing. It acts as an orchestrator, keeping processes coordinated in a distributed system. Airflow is an initiative of Airbnb and is written in Python.

Airflow makes it easy for a user to author workflows using Python scripts. A Directed Acyclic Graph (DAG) of tasks defines a workflow in Apache Airflow; it contains a set of tasks which execute according to their dependencies.

For example, to build a sales dashboard for your store, you need to perform the following tasks:

  1. Fetch the sales records information
  2. Clean the data / Sort the data according to the profit margins
  3. Push the data to the dashboard

The dependency chain of the tasks mentioned above is: fetch the sales records, then clean/sort the data, then push the data to the dashboard.

These tasks are performed in a specific order. For example, Task 2 (cleaning the data) won’t start if we haven’t already completed Task 1 (fetching the data).
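
A hedged sketch of how this ordering could be expressed as an Airflow DAG (the file name, task ids, and use of PythonOperator are illustrative assumptions, not from the original post):

#sales_dashboard_dag.py, illustrative sketch
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def fetch_sales():
    print("fetching the sales records")

def clean_data():
    print("cleaning and sorting the data by profit margin")

def push_to_dashboard():
    print("pushing the data to the dashboard")

with DAG('sales_dashboard', start_date=datetime(2021, 1, 1), schedule_interval='@daily') as dag:
    fetch = PythonOperator(task_id='fetch_sales', python_callable=fetch_sales)
    clean = PythonOperator(task_id='clean_data', python_callable=clean_data)
    push = PythonOperator(task_id='push_to_dashboard', python_callable=push_to_dashboard)

    # Task 2 will not start until Task 1 succeeds, and so on
    fetch >> clean >> push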

Scheduling of tasks

Apache Airflow allows us to define a schedule interval for each DAG, which determines exactly when Airflow runs your pipeline. This way, you can tell Airflow to execute your DAG

  • @hourly: once an hour (0 * * * *)
  • @daily: once a day (0 0 * * *)
  • @weekly: once a week (0 0 * * 0)
  • None: no schedule; the DAG is only triggered manually
  • @once: exactly once

and so on, or even use more complicated schedule intervals based on cron-like expressions.
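
For instance, pinning a DAG to a custom cron expression looks like this (a minimal sketch; the expression, every day at 06:30, is just an illustration):

#scheduling a DAG with a cron expression, illustrative
from datetime import datetime
from airflow import DAG

dag = DAG(
    'cron_scheduled_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval='30 6 * * *',  # every day at 06:30
)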

#apache airflow #big data and fast data #devops #airflow #airflow-setup #apache #data-pipelines


Iliana Welch

Docker Explained: Docker Architecture | Docker Registries

Following the second video about Docker basics, in this video I explain the Docker architecture and the different building blocks of the Docker engine: the Docker client, the API, and the Docker daemon. I also explain what a Docker registry is, and I finish the video with a demo explaining and illustrating how to use Docker Hub.

In this video lesson you will learn:

  • What is Docker Host
  • What is Docker Engine
  • Learn about Docker Architecture
  • Learn about Docker client and Docker Daemon
  • Docker Hub and Registries
  • Simple demo to understand using images from registries (a generic example follows below)
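
For a flavor of that kind of demo, pulling a public image from Docker Hub and running it looks like this (a generic example, not necessarily the one shown in the video):

#pulling and running an image from Docker Hub
docker pull hello-world
docker run hello-world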

#docker #docker hub #docker host #docker engine #docker architecture #api

Paris Turcotte

How to Run Your First Airflow DAG in Docker

In the first article, we learned how to start running Airflow in Docker, and in this post we will provide an example of how you can run a DAG in Docker. We assume that you have already followed the steps of running Airflow in Docker and are ready to run the compose file.

Run the docker-compose

The first thing you need to do is set your working directory to your airflow directory, which most probably consists of the folders shown below.
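
If you followed the official docker-compose setup, that directory probably looks something like this (an assumed layout based on the official Airflow docker-compose guide):

#expected airflow directory, assumed
airflow/
├── dags/
├── logs/
├── plugins/
└── docker-compose.yaml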

At this point, you can run the docker-compose command.

docker-compose up

Now you should be able to open http://localhost:8080/home and see the Airflow UI. Note that the default username and password are airflow and airflow respectively.

#docker-compose #docker #airflow #data-engineering #data-orchestration