Apache/Airflow and PostgreSQL with Docker and Docker Compose

Hello, in this post I will show you how to set up the official Apache Airflow image with PostgreSQL and the LocalExecutor using Docker and Docker Compose. I won’t be going through what Airflow is and how it is used; please check the official documentation for more information about that.

Before setting up and running Apache Airflow, please install Docker and Docker Compose.

For those in a hurry…

In this section, I will show you the files and directories needed to run Airflow, and in the next section I will go through them file by file, explaining what is going on.

Firstly, in the root directory create three more directories: **dags**, **logs**, and **scripts**. Then create the following files: **.env**, **docker-compose.yml**, **entrypoint.sh**, and **dummy_dag.py**. Please make sure those files and directories follow the structure below.

#project structure

root/
├── dags/
│   └── dummy_dag.py
├── scripts/
│   └── entrypoint.sh
├── logs/
├── .env
└── docker-compose.yml

Created files should contain the following:

#docker-compose.yml

version: '3.8'
services:
    postgres:
        image: postgres
        environment:
            - POSTGRES_USER=airflow
            - POSTGRES_PASSWORD=airflow
            - POSTGRES_DB=airflow
    scheduler:
        image: apache/airflow
        command: scheduler
        restart: on-failure
        depends_on:
            - postgres
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
    webserver:
        image: apache/airflow
        entrypoint: ./scripts/entrypoint.sh
        restart: on-failure
        depends_on:
            - postgres
            - scheduler
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
            - ./scripts:/opt/airflow/scripts
        ports:
            - "8080:8080"

#entrypoint.sh

#!/usr/bin/env bash
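# initialize the Airflow metadata database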
airflow initdb
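# start the Airflow web UI (listens on port 8080 by default)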
airflow webserver

#.env

AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CORE__EXECUTOR=LocalExecutor

#dummy_dag.py

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime
with DAG('example_dag', start_date=datetime(2016, 1, 1)) as dag:
    op = DummyOperator(task_id='op')

From the root directory, executing “docker-compose up” in the terminal should make Airflow accessible on localhost:8080, where the Airflow UI shows the final result.
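
If it helps, here is a minimal sketch of the commands typically used to manage the stack from the root directory (the flags and service names are the ones defined above; adjust to taste):

#managing the stack
docker-compose up -d                  # start all services in the background
docker-compose logs -f webserver      # follow the webserver logs
docker-compose ps                     # check container status
docker-compose down                   # stop and remove the containers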

If you encounter permission errors, please run “chmod -R 777” on the mounted subdirectories, e.g. “chmod -R 777 logs/”. Wide-open permissions are acceptable for local development, but use something stricter anywhere that matters.


For the curious ones...

In layman’s terms, docker is used to manage individual containers, while docker-compose manages multi-container applications. It also moves many of the options you would pass to docker run into the docker-compose.yml file for easier reuse. It works as a front-end “script” on top of the same Docker API used by docker, so you could do everything docker-compose does with plain docker commands and a lot of shell scripting.
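
To make that concrete, here is a rough illustration (not from the original post): the postgres service defined in docker-compose.yml above is approximately equivalent to running the container by hand, where the container name is my own choice:

#approximate docker run equivalent of the postgres service
docker run -d \
    --name postgres \
    -e POSTGRES_USER=airflow \
    -e POSTGRES_PASSWORD=airflow \
    -e POSTGRES_DB=airflow \
    postgres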

Before running our multi-container Docker application, docker-compose.yml must be configured. In that file, we define the services that will be started when we run docker-compose up.

The first attribute of docker-compose.yml is version, which is the Compose file format version. For the most recent file format version and all configuration options, see the Compose file reference in the official Docker documentation.

The second attribute is services, and all attributes one level below it denote the containers used in our multi-container application: postgres, scheduler and webserver. Each service has an image attribute which points to the base image used for that service.

For each service, we define the environment variables used inside its container. For postgres they are defined by the environment attribute, but for scheduler and webserver they are defined in the .env file. Because .env is an external file, we must point to it with the env_file attribute.

Opening the .env file, we can see two variables defined: one sets the executor to be used and the other the connection string. Every connection string must be defined in the following manner:

dialect+driver://username:password@host:port/database

The dialect name is the identifying name of the SQLAlchemy dialect, such as sqlite, mysql, postgresql, oracle, or mssql. The driver is the name of the DBAPI used to connect to the database, in all lowercase letters. In our case, the connection string is:

postgresql+psycopg2://airflow:airflow@postgres/airflow

Omitting the port after the host part means we will be using the default PostgreSQL port (5432) exposed by the postgres image.
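
For completeness, the same connection string with the port spelled out explicitly would look like this (assuming the default PostgreSQL port):

#equivalent connection string with an explicit port
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow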

Every service can define a command which will be run inside its Docker container. If a service needs to execute multiple commands, this can be done by defining a .sh file and pointing to it with the entrypoint attribute. In our case we have entrypoint.sh inside the scripts folder which, once executed, runs airflow initdb and airflow webserver. Both are mandatory for Airflow to run properly. Note that the apache/airflow image tag is not pinned here; initdb is the Airflow 1.10 command and was renamed to airflow db init in Airflow 2.0, so if the latest image pulls a 2.x version you will need to pin an older tag or adjust the entrypoint.
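
As a hedged sketch only (not part of the original post), an Airflow 2.x-compatible entrypoint could look roughly like this; the admin credentials are placeholders you should change:

#entrypoint.sh (Airflow 2.x variant, sketch)
#!/usr/bin/env bash
# initialize the metadata database (replaces "airflow initdb")
airflow db init
# create a login for the web UI (placeholder credentials)
airflow users create \
    --username admin \
    --password admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com
# start the web UI
airflow webserver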

By defining the depends_on attribute, we can express dependencies between services. In our example, webserver starts only after both scheduler and postgres have started, and scheduler starts only after postgres has started. Keep in mind that depends_on only controls startup order; it does not wait for PostgreSQL to actually be ready to accept connections.
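
If you want the dependent services to wait until the database is genuinely ready, newer versions of Docker Compose support a health-check-based variant; the following is a sketch under that assumption, not something from the original post:

#sketch: gate the scheduler on a postgres healthcheck (requires a recent Docker Compose)
    postgres:
        image: postgres
        healthcheck:
            test: ["CMD", "pg_isready", "-U", "airflow"]
            interval: 5s
            retries: 5
    scheduler:
        depends_on:
            postgres:
                condition: service_healthy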

In case a container crashes, we can have it restarted automatically with the restart attribute, here set to on-failure. For swarm deployments, the related deploy.restart_policy setting offers finer control through the condition, delay, max_attempts, and window options.

Once a service is running, it is served on the container’s defined port. To access that service we need to publish the container’s port to a port on the host, which is done by the ports attribute. In our case, we are publishing port 8080 of the container to TCP port 8080 of the host machine, so the webserver is reachable at localhost:8080. The left-hand side of the colon defines the host machine’s port and the right-hand side defines the container’s port.
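
For example, if port 8080 is already taken on your machine, only the left-hand side needs to change (8081 here is just an arbitrary free port):

#publish the webserver on host port 8081 instead
        ports:
            - "8081:8080"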

Lastly, the volumes attribute defines shared volumes (directories) between the host file system and the Docker container. Because Airflow’s default working directory is /opt/airflow/, we need to map our designated directories from the root folder into the Airflow container’s working directory. That is done with the following mappings:

#general case for airflow
- ./<our-root-subdir>:/opt/airflow/<our-root-subdir>
#our case
- ./dags:/opt/airflow/dags
- ./logs:/opt/airflow/logs
- ./scripts:/opt/airflow/scripts
           ...

This way, when the scheduler or webserver writes logs to its logs directory, we can access them from our file system within the logs directory. When we add a new DAG to the dags folder, it is automatically picked up in the container’s DAG bag, and so on.
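
To illustrate, a second DAG file dropped into dags/ (a hypothetical hello_dag.py, not part of the original post) would show up in the UI shortly after the scheduler scans the folder:

#dags/hello_dag.py (illustrative sketch)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# a single daily task that runs one shell command
with DAG('hello_dag',
         start_date=datetime(2020, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:
    say_hello = BashOperator(task_id='say_hello',
                             bash_command='echo "Hello from Airflow"')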

Originally published by Ivan Rezic at Towards Data Science.

#docker #how-to #apache-airflow #docker-compose #postgresql



Using Apache Airflow DockerOperator with Docker Compose

Most of the tutorials on the web around the DockerOperator are awesome, but they have a missing link that I want to cover here today: none of them assumes that you’re running Apache Airflow with Docker Compose.

All code and further instructions are in the repo fclesio/airflow-docker-operator-with-compose.
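
As a rough, hedged sketch of what such a task can look like (the image, command and parameters below are illustrative and not taken from that repo):

#illustrative DockerOperator task (sketch)
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG('docker_operator_example',
         start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    run_in_container = DockerOperator(
        task_id='run_in_container',
        image='python:3.8-slim',                    # illustrative image
        command='python -c "print(123)"',
        docker_url='unix://var/run/docker.sock',    # the executing container needs access to the Docker socket
        network_mode='bridge',
        auto_remove=True,
    )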

#docker #docker-compose #airflow #dockeroperator


Apache Airflow 2.0 Postgresql Complete Installation With Docker Explained

Apache Airflow is an open-source ETL tool that helps to extract data from the source, transform it according to our needs, and finally load it into the target database.

As a bonus point, that’s what ETL stands for: Extract, Transform, and Load.

We can schedule our ETL processes in Airflow according to our requirements.

Apache Airflow is purely Python-oriented.

Installing Airflow can be tricky as it involves different services that need to be set up. For example, for parallel processing we need PostgreSQL or MySQL instead of SQLite (the default database Airflow uses for its metadata), and we will be covering that too.

This is the main reason why we install Airflow with Docker: Docker takes care of all the complicated configuration settings and service integration for us.

How to install Airflow 2.0 with WSL

We have two methods to install Airflow. The first is with Docker and the second is with WSL (Windows Subsystem for Linux), and we are going to discuss both.
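
For the WSL route, a hedged sketch of the usual pip-based installation looks like this (the Airflow and Python versions in the constraint URL are placeholders; match them to your environment):

#installing Airflow 2.0 inside WSL (sketch)
pip install "apache-airflow==2.0.2" \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.0.2/constraints-3.8.txt"
airflow db init                     # initialize the metadata database
airflow webserver --port 8080 &     # start the web UI
airflow scheduler                   # start the scheduler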

#apache-airflow #python #docker #etl-tool #postgresql

Deploy Apache Airflow in Multiple Docker Containers

When it comes to data science models, they are intended to run periodically. As an example, if we are predicting customer churn for next month, the model has to be run on the last day of each month. Manually running this model every month is not an option; we can use a scheduler to automate the process. Apache Airflow is an ideal tool for this as it allows you to schedule and monitor your workflows. In this article we will be talking about how to deploy Apache Airflow using Docker while keeping room to scale up further. Being familiar with Apache Airflow and Docker concepts will be an advantage in following this article.

Introduction to Apache Airflow

Airflow consists of 3 major components: the Web Server, the Scheduler and a Meta Database. The web server is responsible for the user interface, where users interact with the application. The scheduler takes care of job scheduling, while the meta database stores the scheduling details. Even though Airflow has several executors, the Celery executor is more suitable for scalability. With the Celery executor, 3 additional components are added to Airflow: the Worker, a Message Broker and a Worker Monitor. Workers are responsible for executing the jobs triggered by the scheduler. There can be multiple workers, and they can be distributed across cluster instances. The number of workers can be decided based on the workload the system has to perform, along with machine capabilities. The message broker helps Celery operate, and a monitoring tool can be used to monitor the Celery workers.

Apache Airflow with Celery Executor (Image by author)

With Docker, we plan for each of the above components to run inside an individual Docker container. The Web Server, Scheduler and workers will use a common Docker image. This common image is unique to the project, and the Dockerfile to build it will be discussed. All the other containers will use publicly available images directly.

For this tutorial, PostgreSQL is used as the meta database, Redis as the message broker, and Celery Flower to monitor the workers. Since there are multiple containers, it is easiest to use Docker Compose to deploy all the containers at once.
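
To wire those pieces together, Airflow’s Celery-related settings are usually supplied as environment variables; the following is a sketch that assumes services named postgres and redis with illustrative credentials:

#environment sketch for a CeleryExecutor setup
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres/airflow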

#docker-compose #pythonoperator #docker #celery-executor #apache-airflow


Get a Fully Configured Apache Airflow Docker Dev Stack with Bitnami

I’ve been using Apache Airflow for around 2 years now to build out custom workflow interfaces, like those used for Laboratory Information Management Systems (LIMS), Computer Vision pre- and postprocessing pipelines, and to set and forget other genomics pipelines.

My favorite feature of Airflow is how completely agnostic it is to the work you are doing or where that work is taking place. It could take place locally, on a Docker image, on Kubernetes, on any number of AWS services, on an HPC system, etc. Using Airflow allows me to concentrate on the business logic of what I’m trying to accomplish without getting too bogged down in implementation details.

During that time I’ve adopted a set of systems that I use to quickly build out the main development stack with Docker and Docker Compose, using the Bitnami Apache Airflow stack. Generally, I deploy the stack to production using either the same Docker Compose stack, if it’s a small enough isolated instance, or Kubernetes when I need to interact with other services or file systems.

Bitnami vs Roll Your Own

I used to roll my own Airflow containers using Conda. I still use this approach for most of my other containers, including microservices that interact with my Airflow system, but configuring Airflow is a lot more than just installing packages. Even just installing those packages is a pain, and I could rarely count on a rebuild actually working without trouble. Then, on top of the packages, you need to configure database connections and a message queue.

In comes the Bitnami Apache Airflow docker compose stack for dev and Bitnami Apache Airflow Helm Chart for prod!

Bitnami, in their own words:

_Bitnami makes it easy to get your favorite open source software up and running on any platform, including your laptop, Kubernetes and all the major clouds. In addition to popular community offerings, Bitnami, now part of VMware, provides IT organizations with an enterprise offering that is secure, compliant, continuously maintained and customizable to your organizational policies. _https://bitnami.com/

Bitnami stacks (usually) work completely the same from their Docker Compose stacks to their Helm charts. This means I can test and develop locally using my Compose stack, build out new images, versions, packages, etc., and then deploy to Kubernetes. The configuration, environment variables, and everything else act the same. It would be a fairly large undertaking to do all this from scratch, so I use Bitnami.

They have plenty of enterprise offerings, but everything included here is open source and there is no paywall involved.

And no, I am not affiliated with Bitnami, although I have kids that eat a lot and don’t have any particular ethical aversions to selling out. ;-) I’ve just found their offerings to be excellent.

Project Structure

I like to have my projects organized so that I can run tree and have a general idea of what’s happening.

Apache Airflow has 3 main components: the application, the worker, and the scheduler. Each of these has its own Docker image to separate out the services. Additionally, there is a database and a message queue, but we won’t be doing any customization to these.

.
└── docker
    └── bitnami-apache-airflow-1.10.10
        ├── airflow
        │   └── Dockerfile
        ├── airflow-scheduler
        │   └── Dockerfile
        ├── airflow-worker
        │   └── Dockerfile
        ├── dags
        │   └── tutorial.py
        ├── docker-compose.yml

So what we have here is a directory called bitnami-apache-airflow-1.10.10. Which brings us to a very important point! Pin your versions! It will save you so, so much pain and frustration!

Then we have one Dockerfile per Airflow piece.
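
As a minimal sketch (assuming the stock Bitnami images are enough to start with), each of those Dockerfiles can simply pin the matching Bitnami image, with project-specific customizations added later:

#airflow/Dockerfile (sketch)
FROM bitnami/airflow:1.10.10
#airflow-scheduler/Dockerfile (sketch)
FROM bitnami/airflow-scheduler:1.10.10
#airflow-worker/Dockerfile (sketch)
FROM bitnami/airflow-worker:1.10.10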

Create this directory structure with:

mkdir -p docker/bitnami-apache-airflow-1.10.10/{airflow,airflow-scheduler,airflow-worker,dags}

#data-science #docker #apache-airflow #docker-compose #python