“All models are wrong, but some are useful.” — George Box.
Building a solution with Machine Learning (ML) is a complex task in itself. While academic ML has its roots in research from the 1980s, the practical implementation of machine learning systems in production is still relatively new.
Today I would like to share some ideas on how to make your ML pipelines robust and interpretable using Apache Airflow. The whole project is available on GitHub. The code is dockerized, so it is straightforward to play around with even if you are not familiar with the technology.
The topic is complex and multifaceted. In this article, I will focus on only two parts of any ML project: Data Validation and Model Evaluation. The goal is to share practical ideas that you can introduce into your project relatively easily, yet still gain significant benefits from.
The subject is very extensive, so let’s introduce several restrictions:
If you are not familiar with Airflow, it is a platform to programmatically author, schedule, and monitor workflows [1]. Airflow workflows are built as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Command-line utilities make performing complex surgeries on DAGs a snap, and the user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
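As a minimal sketch of what that looks like in code (Airflow 2.x; the DAG id, task names, and commands below are illustrative, not taken from the project):

```python
# A minimal Airflow DAG: two tasks connected by a dependency.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ml_pipeline_example",       # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    validate_data = BashOperator(
        task_id="validate_data",
        bash_command="echo 'validating data'",
    )
    train_model = BashOperator(
        task_id="train_model",
        bash_command="echo 'training model'",
    )

    # The scheduler will only run train_model after
    # validate_data has finished successfully.
    validate_data >> train_model
```

The `>>` operator declares the edge of the graph: downstream tasks run only once their upstream dependencies succeed.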
Data Validation is the process of ensuring that data is present, correct, and meaningful. Ensuring the quality of your data through automated validation checks is a critical step in building data pipelines at any organization.
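As a toy illustration of "present, correct, and meaningful", a validation gate might look like the following sketch. The function name, thresholds, and column handling are hypothetical, chosen only to show the idea of gating training on data quality:

```python
# Hypothetical validation gate: returns True only if the incoming
# batch is present, large enough, and free of missing values.
def should_train(rows, required_columns, min_rows=100):
    if not rows:
        return False  # data is absent entirely
    if len(rows) < min_rows:
        return False  # too little data to train on
    for row in rows:
        # Every required column must exist and be non-null.
        if any(col not in row or row[col] is None for col in required_columns):
            return False
    return True
```

In a real pipeline, a check like this would decide whether the training task runs at all or the pipeline stops early.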
The data validation step runs before model training and decides whether to train the model or stop the execution of the pipeline. This decision is made automatically if the pipeline identifies any of the following [2]:
The importance of data validation:
Airflow provides a group of check operators that make it easy to verify data quality. Let's look at how to use such operators with practical examples.
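As a sketch of what these checks can look like (Airflow 2.x with the common-sql provider; the connection id, table, partition column, and expected count are all illustrative assumptions, not values from the project):

```python
# Two SQL-based data quality checks chained before training.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import (
    SQLCheckOperator,
    SQLValueCheckOperator,
)

with DAG(
    dag_id="data_quality_checks",       # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Fails the task (and therefore stops downstream training)
    # if the query returns a falsy value, e.g. a zero row count.
    rows_present = SQLCheckOperator(
        task_id="rows_present",
        conn_id="warehouse_db",         # hypothetical connection id
        sql="SELECT COUNT(*) FROM events WHERE ds = '{{ ds }}'",
    )

    # Fails if the actual count deviates from the expected value
    # by more than the tolerance (here, +/- 50% around 10,000).
    row_count_in_range = SQLValueCheckOperator(
        task_id="row_count_in_range",
        conn_id="warehouse_db",
        sql="SELECT COUNT(*) FROM events WHERE ds = '{{ ds }}'",
        pass_value=10000,
        tolerance=0.5,
    )

    rows_present >> row_count_in_range
```

A failed check marks the task as failed, so any training task downstream of it simply never runs — exactly the stop-the-pipeline behavior described above.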