Often, when you think about machine learning, you tend to think about the great models you can now create. But if you want to take these amazing models and make them available to the world, you have to move beyond just training the model and incorporate data collection, feature engineering, training, evaluation, and serving. You will also have to figure out the lifecycle management of your data.

On top of all that, you will also have to remember that you’re putting a software application into production. That means you’ll have all the requirements that any production software has, including scalability, consistency, modularity, testability, and security. It’s way more than training an ML model now!

An ML pipeline lets you build many of the requirements and best practices of production software into your deployment. It will also reduce the technical debt of a machine learning system, as this paper wonderfully describes. This segues into the field of MLOps, a fast-growing discipline that, similar to DevOps, aims to automate and monitor all steps of the ML system.

This tutorial will show you how to build a simple ML pipeline that automates the workflow of a deep learning image classifier for dandelions and grass, built using FastAI and served as a web app using Starlette. Out of the many workflow tools available, such as Luigi, MLflow, and Kubeflow, we’ll use Apache Airflow because it is widely adopted by companies and the open-source community. Airflow is open-source software that allows you to programmatically author and schedule your workflows as directed acyclic graphs (DAGs) and monitor them via the built-in Airflow user interface. At the end of the tutorial, I’ll show you further steps you can take to make your pipeline production-ready.
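To make the DAG idea concrete, here is a minimal sketch of what an Airflow DAG file looks like. It assumes Airflow 1.x import paths, and the task names and commands are illustrative placeholders, not the tutorial’s actual pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# A DAG groups tasks together and gives them a schedule for re-running.
dag = DAG(
    dag_id="example_ml_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",  # run the whole workflow once a day
)

# Placeholder tasks standing in for the real pipeline steps.
fetch_data = BashOperator(task_id="fetch_data", bash_command="echo fetch", dag=dag)
train_model = BashOperator(task_id="train_model", bash_command="echo train", dag=dag)
deploy_app = BashOperator(task_id="deploy_app", bash_command="echo deploy", dag=dag)

# The >> operator declares the edges of the graph: fetch_data runs first,
# then train_model, then deploy_app.
fetch_data >> train_model >> deploy_app
```

Once this file is placed in Airflow’s DAGs folder, the workflow shows up in the Airflow UI, where each run and each task can be monitored individually.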

Requirements: You will need just the computer you have now, and a Google account!

This tutorial will be broken down into the following steps:

  1. Sign up for Google Cloud Platform and create a compute instance
  2. Pull tutorial contents from GitHub
  3. Overview of ML pipeline in Airflow
  4. Install Docker & set up virtual hosts using nginx
  5. Build and run a Docker container
  6. Open Airflow UI and run ML pipeline
  7. Run deployed web app

1. Sign Up for Google Cloud Platform and Create a Compute Instance

Signing up for GCP is free!

If you haven’t already, sign up for Google Cloud Platform through your Google account. You’ll have to enter your credit card, but you won’t be charged anything upon signing up. You’ll also get $300 worth of free credits that last for 12 months! If you’ve run out of credits, don’t worry — running this tutorial will cost pennies, provided you stop your VM instance afterward!

Once you’re in the console, go to Compute Engine and create an instance (an equivalent gcloud command follows the list). Then:

  1. Name the instance greenr-airflow
  2. Set the machine type to n1-standard-8
  3. Set the OS to Ubuntu 18.04
  4. Increase the boot disk size to 30 GB
  5. Allow full access to Cloud APIs and allow HTTP/HTTPS traffic
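If you prefer the command line, roughly the same setup can be scripted with the gcloud CLI. This is a sketch assuming the Cloud SDK is installed and a default project and zone are configured:

```bash
# Create the VM with the same settings chosen in the console:
# --scopes=cloud-platform grants full access to Cloud APIs, and the
# http-server/https-server tags open HTTP/HTTPS traffic via the
# default firewall rules.
gcloud compute instances create greenr-airflow \
    --machine-type=n1-standard-8 \
    --image-family=ubuntu-1804-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=30GB \
    --scopes=cloud-platform \
    --tags=http-server,https-server
```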

SSH into your instance once it has been created and is running.
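You can use the SSH button next to the instance in the console, or, assuming the Cloud SDK is set up locally, connect from your own terminal:

```bash
# Opens an SSH session to the VM, generating SSH keys on first use.
gcloud compute ssh greenr-airflow
```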

#machine-learning #devops #cloud-computing #pipeline #mlops
