Automated Machine Learning on the Cloud in Python

An introduction to the future of data science

Two trends have recently become apparent in data science:

  1. Data analysis and model training is done using cloud resources
  2. Machine learning pipelines are algorithmically developed and optimized

This article will cover a brief introduction to these topics and show how to implement them, using Google Colaboratory to do automated machine learning on the cloud in Python.

Cloud Computing using Google Colab

Originally, all computing was done on a mainframe. You logged in via a terminal, and connected to a central machine where users simultaneously shared a single large computer. Then, along came microprocessors and the personal computer revolution and everyone got their own machine. Laptops and desktops work fine for routine tasks, but with the recent increase in size of datasets and computing power needed to run machine learning models, taking advantage of cloud resources is a necessity for data science.

Cloud computing in general refers to the “delivery of computing services over the Internet”. This covers a wide range of services, from databases to servers to software, but in this article we will run a simple data science workload on the cloud in the form of a Jupyter Notebook. We will use the relatively new Google Colaboratory service: online Jupyter Notebooks in Python which run on Google’s servers, can be accessed from anywhere with an internet connection, are free to use, and are shareable like any Google Doc.

Google Colab has made the process of using cloud computing a breeze. In the past, I spent dozens of hours configuring an Amazon EC2 instance so I could run a Jupyter Notebook on the cloud and had to pay by the hour! Fortunately, last year, Google announced you can now run Jupyter Notebooks on their Colab servers for up to 12 hours at a time completely free. (If that’s not enough, Google recently began letting users add an NVIDIA Tesla K80 GPU to the notebooks.) The best part is these notebooks come pre-installed with most data science packages, and more can be easily added, so you don’t have to worry about the technical details of getting set up on your own machine.

To use Colab, all you need is an internet connection and a Google account. If you just want an introduction, head to colab.research.google.com and create a new notebook, or explore the tutorial Google has developed (called Hello, Colaboratory). To follow along with this article, get the notebook here. Sign into your Google account, open the notebook in Colaboratory, click File > save a copy in Drive, and you will then have your own version to edit and run.

Data science is becoming increasingly accessible with the wealth of resources online, and the Colab project has significantly lowered the barrier to cloud computing. For those who have done prior work in Jupyter Notebooks, it’s a completely natural transition, and for those who haven’t, it’s a great opportunity to get started with this commonly used data science tool!

Automated Machine Learning using TPOT

Automated machine learning (abbreviated auto-ml) aims to algorithmically design and optimize a machine learning pipeline for a particular problem. In this context, the machine learning pipeline consists of:

  1. Feature preprocessing: imputing missing values, scaling, and constructing new features
  2. Model selection
  3. Hyperparameter tuning

There are an almost infinite number of ways these steps can be combined together, and the optimal solution will change for every problem! Designing a machine learning pipeline can be a time-consuming and frustrating process, and at the end, you will never know if the solution you developed is even close to optimal. Auto-ml can help by evaluating thousands of possible pipelines to try and find the best (or near-optimal) solution for a particular problem.

It’s important to remember that machine learning is only one part of the data science process, and automated machine learning is not meant to replace the data scientist. Instead, auto-ml is meant to free the data scientist so she can work on more valuable aspects of the process, such as gathering data or interpreting a model.

There are a number of auto-ml tools — H2O, auto-sklearn, Google Cloud AutoML — and we will focus on TPOT: Tree-based Pipeline Optimization Tool developed by Randy Olson. TPOT (your “data-science assistant”) uses genetic programming to find the best machine learning pipeline.

Interlude: Genetic Programming

To use TPOT, it’s not really necessary to know the details of genetic programming, so you can skip this section. For those who are curious, at a high level, genetic programming for machine learning works as follows:

  1. Start with an initial population of randomly generated machine learning pipelines
  2. Evaluate each pipeline with a fitness function (here, cross-validation performance on the training data)
  3. Keep the best-performing pipelines and produce a new generation through crossover and mutation
  4. Repeat for a set number of generations or amount of time

(For more details on genetic programming, check out this short article.)
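
To make the idea concrete, here is a toy sketch of the loop over very simple scikit-learn pipelines. This is only an illustration of the mechanics, not TPOT's implementation: TPOT evolves far richer pipeline structures (via the DEAP library) and also applies crossover, which is omitted here for brevity.

# toy genetic search over tiny scikit-learn pipelines (illustrative only)
import random
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_diabetes(return_X_y=True)
scalers = [StandardScaler, MinMaxScaler]
alphas = [0.01, 0.1, 1.0, 10.0]

def random_pipeline():
	# an "individual": a choice of scaler plus a model hyperparameter
	return {'scaler': random.choice(scalers), 'alpha': random.choice(alphas)}

def fitness(individual):
	# fitness = cross-validated performance of the candidate pipeline
	pipeline = make_pipeline(individual['scaler'](), Ridge(alpha=individual['alpha']))
	return cross_val_score(pipeline, X, y, cv=5, scoring='neg_mean_absolute_error').mean()

def mutate(individual):
	# randomly change one "gene" of the pipeline
	child = dict(individual)
	if random.random() < 0.5:
		child['scaler'] = random.choice(scalers)
	else:
		child['alpha'] = random.choice(alphas)
	return child

population = [random_pipeline() for _ in range(20)]
for generation in range(10):
	ranked = sorted(population, key=fitness, reverse=True)
	survivors = ranked[:10]                                   # keep the fittest pipelines
	offspring = [mutate(random.choice(survivors)) for _ in range(10)]
	population = survivors + offspring

print(max(population, key=fitness))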

The primary benefit of genetic programming for building machine learning models is exploration. Even a human with no time constraints will not be able to try out all combinations of preprocessing, models, and hyperparameters because of limited knowledge and imagination. Genetic programming does not display an initial bias towards any particular sequence of machine learning steps, and with each generation, new pipelines are evaluated. Furthermore, the fitness function means that the most promising areas of the search space are explored more thoroughly than poorer-performing areas.

Putting it together: Automated Machine Learning on the Cloud

With the background in place, we can now walk through using TPOT in a Google Colab notebook to automatically design a machine learning pipeline. (Follow along with the notebook here).

Our task is a supervised regression problem: given New York City energy data, we want to predict the Energy Star Score of a building. In a previous series of articles (part one, part two, part three, code on GitHub), we built a complete machine learning solution for this problem. Using manual feature engineering, dimensionality reduction, model selection, and hyperparameter tuning, we designed a Gradient Boosting Regressor model that achieved a mean absolute error of 9.06 points (on a scale from 1–100) on the test set.

The data contains several dozen continuous numeric variables (such as energy use and area of the building) and two one-hot encoded categorical variables (borough and building type) for a total of 82 features.

The score is the target for regression. All of the missing values have been encoded as np.nan and no feature preprocessing has been done to the data.
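
For reference, a minimal sketch of reading in the data and splitting off a test set is shown below; the file name and column name here are assumptions for illustration (the accompanying notebook handles this step), but they produce the training and testing variables used in the rest of the article.

import pandas as pd
from sklearn.model_selection import train_test_split

# read in the building energy data (hypothetical file name)
data = pd.read_csv('nyc_energy_data.csv')

# the Energy Star Score is the regression target; everything else is a feature
features = data.drop(columns=['score'])
targets = data['score']

# hold out a test set for the final evaluation
training_features, testing_features, training_targets, testing_targets = train_test_split(
    features, targets, test_size=0.3, random_state=42)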

To get started, we first need to make sure TPOT is installed in the Google Colab environment. Most data science packages are already installed, but we can add any new ones using system commands (preceded with a ! in Jupyter):

!pip install TPOT

After reading in the data, we would normally fill in the missing values (imputation) and normalize the features to a range (scaling). However, in addition to feature engineering, model selection, and hyperparameter tuning, TPOT will automatically impute the missing values and do feature scaling! So, our next step is to create the TPOT optimizer:

The default parameters for TPOT optimizers will evaluate 100 generations of pipelines, each with a population size of 100, for a total of 10,000 pipelines. Using 10-fold cross validation, this represents 100,000 training runs! Even though we are using Google’s resources, we do not have unlimited time for training. To avoid running out of time on the Colab server (we get a max of 12 hours of continuous run time), we will set a maximum of 8 hours (480 minutes) for evaluation. TPOT is designed to be run for days, but we can still get good results from a few hours of optimization.

We set the following parameters in the call to the optimizer:

  • scoring = neg_mean_absolute_error: Our regression performance metric
  • max_time_mins = 480: Limit evaluation to 8 hours
  • n_jobs = -1: Use all available cores on the machine
  • verbosity = 2: Show a limited amount of information while training
  • cv = 5: Use 5-fold cross validation (default is 10)

There are other parameters that control details of the genetic programming method, but leaving them at the default works well for most cases. (If you want to play around with the parameters, check out the documentation.)
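
Putting these settings together, creating the optimizer looks roughly like the sketch below; all other genetic programming settings are left at their defaults.

from tpot import TPOTRegressor

# create the tpot optimizer with the parameters described above
tpot = TPOTRegressor(scoring='neg_mean_absolute_error',
                     max_time_mins=480,
                     n_jobs=-1,
                     verbosity=2,
                     cv=5)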

The syntax for TPOT optimizers is designed to be identical to that for Scikit-Learn models, so we can train the optimizer using the .fit method.

# Fit the tpot optimizer on the training data
tpot.fit(training_features, training_targets)

During training, we get some information displayed:

Due to the time limit, our model was only able to get through 15 generations. With a population size of 100, this still represents 1,500 different individual pipelines that were evaluated, quite a few more than we could have tried by hand!

Once the model has trained, we can see the optimal pipeline using tpot.fitted_pipeline_. We can also save the model to a Python script:

# Export the pipeline as a python script file
tpot.export('tpot_exported_pipeline.py')

Since we are in a Google Colab notebook, to get the pipeline onto a local machine from the server, we have to use the Google Colab library:

# Import file management
from google.colab import files
# Download the pipeline for local use
files.download('tpot_exported_pipeline.py')

We can then open the file (available here) and look at the completed pipeline:

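The exact script depends on the run, but the exported file generally has the shape sketched below; the imputation strategy and hyperparameter values shown here are placeholders, not the values from this particular run.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoLarsCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer
from tpot.builtins import StackingEstimator

# imputation, a first-stage model whose predictions are added as a feature,
# and a final gradient boosting regressor
exported_pipeline = make_pipeline(
    Imputer(strategy='median'),
    StackingEstimator(estimator=LassoLarsCV(normalize=True)),
    GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
)

exported_pipeline.fit(training_features, training_targets)
results = exported_pipeline.predict(testing_features)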

We see that the optimizer imputed the missing values for us and built a complete model pipeline! The final estimator is a stacked model, meaning that it uses two machine learning algorithms (LassoLarsCV and GradientBoostingRegressor), the second of which is trained on the predictions of the first. (If you run the notebook again, you may get a different model because the optimization process is stochastic.) This is a complex method that I probably would not have been able to develop on my own!

Now, the moment of truth: performance on the testing set. To find the mean absolute error, we can use the .score method:

# Evaluate the final model
print(tpot.score(testing_features, testing_targets))
8.642

In the series of articles where we developed a solution manually, after many hours of development, we built a Gradient Boosting Regressor model that achieved a mean absolute error of 9.06. Automated machine learning has significantly improved on the performance with a drastic reduction in the amount of development time.

From here, we can use the optimized pipeline and try to further refine the solution, or we can move on to other important phases of the data science pipeline. If we use this as our final model, we could try and interpret the model (such as by using LIME: Local Interpretable Model-Agnostic Explanations) or write a well-documented report.

Conclusions

In this post, we got a brief introduction to both the capabilities of the cloud and automated machine learning. With only a Google account and an internet connection, we can use Google Colab to develop, run, and share machine learning or data science workloads. Using TPOT, we can automatically develop an optimized machine learning pipeline with feature preprocessing, model selection, and hyperparameter tuning. Moreover, we saw that auto-ml will not replace the data scientist, but it will allow her to spend more time on higher value parts of the workflow.

While being an early adopter does not always pay off, in this case, TPOT is mature enough to be easy to use and relatively issue-free, yet also new enough that learning it will put you ahead of the curve. With that in mind, find a machine learning problem (perhaps through Kaggle) and try to solve it! Running automatic machine learning in a notebook on Google Colab feels like the future and with such a low barrier to entry, there’s never been a better time to get started!

Learn More

Interactive Data Visualization with Python & Bokeh

Master the Python Interview | Real Banks&Startups questions

Ensemble Machine Learning in Python: Random Forest, AdaBoost

Python Network Programming - Part 1: Build 7 Python Apps

Deep Learning Prerequisites: Logistic Regression in Python

Implementing Deep Learning Papers - Deep Deterministic Policy Gradients (using Python)

In this intermediate deep learning tutorial, you will learn how to go from reading a paper on deep deterministic policy gradients to implementing the concepts in TensorFlow. This process can be applied to any deep learning paper, not just deep reinforcement learning.

In the second part, you will learn how to code a deep deterministic policy gradient (DDPG) agent using Python and PyTorch, to beat the continuous lunar lander environment (a classic machine learning problem).

DDPG combines the best of Deep Q Learning and Actor Critic Methods into an algorithm that can solve environments with continuous action spaces. We will have an actor network that learns the (deterministic) policy, coupled with a critic network to learn the action-value functions. We will make use of a replay buffer to maximize sample efficiency, as well as target networks to assist in algorithm convergence and stability.
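
As a rough idea of what those pieces look like in code, here is a minimal PyTorch skeleton of the actor and critic networks with their target copies. The layer sizes and the lunar lander dimensions are illustrative assumptions; the full agent in the video also adds the replay buffer, exploration noise, and the update rules.

import torch
import torch.nn as nn

class Actor(nn.Module):
	"""Deterministic policy: maps a state to a continuous action."""
	def __init__(self, state_dim, action_dim):
		super().__init__()
		self.net = nn.Sequential(
			nn.Linear(state_dim, 256), nn.ReLU(),
			nn.Linear(256, 256), nn.ReLU(),
			nn.Linear(256, action_dim), nn.Tanh())  # actions bounded in [-1, 1]

	def forward(self, state):
		return self.net(state)

class Critic(nn.Module):
	"""Action-value function: maps a (state, action) pair to Q(s, a)."""
	def __init__(self, state_dim, action_dim):
		super().__init__()
		self.net = nn.Sequential(
			nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
			nn.Linear(256, 256), nn.ReLU(),
			nn.Linear(256, 1))

	def forward(self, state, action):
		return self.net(torch.cat([state, action], dim=-1))

# the continuous lunar lander environment has an 8-dimensional state and a 2-dimensional action
actor, critic = Actor(8, 2), Critic(8, 2)
# target networks start as copies of the online networks and are updated slowly for stability
target_actor, target_critic = Actor(8, 2), Critic(8, 2)
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())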

Further reading about Deep Learning and Python

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python for Data Science and Machine Learning Bootcamp

Machine Learning, Data Science and Deep Learning with Python

Deep Learning A-Z™: Hands-On Artificial Neural Networks

Artificial Intelligence A-Z™: Learn How To Build An AI

A Complete Machine Learning Project Walk-Through in Python

Machine Learning: how to go from Zero to Hero

Top 18 Machine Learning Platforms For Developers

10 Amazing Articles On Python Programming And Machine Learning

100+ Basic Machine Learning Interview Questions and Answers

Machine Learning for Front-End Developers

Top 30 Python Libraries for Machine Learning

Deep Learning Tutorial with Python | Machine Learning with Neural Networks

In this video, Deep Learning Tutorial with Python | Machine Learning with Neural Networks Explained, Frank Kane helps de-mystify the world of deep learning and artificial neural networks with Python!

Explore the full course on Udemy (special discount included in the link): http://learnstartup.net/p/BkS5nEmZg

In less than 3 hours, you can understand the theory behind modern artificial intelligence, and apply it with several hands-on examples. This is machine learning on steroids! Find out why everyone’s so excited about it and how it really works – and what modern AI can and cannot really do.

In this course, we will cover:

•    Deep Learning Prerequisites (gradient descent, autodiff, softmax)

•    The History of Artificial Neural Networks

•    Deep Learning in the Tensorflow Playground

•    Deep Learning Details

•    Introducing Tensorflow

•    Using Tensorflow

•    Introducing Keras

•    Using Keras to Predict Political Parties

•    Convolutional Neural Networks (CNNs)

•    Using CNNs for Handwriting Recognition

•    Recurrent Neural Networks (RNNs)

•    Using a RNN for Sentiment Analysis

•    The Ethics of Deep Learning

•    Learning More about Deep Learning

At the end, you will have a final challenge to create your own deep learning / machine learning system to predict whether real mammogram results are benign or malignant, using your own artificial neural network you have learned to code from scratch with Python.

Separate the reality of modern AI from the hype – by learning about deep learning, well, deeply. You will need some familiarity with Python and linear algebra to follow along, but if you have that experience, you will find that neural networks are not as complicated as they sound. And how they actually work is quite elegant!

This is hands-on tutorial with real code you can download, study, and run yourself.

Further reading

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python for Data Science and Machine Learning Bootcamp

Machine Learning, Data Science and Deep Learning with Python

Deep Learning A-Z™: Hands-On Artificial Neural Networks

Artificial Intelligence A-Z™: Learn How To Build An AI

A Complete Machine Learning Project Walk-Through in Python

Machine Learning: how to go from Zero to Hero

Top 18 Machine Learning Platforms For Developers

10 Amazing Articles On Python Programming And Machine Learning

100+ Basic Machine Learning Interview Questions and Answers

How to Perform Face Detection with Deep Learning in Python

In this tutorial, you will discover how to perform face detection in Python using classical and deep learning models.

Face detection is a computer vision problem that involves finding faces in photos.

It is a trivial problem for humans to solve and has been solved reasonably well by classical feature-based techniques, such as the cascade classifier. More recently, deep learning methods have achieved state-of-the-art results on standard benchmark face detection datasets. One example is the Multi-task Cascade Convolutional Neural Network, or MTCNN for short.

In this tutorial, you will discover how to perform face detection in Python using classical and deep learning models.

After completing this tutorial, you will know:

  • Face detection is a non-trivial computer vision problem for identifying and localizing faces in images.
  • Face detection can be performed using the classical feature-based cascade classifier using the OpenCV library.
  • State-of-the-art face detection can be achieved using a Multi-task Cascade CNN via the MTCNN library.

Discover how to build models for photo classification, object detection, face recognition, and more in my new computer vision book, with 30 step-by-step tutorials and full source code.

Let’s get started.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Face Detection
  2. Test Photographs
  3. Face Detection With OpenCV
  4. Face Detection With Deep Learning

Face Detection

Face detection is a problem in computer vision of locating and localizing one or more faces in a photograph.

Locating a face in a photograph refers to finding the coordinate of the face in the image, whereas localization refers to demarcating the extent of the face, often via a bounding box around the face.

A general statement of the problem can be defined as follows: Given a still or video image, detect and localize an unknown number (if any) of faces — Face Detection: A Survey, 2001.

Detecting faces in a photograph is easily solved by humans, although it has historically been challenging for computers given the dynamic nature of faces. For example, faces must be detected regardless of the orientation or angle they are facing, light levels, clothing, accessories, hair color, facial hair, makeup, age, and so on.

Given a photograph, a face detection system will output zero or more bounding boxes that contain faces. Detected faces can then be provided as input to a subsequent system, such as a face recognition system.

There are perhaps two main approaches to face detection: feature-based methods that use hand-crafted filters to search for and detect faces, and image-based methods that learn holistically how to extract faces from the entire image.

Test Photographs

We need test images for face detection in this tutorial.

To keep things simple, we will use two test images: one with two faces, and one with many faces. We’re not trying to push the limits of face detection, just demonstrate how to perform face detection with normal front-on photographs of people.

The first image is a photo of two college students taken by CollegeDegrees360 and made available under a permissive license.

Download the image and place it in your current working directory with the filename ‘test1.jpg‘.

College Students (test1.jpg)

Photo by CollegeDegrees360, some rights reserved.

The second image is a photograph of a number of people on a swim team taken by Bob n Renee and released under a permissive license.

Download the image and place it in your current working directory with the filename ‘test2.jpg‘.

Swim Team (test2.jpg)

Photo by Bob n Renee, some rights reserved.

Face Detection With OpenCV

Feature-based face detection algorithms are fast and effective and have been used successfully for decades.

Perhaps the most successful example is a technique called cascade classifiers, first described by Paul Viola and Michael Jones in their 2001 paper titled “Rapid Object Detection using a Boosted Cascade of Simple Features.”

In the paper, effective features are learned using the AdaBoost algorithm, although importantly, multiple models are organized into a hierarchy or “cascade.”

In the paper, the AdaBoost model is used to learn a range of very simple or weak features in each face that together provide a robust classifier.

The models are then organized into a hierarchy of increasing complexity, called a “cascade“.

Simpler classifiers operate on candidate face regions directly, acting like a coarse filter, whereas complex classifiers operate only on those candidate regions that show the most promise as faces.

The result is a very fast and effective face detection algorithm that has been the basis for face detection in consumer products, such as cameras.

It is a modestly complex classifier that has also been tweaked and refined over the last nearly 20 years.

A modern implementation of the Classifier Cascade face detection algorithm is provided in the OpenCV library. This is a C++ computer vision library that provides a Python interface. The benefit of this implementation is that it provides pre-trained face detection models and an interface to train a model on your own dataset.

OpenCV can be installed by the package manager system on your platform, or via pip; for example:

sudo pip install opencv-python

Once the installation process is complete, it is important to confirm that the library was installed correctly.

This can be achieved by importing the library and checking the version number; for example:

# check opencv version
import cv2
# print version number
print(cv2.__version__)

Running the example will import the library and print the version. In this case, we are using version 3 of the library.

3.4.3

OpenCV provides the CascadeClassifier class that can be used to create a cascade classifier for face detection. The constructor can take a filename as an argument that specifies the XML file for a pre-trained model.

OpenCV provides a number of pre-trained models as part of the installation. These are available on your system and are also available on the OpenCV GitHub project.

Download a pre-trained model for frontal face detection from the OpenCV GitHub project and place it in your current working directory with the filename ‘haarcascade_frontalface_default.xml‘.

Once downloaded, we can load the model as follows:

# load the pre-trained model
classifier = CascadeClassifier('haarcascade_frontalface_default.xml')

Once loaded, the model can be used to perform face detection on a photograph by calling the detectMultiScale() function.

This function will return a list of bounding boxes for all faces detected in the photograph.

# perform face detection
bboxes = classifier.detectMultiScale(pixels)
# print bounding box for each detected face
for box in bboxes:
	print(box)

We can demonstrate this with an example using the college students photograph (test1.jpg).

The photo can be loaded using OpenCV via the imread() function.

# load the photograph
pixels = imread('test1.jpg')

The complete example of performing face detection on the college students photograph with a pre-trained cascade classifier in OpenCV is listed below.

# example of face detection with opencv cascade classifier
from cv2 import imread
from cv2 import CascadeClassifier
# load the photograph
pixels = imread('test1.jpg')
# load the pre-trained model
classifier = CascadeClassifier('haarcascade_frontalface_default.xml')
# perform face detection
bboxes = classifier.detectMultiScale(pixels)
# print bounding box for each detected face
for box in bboxes:
	print(box)

Running the example first loads the photograph, then loads and configures the cascade classifier; faces are detected and each bounding box is printed.

Each box lists the x and y coordinates of the top-left corner of the bounding box, as well as the width and the height. The results suggest that two bounding boxes were detected.

[174  75 107 107]
[360 102 101 101]

We can update the example to plot the photograph and draw each bounding box.

This can be achieved by drawing a rectangle for each box directly over the pixels of the loaded image using the rectangle() function that takes two points.

# extract
x, y, width, height = box
x2, y2 = x + width, y + height
# draw a rectangle over the pixels
rectangle(pixels, (x, y), (x2, y2), (0,0,255), 1)

We can then plot the photograph and keep the window open until we press a key to close it.

# show the image
imshow('face detection', pixels)
# keep the window open until we press a key
waitKey(0)
# close the window
destroyAllWindows()

The complete example is listed below.

# plot photo with detected faces using opencv cascade classifier
from cv2 import imread
from cv2 import imshow
from cv2 import waitKey
from cv2 import destroyAllWindows
from cv2 import CascadeClassifier
from cv2 import rectangle
# load the photograph
pixels = imread('test1.jpg')
# load the pre-trained model
classifier = CascadeClassifier('haarcascade_frontalface_default.xml')
# perform face detection
bboxes = classifier.detectMultiScale(pixels)
# print bounding box for each detected face
for box in bboxes:
	# extract
	x, y, width, height = box
	x2, y2 = x + width, y + height
	# draw a rectangle over the pixels
	rectangle(pixels, (x, y), (x2, y2), (0,0,255), 1)
# show the image
imshow('face detection', pixels)
# keep the window open until we press a key
waitKey(0)
# close the window
destroyAllWindows()

Running the example, we can see that the photograph was plotted correctly and that each face was correctly detected.

College Students Photograph With Faces Detected using OpenCV Cascade Classifier

We can try the same code on the second photograph of the swim team, specifically ‘test2.jpg‘.

# load the photograph
pixels = imread('test2.jpg')

Running the example, we can see that many of the faces were detected correctly, but the result is not perfect.

We can see that a face on the first or bottom row of people was detected twice, that a face on the middle row of people was not detected, and that the background on the third or top row was detected as a face.

Swim Team Photograph With Faces Detected using OpenCV Cascade Classifier

The detectMultiScale() function provides some arguments to help tune the usage of the classifier. Two parameters of note are scaleFactor and minNeighbors; for example:

# perform face detection
bboxes = classifier.detectMultiScale(pixels, 1.1, 3)

The scaleFactor controls how the input image is scaled prior to detection, e.g. is it scaled up or down, which can help to better find the faces in the image. The default value is 1.1 (10% increase), although this can be lowered to values such as 1.05 (5% increase) or raised to values such as 1.4 (40% increase).

The minNeighbors determines how robust each detection must be in order to be reported, e.g. the number of candidate rectangles that found the face. The default is 3, but this can be lowered to 1 to detect a lot more faces and will likely increase the false positives, or increase to 6 or more to require a lot more confidence before a face is detected.

The scaleFactor and minNeighbors often require tuning for a given image or dataset in order to best detect the faces. It may be helpful to perform a sensitivity analysis across a grid of values and see what works well or best in general on one or multiple photographs.
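
For example, a simple sweep over both parameters (reusing the classifier and pixels variables from the earlier examples) might look like this:

# try a grid of scaleFactor and minNeighbors values and report the detection counts
for scale_factor in [1.05, 1.1, 1.2, 1.4]:
	for min_neighbors in [1, 3, 5, 8]:
		bboxes = classifier.detectMultiScale(pixels, scale_factor, min_neighbors)
		print('scaleFactor=%.2f minNeighbors=%d faces=%d' % (scale_factor, min_neighbors, len(bboxes)))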

A fast strategy may be to lower (or increase for small photos) the scaleFactor until all faces are detected, then increase the minNeighbors until all false positives disappear, or close to it.

With some tuning, I found that a scaleFactor of 1.05 successfully detected all of the faces, but the background detected as a face did not disappear until a minNeighbors of 8, after which three faces on the middle row were no longer detected.

# perform face detection
bboxes = classifier.detectMultiScale(pixels, 1.05, 8)

The results are not perfect, and perhaps better results can be achieved with further tuning, and perhaps post-processing of the bounding boxes.

Swim Team Photograph With Faces Detected Using OpenCV Cascade Classifier After Some Tuning

Face Detection With Deep Learning

A number of deep learning methods have been developed and demonstrated for face detection.

Perhaps one of the more popular approaches is called the “Multi-Task Cascaded Convolutional Neural Network,” or MTCNN for short, described by Kaipeng Zhang, et al. in the 2016 paper titled “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks.”

The MTCNN is popular because it achieved then state-of-the-art results on a range of benchmark datasets, and because it is capable of also recognizing other facial features such as eyes and mouth, called landmark detection.

The network uses a cascade structure with three networks; first the image is rescaled to a range of different sizes (called an image pyramid), then the first model (Proposal Network or P-Net) proposes candidate facial regions, the second model (Refine Network or R-Net) filters the bounding boxes, and the third model (Output Network or O-Net) proposes facial landmarks.

The image below, taken from the paper, provides a helpful summary of the three stages from top-to-bottom and the output of each stage left-to-right.

Pipeline for the Multi-Task Cascaded Convolutional Neural Network. Taken from: Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks.

The model is called a multi-task network because each of the three models in the cascade (P-Net, R-Net and O-Net) is trained on three tasks, e.g. makes three types of predictions; they are: face classification, bounding box regression, and facial landmark localization.

The three models are not connected directly; instead, outputs of the previous stage are fed as input to the next stage. This allows additional processing to be performed between stages; for example, non-maximum suppression (NMS) is used to filter the candidate bounding boxes proposed by the first-stage P-Net prior to providing them to the second stage R-Net model.

The MTCNN architecture is reasonably complex to implement. Thankfully, there are open source implementations of the architecture that can be trained on new datasets, as well as pre-trained models that can be used directly for face detection. Of note is the official release with the code and models used in the paper, with the implementation provided in the Caffe deep learning framework.

Perhaps the best-of-breed third-party Python-based MTCNN project is called “MTCNN” by Iván de Paz Centeno, or ipazc, made available under a permissive MIT open source license. As a third-party open-source project, it is subject to change, therefore I have a fork of the project at the time of writing available here.

The MTCNN project, which we will refer to as ipazc/MTCNN to differentiate it from the name of the network, provides an implementation of the MTCNN architecture using TensorFlow and OpenCV. There are two main benefits to this project: first, it provides a top-performing pre-trained model, and second, it can be installed as a library ready for use in your own code.

The library can be installed via pip; for example:

sudo pip install mtcnn

After successful installation, you should see a message like:

Successfully installed mtcnn-0.0.8

You can then confirm that the library was installed correctly via pip; for example:

sudo pip show mtcnn 

You should see output like that listed below. In this case, you can see that we are using version 0.0.8 of the library.

Name: mtcnn
Version: 0.0.8
Summary: Multi-task Cascaded Convolutional Neural Networks for Face Detection, based on TensorFlow
Home-page: http://github.com/ipazc/mtcnn
Author: Iván de Paz Centeno
Author-email: [email protected]
License: MIT
Location: ...
Requires:
Required-by:

You can also confirm that the library was installed correctly via Python, as follows:

# confirm mtcnn was installed correctly
import mtcnn
# print version
print(mtcnn.__version__)

Running the example will load the library, confirming it was installed correctly; and print the version.

0.0.8

Now that we are confident that the library was installed correctly, we can use it for face detection.

An instance of the network can be created by calling the MTCNN() constructor.

By default, the library will use the pre-trained model, although you can specify your own model via the ‘weights_file‘ argument and specify a path or URL, for example:

model = MTCNN(weights_file='filename.npy')

The minimum box size for detecting a face can be specified via the ‘min_face_size‘ argument, which defaults to 20 pixels. The constructor also provides a ‘scale_factor‘ argument to specify the scale factor for the input image, which defaults to 0.709.
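
For example, both arguments can be set explicitly when creating the detector; the values shown here are simply the defaults described above restated.

# create the detector with explicit values for the optional arguments
detector = MTCNN(min_face_size=20, scale_factor=0.709)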

Once the model is configured and loaded, it can be used directly to detect faces in photographs by calling the detect_faces() function.

This returns a list of dict objects, each providing a number of keys for the details of each face detected, including:

  • 'box': the x and y coordinates of the top-left corner of the bounding box, as well as the width and height of the box
  • 'confidence': the probability confidence of the prediction
  • 'keypoints': a dict with the coordinates of the 'left_eye', 'right_eye', 'nose', 'mouth_left', and 'mouth_right'

For example, we can perform face detection on the college students photograph as follows:

# face detection with mtcnn on a photograph
from matplotlib import pyplot
from mtcnn.mtcnn import MTCNN
# load image from file
filename = 'test1.jpg'
pixels = pyplot.imread(filename)
# create the detector, using default weights
detector = MTCNN()
# detect faces in the image
faces = detector.detect_faces(pixels)
for face in faces:
	print(face)

Running the example loads the photograph, loads the model, performs face detection, and prints a list of each face detected.

{'box': [186, 71, 87, 115], 'confidence': 0.9994562268257141, 'keypoints': {'left_eye': (207, 110), 'right_eye': (252, 119), 'nose': (220, 143), 'mouth_left': (200, 148), 'mouth_right': (244, 159)}}
{'box': [368, 75, 108, 138], 'confidence': 0.998593270778656, 'keypoints': {'left_eye': (392, 133), 'right_eye': (441, 140), 'nose': (407, 170), 'mouth_left': (388, 180), 'mouth_right': (438, 185)}}

We can draw the boxes on the image by first plotting the image with matplotlib, then creating a Rectangle object using the x and y coordinates and the width and height of a given bounding box; for example:

# get coordinates
x, y, width, height = result['box']
# create the shape
rect = Rectangle((x, y), width, height, fill=False, color='red')

Below is a function named draw_image_with_boxes() that shows the photograph and then draws a box for each bounding box detected.

# draw an image with detected objects
def draw_image_with_boxes(filename, result_list):
	# load the image
	data = pyplot.imread(filename)
	# plot the image
	pyplot.imshow(data)
	# get the context for drawing boxes
	ax = pyplot.gca()
	# plot each box
	for result in result_list:
		# get coordinates
		x, y, width, height = result['box']
		# create the shape
		rect = Rectangle((x, y), width, height, fill=False, color='red')
		# draw the box
		ax.add_patch(rect)
	# show the plot
	pyplot.show()

The complete example making use of this function is listed below.

# face detection with mtcnn on a photograph
from matplotlib import pyplot
from matplotlib.patches import Rectangle
from mtcnn.mtcnn import MTCNN


# draw an image with detected objects
def draw_image_with_boxes(filename, result_list):
	# load the image
	data = pyplot.imread(filename)
	# plot the image
	pyplot.imshow(data)
	# get the context for drawing boxes
	ax = pyplot.gca()
	# plot each box
	for result in result_list:
		# get coordinates
		x, y, width, height = result['box']
		# create the shape
		rect = Rectangle((x, y), width, height, fill=False, color='red')
		# draw the box
		ax.add_patch(rect)
	# show the plot
	pyplot.show()


filename = 'test1.jpg'
# load image from file
pixels = pyplot.imread(filename)
# create the detector, using default weights
detector = MTCNN()
# detect faces in the image
faces = detector.detect_faces(pixels)
# display faces on the original image
draw_image_with_boxes(filename, faces)

Running the example plots the photograph then draws a bounding box for each of the detected faces.

We can see that both faces were detected correctly.

College Students Photograph With Bounding Boxes Drawn for Each Detected Face Using MTCNN

We can draw a circle via the Circle class for the eyes, nose, and mouth; for example:

# draw the dots
for key, value in result['keypoints'].items():
	# create and draw dot
	dot = Circle(value, radius=2, color='red')
	ax.add_patch(dot)

The complete example with this addition to the draw_image_with_boxes() function is listed below.

# face detection with mtcnn on a photograph
from matplotlib import pyplot
from matplotlib.patches import Rectangle
from matplotlib.patches import Circle
from mtcnn.mtcnn import MTCNN


# draw an image with detected objects
def draw_image_with_boxes(filename, result_list):
	# load the image
	data = pyplot.imread(filename)
	# plot the image
	pyplot.imshow(data)
	# get the context for drawing boxes
	ax = pyplot.gca()
	# plot each box
	for result in result_list:
		# get coordinates
		x, y, width, height = result['box']
		# create the shape
		rect = Rectangle((x, y), width, height, fill=False, color='red')
		# draw the box
		ax.add_patch(rect)
		# draw the dots
		for key, value in result['keypoints'].items():
			# create and draw dot
			dot = Circle(value, radius=2, color='red')
			ax.add_patch(dot)
	# show the plot
	pyplot.show()


filename = 'test1.jpg'
# load image from file
pixels = pyplot.imread(filename)
# create the detector, using default weights
detector = MTCNN()
# detect faces in the image
faces = detector.detect_faces(pixels)
# display faces on the original image
draw_image_with_boxes(filename, faces)

The example plots the photograph again with bounding boxes and facial key points.

We can see that eyes, nose, and mouth are detected well on each face, although the mouth on the right face could be better detected, with the points looking a little lower than the corners of the mouth.

College Students Photograph With Bounding Boxes and Facial Keypoints Drawn for Each Detected Face Using MTCNN

We can now try face detection on the swim team photograph, e.g. the image test2.jpg.

Running the example, we can see that all thirteen faces were correctly detected and that it looks roughly like all of the facial keypoints are also correct.

Swim Team Photograph With Bounding Boxes and Facial Keypoints Drawn for Each Detected Face Using MTCNN

We may want to extract the detected faces and pass them as input to another system.

This can be achieved by extracting the pixel data directly out of the photograph; for example:

# get coordinates
x1, y1, width, height = result['box']
x2, y2 = x1 + width, y1 + height
# extract face
face = data[y1:y2, x1:x2]

We can demonstrate this by extracting each face and plotting them as separate subplots. You could just as easily save them to file. The draw_faces() function below extracts and plots each detected face in a photograph.

# draw each face separately
def draw_faces(filename, result_list):
	# load the image
	data = pyplot.imread(filename)
	# plot each face as a subplot
	for i in range(len(result_list)):
		# get coordinates
		x1, y1, width, height = result_list[i]['box']
		x2, y2 = x1 + width, y1 + height
		# define subplot
		pyplot.subplot(1, len(result_list), i+1)
		pyplot.axis('off')
		# plot face
		pyplot.imshow(data[y1:y2, x1:x2])
	# show the plot
	pyplot.show()

The complete example demonstrating this function for the swim team photo is listed below.

# extract and plot each detected face in a photograph
from matplotlib import pyplot
from matplotlib.patches import Rectangle
from matplotlib.patches import Circle
from mtcnn.mtcnn import MTCNN


# draw each face separately
def draw_faces(filename, result_list):
	# load the image
	data = pyplot.imread(filename)
	# plot each face as a subplot
	for i in range(len(result_list)):
		# get coordinates
		x1, y1, width, height = result_list[i]['box']
		x2, y2 = x1 + width, y1 + height
		# define subplot
		pyplot.subplot(1, len(result_list), i+1)
		pyplot.axis('off')
		# plot face
		pyplot.imshow(data[y1:y2, x1:x2])
	# show the plot
	pyplot.show()


filename = 'test2.jpg'
# load image from file
pixels = pyplot.imread(filename)
# create the detector, using default weights
detector = MTCNN()
# detect faces in the image
faces = detector.detect_faces(pixels)
# display faces on the original image
draw_faces(filename, faces)

Running the example creates a plot that shows each separate face detected in the photograph of the swim team.

Plot of Each Separate Face Detected in a Photograph of a Swim Team