TPOT for Automated Machine Learning in Python

by Jason Brownlee  on September 9, 2020  in Python Machine Learning

Tweet  Share

Share

Automated Machine Learning (AutoML) refers to techniques for automatically discovering well-performing models for predictive modeling tasks with very little user involvement.

TPOT is an open-source library for performing AutoML in Python. It makes use of the popular Scikit-Learn machine learning library for data transforms and machine learning algorithms and uses a Genetic Programming stochastic global search procedure to efficiently discover a top-performing model pipeline for a given dataset.

In this tutorial, you will discover how to use TPOT for AutoML with Scikit-Learn machine learning algorithms in Python.

After completing this tutorial, you will know:

  • TPOT is an open-source library for AutoML with scikit-learn data preparation and machine learning models.
  • How to use TPOT to automatically discover top-performing models for classification tasks.
  • How to use TPOT to automatically discover top-performing models for regression tasks.

Let’s get started.

TPOT for Automated Machine Learning in PythonTPOT for Automated Machine Learning in Python

Photo by Gwen, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. TPOT for Automated Machine Learning
  2. Install and Use TPOT
  3. TPOT for Classification
  4. TPOT for Regression

TPOT for Automated Machine Learning

Tree-based Pipeline Optimization Tool, or TPOT for short, is a Python library for automated machine learning.

TPOT uses a tree-based structure to represent a model pipeline for a predictive modeling problem, including data preparation and modeling algorithms and model hyperparameters.

… an evolutionary algorithm called the Tree-based Pipeline Optimization Tool (TPOT) that automatically designs and optimizes machine learning pipelines.

— Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.

An optimization procedure is then performed to find a tree structure that performs best for a given dataset. Specifically, a genetic programming algorithm, designed to perform a stochastic global optimization on programs represented as trees.

TPOT uses a version of genetic programming to automatically design and optimize a series of data transformations and machine learning models that attempt to maximize the classification accuracy for a given supervised learning data set.

— Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.

The figure below taken from the TPOT paper shows the elements involved in the pipeline search, including data cleaning, feature selection, feature processing, feature construction, model selection, and hyperparameter optimization.

Overview of the TPOT Pipeline SearchOverview of the TPOT Pipeline Search

Taken from: Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.

Now that we are familiar with what TPOT is, let’s look at how we can install and use TPOT to find an effective model pipeline.

#python machine learning #python

TPOT for Automated Machine Learning in Python
1.25 GEEK