This is the 10th in a series of small, bite-sized articles I am writing about algorithms that are commonly used in anomaly detection (I’ll put links to all other articles towards the end). In today’s article, I’ll focus on a tree-based machine learning algorithm — Isolation Forest — that can efficiently isolate outliers from a multi-dimensional dataset.

My objective here is to give an intuition of how the algorithm works and to demonstrate how to implement it in a few lines of code. I am not going deep into the theory, just deep enough to help readers understand the basics. You can always look up the details if there is a specific part of the algorithm you are interested in. So let's dive right in!

What is Isolation Forest?

Isolation Forest, or iForest, is one of the more recent anomaly detection algorithms: it was first proposed in 2008 [1] and described in more detail in a follow-up paper in 2012 [2]. Around 2016 it was incorporated into the Python Scikit-Learn library.

It is a tree-based algorithm, built on the same ideas as decision trees and random forests. Given a dataset, the algorithm picks a feature and a random threshold value and splits the data into two parts; this process repeats recursively until every data point is isolated in its own partition. Once the algorithm has run through the whole dataset, it flags the data points that took fewer splits than the others to isolate. The intuition is that outliers sit in sparse regions far from the bulk of the data, so a random split is likely to separate them early, while normal points buried in dense clusters need many more splits. In sklearn, Isolation Forest is part of the ensemble module, and it returns an anomaly score for each instance as a measure of abnormality.
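As a minimal sketch of the workflow described above, the snippet below fits sklearn's `IsolationForest` on a small synthetic dataset (the data, the `contamination` value, and the injected outliers are invented here purely for illustration):

```python
# A minimal sketch: flagging obvious outliers with sklearn's IsolationForest.
# The dataset below is synthetic, invented for this demonstration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 100 "normal" points clustered around the origin...
normal = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
# ...plus three points placed far away to act as outliers
outliers = np.array([[6.0, 6.0], [-7.0, 5.0], [8.0, -6.0]])
X = np.vstack([normal, outliers])

# contamination is an assumed fraction of outliers in the data (3 of 103 here)
clf = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
labels = clf.fit_predict(X)    # +1 for inliers, -1 for flagged outliers
scores = clf.score_samples(X)  # lower score = more abnormal

print("indices flagged as outliers:", np.where(labels == -1)[0])
```

Because the three injected points are isolated after very few random splits, they receive the lowest scores and are the ones flagged with `-1`.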


Isolation Forest: A Tree-based Algorithm for Anomaly Detection