Lately, I have been reading about Isolation Forest and its performance in outlier/anomaly detection. After carefully reading about the algorithm (and not finding any vanilla tutorial to my taste), I decided to code it from scratch in the simplest possible way in order to grasp the algorithm better.
The goal of this post is that, after reading it, you understand Isolation Forest in depth: its strengths, weaknesses, and parameters, so that you can use it whenever you see fit, with real knowledge of the algorithm.
For this blog/implementation, I have used this paper about Isolation Forest for the pseudo-code, this Extended Isolation Forest paper for the visualizations (which corresponds with this other blog post), and this YouTube tutorial on a Random Forest implementation from Sebastian Mantey.
Before getting to the code, I believe a bit of theory is needed.
This method isolates anomalies from normal instances. To do so, it relies on two assumptions about anomalies: they are the minority, consisting of far fewer instances than normal points, and they have attribute values that are very different from those of normal instances.
In other words, anomalies are “few and different.”
Because of these two assumptions, anomalies are more susceptible to isolation, and this makes them end up closer to the root of the tree.
Isolation Forest builds an ensemble of Binary Trees for a given dataset. Anomalies, due to their nature, have shorter paths in the trees than normal instances.
Isolation Forest converges quickly even with a small number of trees, and subsampling lets us achieve good results while staying computationally efficient.
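To make "shorter paths" concrete: in the original paper [1], the average path length E(h(x)) of an instance x over all trees is normalized by c(n), the average path length of an unsuccessful search in a binary search tree of n points, and turned into an anomaly score s(x, n) = 2^(-E(h(x))/c(n)). Scores close to 1 indicate anomalies, while scores well below 0.5 indicate normal instances; we will come back to this when scoring instances against the forest.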
The overall code strategy will be the following: first code a tree, then build a forest of trees (the ensemble), and finally measure how far down each tree a given instance goes to determine whether or not it is an outlier.
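As a rough preview of the ensembling step, here is a minimal sketch, assuming an isolation_tree function like the one we are about to write; the function name isolation_forest and the parameters n_trees and sample_size are illustrative rather than the exact ones used later (the paper's defaults are 100 trees and a subsample of 256 points):
def isolation_forest(df, n_trees=100, max_depth=50, sample_size=256):
    # Grow each tree on a small random subsample of the data,
    # which is what keeps the method computationally cheap
    forest = []
    for _ in range(n_trees):
        sample = df.sample(min(sample_size, len(df)))
        forest.append(isolation_tree(sample, max_depth=max_depth))
    return forest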
Let’s start with the tree
Isolation Tree pseudocode from [1]
The input will be a sample of the data, the current tree height, and the maximum depth.
For the output, we will have a built tree.
To make the code easier to follow, I am working with pandas DataFrames; this is not optimal in terms of performance, but it keeps things readable for regular users.
import random
import numpy as np

def select_feature(data):
    # Pick a random column to split on
    return random.choice(data.columns)

def select_value(data, feat):
    # Pick a random split value between the feature's min and max
    mini = data[feat].min()
    maxi = data[feat].max()
    return (maxi - mini) * np.random.random() + mini

def split_data(data, split_column, split_value):
    data_below = data[data[split_column] <= split_value]
    data_above = data[data[split_column] > split_value]
    return data_below, data_above
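To sanity-check these helpers, here is a quick run on a made-up DataFrame (the column names and values are purely illustrative):
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 100.0],
                    "y": [5.0, 6.0, 7.0, 8.0]})

feat = select_feature(toy)          # e.g. "x"
value = select_value(toy, feat)     # random value between the feature's min and max
data_below, data_above = split_data(toy, feat, value)
print(feat, value, data_below.shape[0], data_above.shape[0])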
The idea is the following: select a feature, select a random value within that feature's range, and split the data. If there is only one data point left in the branch or the tree has reached the maximum depth, stop.
def isolation_tree(data, counter=0, max_depth=50, random_subspace=False):
    # End recursion if max depth is reached or the point is isolated
    if (counter == max_depth) or data.shape[0] <= 1:
        classification = classify_data(data)
        return classification

    else:
        # Counter
        counter += 1

        # Select feature
        split_column = select_feature(data)
        # Select value
        split_value = select_value(data, split_column)
        # Split data
        data_below, data_above = split_data(data, split_column, split_value)

        # Instantiate sub-tree
        question = "{} <= {}".format(split_column, split_value)
        sub_tree = {question: []}

        # Recursive part
        below_answer = isolation_tree(data_below, counter, max_depth=max_depth)
        above_answer = isolation_tree(data_above, counter, max_depth=max_depth)

        if below_answer == above_answer:
            sub_tree = below_answer
        else:
            sub_tree[question].append(below_answer)
            sub_tree[question].append(above_answer)

        return sub_tree
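The leaf case calls classify_data, which is not shown in this snippet. A minimal placeholder, assuming all a leaf needs to record is how many instances ended up in it, could look like this:
def classify_data(data):
    # Hypothetical leaf content: the number of instances that reached this node.
    # A single instance means the point was fully isolated at this depth.
    return data.shape[0]

# Grow a single tree on the toy DataFrame from the sanity check above
tree = isolation_tree(toy, max_depth=5)
print(tree)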