Lately, I have been reading about Isolation Forest and its performance in outlier/anomaly detection. After reading carefully about the algorithm (and not finding a vanilla tutorial to my taste), I decided to code it from scratch in the simplest possible way in order to grasp it better.

The goal of this post is that, after reading it, you understand Isolation Forest in depth (its strengths, weaknesses, and parameters) and can use it with real knowledge of the algorithm whenever you see fit.

For this post and implementation, I used this paper about Isolation Forest for the pseudo-code, this Extended Isolation Forest paper for the visualizations (which correspond with this other blog post), and this YouTube tutorial by Sebastian Mantey, which implements a Random Forest, as the example to follow.

A bit of theoretical background

Before getting to the code, I believe a bit of theory is needed.

What is Isolation Forest?

  • Isolation Forest is used for outlier/anomaly detection
  • Isolation Forest is an Unsupervised Learning technique (it does not need labels)
  • It uses bagging of Binary Decision Trees (it resembles Random Forest from supervised learning)

Hypothesis

This method isolates anomalies from normal instances. To do this, it makes the following assumptions about anomalies:

  • They are a minority consisting of fewer instances
  • They have attribute-values that are different from normal instances

In other words, anomalies are “few and different.”

Because of these two assumptions, anomalies are susceptible to isolation, which makes them fall closer to the root of the tree.
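To make the "few and different" intuition concrete, here is a minimal one-dimensional sketch. It is not part of the implementation below: the helper names splits_to_isolate and average_splits are hypothetical, and the data is synthetic. It counts how many random splits are needed before a point is alone in its partition, and compares a far-away point with a typical one.

```python
import numpy as np


def splits_to_isolate(values, target, rng):
    """Count random splits until `target` is alone in its partition.

    `target` is assumed to be one of the entries of `values`.
    """
    values = np.asarray(values, dtype=float)
    n_splits = 0
    while len(values) > 1:
        low, high = values.min(), values.max()
        if low == high:
            break  # identical values cannot be separated any further
        split = rng.uniform(low, high)
        # Keep only the side of the split that contains the target
        if target <= split:
            values = values[values <= split]
        else:
            values = values[values > split]
        n_splits += 1
    return n_splits


def average_splits(values, target, n_trials=200, seed=0):
    """Average the split count over several random trials."""
    rng = np.random.default_rng(seed)
    counts = [splits_to_isolate(values, target, rng) for _ in range(n_trials)]
    return float(np.mean(counts))


# Synthetic data: 200 "normal" points plus one point far away from them
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=200)
outlier = 10.0
data = np.append(normal, outlier)
inlier = float(np.sort(normal)[100])  # a typical point taken from the data

avg_outlier = average_splits(data, outlier)
avg_inlier = average_splits(data, inlier)
print("average splits to isolate the outlier:", avg_outlier)
print("average splits to isolate an inlier:  ", avg_inlier)
```

On data like this, the far-away point should need noticeably fewer splits on average than the typical one, which is the behaviour the trees exploit.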

Brief description

Isolation Forest builds an ensemble of Binary Trees for a given dataset. Anomalies, due to their nature, have shorter paths in the trees than normal instances.

Isolation Forest converges quickly with a very small number of trees and subsampling enables us to achieve good results while being computationally efficient.
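As a rough illustration of the subsampling step, the sketch below draws one small random sample per tree with pandas. The function name draw_subsamples and the toy data are hypothetical; the defaults of 100 trees and 256 rows per sample are the values the original Isolation Forest paper suggests, and nothing here is tuned.

```python
import numpy as np
import pandas as pd


def draw_subsamples(data, n_trees=100, subsample_size=256, seed=0):
    """Draw one subsample (without replacement) per tree.

    Hypothetical helper: each tree would later be grown on one of these.
    """
    size = min(subsample_size, len(data))  # never ask for more rows than exist
    return [
        data.sample(n=size, replace=False, random_state=seed + i)
        for i in range(n_trees)
    ]


# Toy data: 1000 rows, two features
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(1000, 2)), columns=["x", "y"])

subsamples = draw_subsamples(toy, n_trees=10, subsample_size=256)
print(len(subsamples), subsamples[0].shape)
```

Each tree only ever sees its own small sample, so the cost of building the forest depends on the subsample size rather than on the size of the full dataset.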

The code

The overall code strategy is the following: first code a tree, then build a forest of trees (ensembling), and finally measure how far a certain instance goes in each tree to determine whether or not it is an outlier.

Let’s start with the tree.

Isolation Tree: Coding the tree

Isolation Tree pseudocode from [1]

The input will be a sample of the data, the current tree height, and the maximum depth.

For the output, we will have a built tree.

To make it easier to follow, I am working with pandas data frames. This is not optimal in terms of performance, but it is easier to follow for regular users.

  • Select a feature (column) of the data
import random
import numpy as np

def select_feature(data):
    return random.choice(data.columns)
  • Select a random value within the range
def select_value(data,feat):
    mini = data[feat].min()
    maxi = data[feat].max()
    return (maxi-mini)*np.random.random()+mini
  • Split data
def split_data(data, split_column, split_value):

    data_below = data[data[split_column] <= split_value]
    data_above = data[data[split_column] >  split_value]

    return data_below, data_above
  • All together: The Isolation Tree.

The idea is the following: select a feature, select a value of that feature, and split the data. If there is only one data point in the branch, or the tree has reached the maximum depth, stop.

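def isolation_tree(data, counter=0, max_depth=50, random_subspace=False):
    # Note: random_subspace is accepted but not used in this version

    # Stop recursing if max depth is reached or the data point is isolated
    if (counter == max_depth) or data.shape[0] <= 1:
        # classify_data is not defined in this section; it has to return the leaf value
        classification = classify_data(data)
        return classification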

    else:
        # Counter
        counter +=1

        # Select feature
        split_column = select_feature(data)

        # Select value
        split_value = select_value(data,split_column)
        # Split data
        data_below, data_above = split_data(data,split_column,split_value)

        # instantiate sub-tree      
        question = "{} <= {}".format(split_column, split_value)
        sub_tree = {question: []}

        # Recursive part
        below_answer = isolation_tree(data_below, counter,max_depth=max_depth)
        above_answer = isolation_tree(data_above, counter,max_depth=max_depth)

        if below_answer == above_answer:
            sub_tree = below_answer
        else:
            sub_tree[question].append(below_answer)
            sub_tree[question].append(above_answer)

        return sub_tree
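
The nested dictionaries returned by isolation_tree can be walked to measure how far a given instance goes in a tree. Below is a minimal sketch of that step. The name path_length is a hypothetical helper of mine, the tree is written by hand for illustration, and I assume that every question has the form "feature <= value" and that any value that is not a dict is a leaf.

```python
def path_length(instance, tree, depth=0):
    """Count the questions an instance answers before reaching a leaf.

    Assumes the {question: [below, above]} format, where question is
    a string such as "x <= 5.0". Anything that is not a dict is a leaf.
    """
    if not isinstance(tree, dict):
        return depth

    question = next(iter(tree))  # each node holds exactly one question
    feature, value = question.split(" <= ")
    below, above = tree[question]

    # Follow the same rule used when the data was split
    if instance[feature] <= float(value):
        return path_length(instance, below, depth + 1)
    return path_length(instance, above, depth + 1)


# Hand-written toy tree (illustrative values only)
toy_tree = {"x <= 5.0": ["leaf", {"y <= 1.0": ["leaf", "leaf"]}]}

shallow = path_length({"x": 1.0, "y": 0.0}, toy_tree)
deep = path_length({"x": 9.0, "y": 0.5}, toy_tree)
print(shallow, deep)
```

An instance with a short path was separated from the rest after only a few questions, which is what we expect from an anomaly.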


Isolation Forest from Scratch