HDTree: A Customizable and Interactable Decision Tree Written in Python

Introducing a customizable and interactable Decision Tree-Framework written in Python. This story will introduce yet another implementation of Decision Trees, which I wrote as part of my thesis.

Fast Track

What’s inside the story?

This story will introduce yet another implementation of Decision Trees, which I wrote as part of my thesis. The work will be divided into three chapters as follows: 

Firstly, I will try to motivate why I decided to take the time to come up with my own implementation of Decision Trees; I will list some of its features, but also the _disadvantages_ of the current implementation.

Secondly, I will guide you through the basic usage of HDTree using code snippets, explaining some details along the way.

Lastly, there will be some hints on how to customize and extend _HDTree_ with your own ideas.

However, this article will *not guide you through all of the* basics of Decision Trees. There are plenty of resources out there [1][2][3][16], so I see no need to repeat all of that; others have done it, and I would not be able to do it better. You don’t need to be an *expert* in Decision Trees to understand this article; a basic level of understanding should be sufficient to follow along. Some experience in the ML domain is a plus, though.

Motivation & Background

In my work I ended up working with Decision Trees. My actual goal is to implement a human-centric ML model, in which _HDTree_ (Human Decision Tree, for that matter) is an optional ingredient used as part of an actual user interface for that model. While this story focuses solely on HDTree, I might write a follow-up describing the other components in detail.

Features of HDTree & Comparison with scikit-learn Decision Trees

Naturally, I stumbled upon the scikit-learn implementation⁴ of decision trees. I guess many practitioners do. And let’s make something clear from the beginning: nothing is wrong with it.

The scikit-learn implementation has a lot of pros:

  • it’s fast and optimized: the implementation is written in a dialect called _Cython_⁵. Cython compiles to C code (which in turn compiles to native code) while maintaining interoperability with the Python interpreter.
  • it’s easy to use and convenient: many people in the ML domain know how to work with scikit-learn models, and thanks to its large user base you will easily find help.
  • it’s battle-tested (a lot of people are using it): it just works.
  • it supports many pre-pruning and post-pruning [6] methods and provides many features (e.g., Minimal Cost-Complexity Pruning³ or sample weights)
  • it supports basic visualization [7]

That said, it also has some shortcomings (as of the time of writing):

  • it’s not trivial to modify, partly due to the usage of the rather uncommon Cython dialect (see advantages above)
  • no way to incorporate user knowledge about the domain or to modify the learning process
  • the visualization is rather minimalistic
  • no support for categorical attributes / features
  • no support for missing values
  • the interface for accessing nodes and traversing the tree is cumbersome and not intuitive
  • only binary splits (see later)
  • no multivariate splits (see later)
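
To illustrate the point about the node interface, here is a minimal sketch of how a fitted scikit-learn tree can be walked by hand. It assumes scikit-learn is installed; the Iris dataset, the `max_depth=2` setting, and the helper name `describe` are my own choices for illustration. The tree is exposed through the low-level `tree_` attribute as a set of parallel arrays indexed by node id rather than as node objects.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The fitted structure lives in parallel arrays, all indexed by node id.
tree = clf.tree_


def describe(node_id=0, depth=0):
    """Collect one text line per node by following the child-index arrays."""
    indent = "  " * depth
    left = tree.children_left[node_id]
    right = tree.children_right[node_id]
    if left == right:  # a leaf stores the same sentinel value for both children
        predicted = iris.target_names[tree.value[node_id][0].argmax()]
        return [f"{indent}leaf: predict {predicted}"]
    name = iris.feature_names[tree.feature[node_id]]
    line = f"{indent}{name} <= {tree.threshold[node_id]:.2f}"
    return [line] + describe(left, depth + 1) + describe(right, depth + 1)


lines = describe()
print("\n".join(lines))
```

Every piece of information about a node (its children, the split feature, the threshold, the class distribution) has to be looked up in a separate array, which is what makes navigating the tree feel cumbersome.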

Features of HDTree

HDTree comes with a solution to most of the shortcomings mentioned in the above list, while sacrificing many of the advantages of the scikit-learn implementation. We will come back to those points later, so don’t worry if you do not understand every part of the following list yet:

👍 interact with the learning-behavior

👍 core components are modular and fairly easy to extend (implement an interface)

👍 purely written in Python (more approachable)

👍 rich visualization

👍 support categorical data

👍 support for missing values

👍 support for multivariate splits

👍 easy interface to navigate through the tree structure

👍 support for **n-ary splits** (> 2 child nodes)

👍 textual representations of decision paths

👍 encourages explainability by printing human-readable text

👎 slow

👎 not battle-tested (it _will_ have bugs)

👎 mediocre software quality

👎 not so many pruning options (it supports some basic options, though)
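
To make the terms "n-ary split", "categorical data" and "missing values" from the list a bit more concrete, here is a small pure-Python sketch. It is not HDTree's API or its actual algorithm: the function names, the toy data, and the decision to route missing values (`None`) into their own child node are all assumptions made for illustration. The sketch splits on a categorical attribute into one child per distinct value and scores the split with the usual entropy-based information gain.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def nary_split(values, labels):
    """Group labels by attribute value: one child per distinct category.

    Assumption for this sketch: a missing value (None) gets its own child.
    """
    children = defaultdict(list)
    for value, label in zip(values, labels):
        children[value].append(label)
    return dict(children)


def information_gain(labels, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = len(labels)
    weighted = sum(len(c) / total * entropy(c) for c in children.values())
    return entropy(labels) - weighted


# Hypothetical toy data: one categorical attribute with a missing entry.
colors = ["red", "green", "blue", "red", None, "green", "blue", "red"]
labels = ["yes", "no", "no", "yes", "no", "no", "no", "yes"]

children = nary_split(colors, labels)
gain = information_gain(labels, children)
print(f"{len(children)} child nodes, information gain = {gain:.3f}")
```

A binary-only tree would need several consecutive splits (and a numeric encoding of the colors) to separate the same groups, whereas the n-ary split does it in a single, directly readable step.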

⚠️ Although the disadvantages seem to be not too numerous, they are critical. Let us make that clear right away: Do not throw big data at it. You will wait forever. Do not use it in production. It may break unexpectedly. You have been warned!⚠️

Some of these problems may get fixed over time. However, the training speed probably will remain slow (inference is okay, though). You will have to come up with a better solution to fix that. You are very welcome to contribute 😃.

That said, what would be possible use cases?

  • extract knowledge from your data
  • test the intuition you have about your data
  • understand the inner workings of decision trees
  • explore alternative causal relationships regarding your learning problem
  • use it as part of your more complex algorithms
  • create reports and visualizations
  • use it for any research-related purposes
  • have an accessible platform to easily test your ideas for decision tree algorithms
