Decision Tree-Based Diabetes Classification in R

What is a decision tree?

A decision tree is a representation of a flowchart. The classification and regression tree (a.k.a decision tree) algorithm was developed by Breiman et al. 1984 (usually reported) but that certainly was not the earliest. Wei-Yin Loh of the University of Wisconsin has written about the history of decision trees. You can read it here “Fifty Years of Classification and Regression Trees”.

In a decision tree, the top node is called the “root node” and the bottom node “terminal node”. The other nodes are called “internal nodes” which includes a binary split condition, while each leaf node contains associated class labels.

Image for post

A classification tree uses a split condition to predict a class label based on the provided input variables. The splitting process starts from the top node (root node), and at each node, it checks whether supplied input values recursively continue to the left or right according to a supplied splitting condition (Gini or Information gain). This process terminates when a leaf or terminal node is reached.

Why use them?

A single decision tree-based model is easy to build, plot and interpret which makes this algorithm so popular. You can use this algorithm for performing classification as well as a regression task.

Data Background

In this example, we are going to use the Pima Indian Diabetes 2 data set obtained from the UCI Repository of machine learning databases (Newman et al. 1998).

This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The Pima Indian Diabetes 2 data set is the refined version (all missing values were assigned as NA) of the Pima Indian diabetes data. The data set contains the following independent and dependent variables.

#hyperparameter #decision-tree #tuning #machine-learning #diabetes #deep learning

What is a decision tree?

Why use them?

Data Background

towardsdatascience.com

Decision Tree-Based Diabetes Classification in R