Diabetes Prediction Model Explanation using LIME

Introduction

In the supervised machine learning world, there are two types of algorithmic task often performed. One is called regression (predicting continuous values) and the other is called classification (predicting discrete values). Black box algorithms such as SVM, random forest, boosted trees, neural networks provide better prediction accuracy than conventional algorithms. The problem starts when we want to understand the impact (magnitude and direction) of different variables. In this article, I have presented an example of Random Forest binary classification algorithm and its interpretation at the global and local level using Local Interpretable Model-agnostic Explanations (LIME).

Data Background

In this example, we are going to use the Pima Indian Diabetes 2 data set obtained from the UCI Repository of machine learning databases (Newman et al. 1998).

This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The Pima Indian Diabetes 2 data set is the refined version (all missing values were assigned as NA) of the Pima Indian diabetes data. The data set contains the following independent and dependent variables.

#data-science #python #machine-learning #model-explanation #lime

Introduction

Data Background

towardsdatascience.com

Diabetes Prediction Model Explanation using LIME