Ron is a retiree on a fixed income. He recently applied for a car loan but was rejected. Rather annoyed, he is demanding an explanation for the rejection.
You are the loan specialist and are responsible for reviewing the automated decision with Ron. You are eager to help him since getting what he wants also benefits your business. You meet with Ron to review his rejection in detail and to see what advice you can provide.
The data set for this example is the Loan Default Data available from Kaggle. The data consists of 10 features and a target output of “serious delinquency past 90 days”. We will use the target as a recommendation for loan approval: if the prediction for “serious delinquency” is true, the loan is rejected; otherwise it is approved.
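To make the decision rule concrete, here is a minimal sketch of the mapping from the model’s prediction to a loan decision. The column names and values below are illustrative stand-ins, not the actual Kaggle file:

```python
import pandas as pd

# Illustrative stand-in for the loan data (column names are assumptions).
df = pd.DataFrame({
    "SeriousDelinquency": [0, 1, 0],   # target: 1 = serious delinquency past 90 days
    "Utilization": [0.30, 1.20, 0.05],
    "MonthlyIncome": [5400, 1800, 2500],
})

def loan_decision(pred_delinquent: int) -> str:
    """Decision rule from the text: predicted delinquency means rejection."""
    return "rejected" if pred_delinquent == 1 else "approved"

decisions = df["SeriousDelinquency"].map(loan_decision)
print(decisions.tolist())  # ['approved', 'rejected', 'approved']
```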
A Snippet of the Loan Default Dataset
Per our discussion in part 1, we should first see if there are any features that are highly correlated.
Correlation Matrix of all variables
The data shows three instances of high correlation: late debt payments are highly correlated across all three time frames (30–59 days, 60–89 days, and 90 days).
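A correlation matrix like the one above is a one-liner in pandas. The sketch below uses synthetic stand-in data (with deliberately correlated late-payment counts, via a shared component), since it is meant as an illustration rather than the original analysis:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# A shared base count makes the three late-payment columns correlated by design.
base = rng.poisson(1.0, n)
df = pd.DataFrame({
    "30-59DaysLate": base + rng.poisson(0.2, n),
    "60-89DaysLate": base + rng.poisson(0.2, n),
    "90DaysLate": base + rng.poisson(0.2, n),
    "MonthlyIncome": rng.normal(5000, 1500, n),
})

corr = df.corr()  # pairwise Pearson correlations
print(corr.round(2))
```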
To deal with these correlations, both grouping and elimination strategies were tried (i.e., aggregating all three variables into one, or dropping one or more of the highly correlated variables). However, no significant differences were found in how they affected the predictions. To simplify the interpretation of the explanation downstream, I eliminated the two features with the lower correlation to the target variable, retaining the feature “30–59DaysLate”.
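Both strategies are simple to express with pandas. A minimal sketch, assuming illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "30-59DaysLate": [0, 2, 1],
    "60-89DaysLate": [0, 1, 1],
    "90DaysLate": [0, 1, 0],
    "MonthlyIncome": [5400, 1800, 2500],
})
late_cols = ["30-59DaysLate", "60-89DaysLate", "90DaysLate"]

# Grouping strategy: aggregate the three correlated counts into one feature.
grouped = df.assign(TotalLatePayments=df[late_cols].sum(axis=1)).drop(columns=late_cols)

# Elimination strategy (the one used here): keep "30-59DaysLate", drop the rest.
reduced = df.drop(columns=["60-89DaysLate", "90DaysLate"])
print(list(reduced.columns))  # ['30-59DaysLate', 'MonthlyIncome']
```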
The Random Forest algorithm from scikit-learn was used to fit a model with these features for a prediction of loan approval (accuracy = 0.93). The relative importance of the features is as follows.
Feature Importance for Loan Approval
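The fit-and-inspect step looks roughly like the following. This is a sketch on synthetic data in which, by construction, utilization drives the outcome, so the reported accuracy and importances will not match the real analysis:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
# Columns: Utilization, DebtRatio, MonthlyIncome (synthetic stand-ins).
X = np.column_stack([
    rng.uniform(0.0, 1.5, n),
    rng.uniform(0.0, 2.0, n),
    rng.normal(5000, 1500, n),
])
# Approval (1) when utilization is low, with a little noise at the boundary.
y = (X[:, 0] + 0.1 * rng.normal(size=n) < 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print("accuracy:", round(model.score(X_test, y_test), 2))
print("importances:", model.feature_importances_.round(2))
```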
Feature importance analysis shows that the most important feature for our model is “Utilization”. This feature is defined as the “total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits”; in other words, the ratio of credit balance to credit limit. Its importance is closely followed by debt ratio and monthly income. Let’s plot these features using our modified ICE plot to see what is going on at a more detailed level for our individual instance.
The top six features visualized using our modified ICE plot are as follows. I’m using the pyCEbox package in Python to calculate ICE values, with modifications to the visualization functions as described in part 1. The code for this can be found on GitHub.
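pyCEbox does this calculation for you, but the underlying idea is small enough to sketch by hand: for each row, sweep one feature over a grid while holding every other feature at its observed value, and record the model’s prediction at each grid point; the average of all those curves is the PD curve. Everything below (column names, thresholds, toy model) is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def ice_curves(model, X: pd.DataFrame, feature: str, grid: np.ndarray) -> np.ndarray:
    """One ICE curve per row of X: sweep `feature` over `grid`,
    holding all other features at their observed values."""
    curves = np.empty((len(X), len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[feature] = value
        curves[:, j] = model.predict_proba(X_mod)[:, 1]  # P(approved)
    return curves

# Toy model: approval (1) when utilization is low.
rng = np.random.default_rng(2)
X = pd.DataFrame({"Utilization": rng.uniform(0.0, 1.5, 300),
                  "DebtRatio": rng.uniform(0.0, 2.0, 300)})
y = (X["Utilization"] < 1.0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

grid = np.linspace(0.0, 1.5, 20)
curves = ice_curves(model, X.head(50), "Utilization", grid)
pd_curve = curves.mean(axis=0)  # partial dependence = average of ICE curves
```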
Recall from part 1 that:
Let’s examine these sub-plots in turn. Reminder that a prediction of 1 is good (loan approved) and 0 is bad (loan declined).
Utilization: The ICE plot shows substantial turbulence up to a value of 1.0 on the x-axis, after which there is more volatility at a lower level of probability, between 1.0 and 1.1, before stabilizing. The PD curve reflects this, but shows clearly that there is a drop around a value of 1.0.
The ICE curve for Ron (in blue with round markers) follows the PD curve more or less up to a value of 1.0 on the x-axis, after which there is a precipitous drop to well below the PD curve. Moreover, as shown by the yellow diamond marker, the prediction for Ron is near the bottom of the ICE curve for this feature. Since this is the most important feature, it is an important partial explanation of why Ron’s loan application was declined.
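The highlighting itself is plain matplotlib: draw all ICE curves faintly, the PD curve in bold, the individual’s curve with round markers, and a diamond at the actual feature value. The curves below are synthetic (a logistic drop around 1.0) just to show the plotting pattern, not the real data:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(0.0, 1.5, 20)
# Synthetic ICE curves: P(approved) falling off around utilization 1.0.
curves = 1.0 / (1.0 + np.exp(8.0 * (grid - 1.0))) + rng.normal(0, 0.05, (30, 20))
pd_curve = curves.mean(axis=0)

fig, ax = plt.subplots()
ax.plot(grid, curves.T, color="lightgray", linewidth=0.5)   # all ICE curves
ax.plot(grid, pd_curve, color="black", linewidth=2, label="PD curve")
ax.plot(grid, curves[0], color="blue", marker="o", label="individual ICE curve")
x_actual = 1.2  # the individual's actual utilization (made up)
y_actual = np.interp(x_actual, grid, curves[0])
ax.plot(x_actual, y_actual, marker="D", color="gold", markersize=10,
        linestyle="none", label="actual prediction")
ax.set_xlabel("Utilization")
ax.set_ylabel("P(approved)")
ax.legend()
fig.savefig("ice_utilization.png")
```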
#data-visualization #interpretable-ai #data-analysis