Ron is a retiree on a fixed income. He recently applied for a car loan but was rejected. Rather annoyed, he is demanding an explanation for the rejection.
You are the loan specialist and are responsible for reviewing the automated decision with Ron. You are eager to help him since getting what he wants also benefits your business. You meet with Ron to review his rejection in detail and to see what advice you can provide.
The data set for this example is the Loan Default Data available from Kaggle. The data consists of 10 features and a target output of “serious delinquency past 90 days”. We will use the target as a recommendation for loan approval: if the prediction for “serious delinquency” is true, the loan is rejected; otherwise it is approved.
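To make the decision rule concrete, here is a minimal sketch of the mapping from the model’s prediction to a loan decision. The column names and values below are illustrative stand-ins, not the actual Kaggle file:

```python
import pandas as pd

# Illustrative stand-in for the loan data (column names are assumptions).
df = pd.DataFrame({
    "SeriousDelinquency": [0, 1, 0],   # target: 1 = serious delinquency past 90 days
    "Utilization": [0.30, 1.20, 0.05],
    "MonthlyIncome": [5400, 1800, 2500],
})

def loan_decision(pred_delinquent: int) -> str:
    """Decision rule from the text: predicted delinquency means rejection."""
    return "rejected" if pred_delinquent == 1 else "approved"

decisions = df["SeriousDelinquency"].map(loan_decision)
print(decisions.tolist())  # ['approved', 'rejected', 'approved']
```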
A Snippet of the Loan Default Dataset
Per our discussion in part 1, we should first see if there are any features that are highly correlated.
Correlation Matrix of all variables
The data shows three instances of high correlation: late debt payments are highly correlated across all three time frames (30–59 days, 60–89 days, and 90 days).
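A correlation matrix like the one above is a one-liner in pandas. The sketch below uses synthetic stand-in data (with deliberately correlated late-payment counts, via a shared component), since it is meant as an illustration rather than the original analysis:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# A shared base count makes the three late-payment columns correlated by design.
base = rng.poisson(1.0, n)
df = pd.DataFrame({
    "30-59DaysLate": base + rng.poisson(0.2, n),
    "60-89DaysLate": base + rng.poisson(0.2, n),
    "90DaysLate": base + rng.poisson(0.2, n),
    "MonthlyIncome": rng.normal(5000, 1500, n),
})

corr = df.corr()  # pairwise Pearson correlations
print(corr.round(2))
```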
To deal with these correlations, both grouping and elimination strategies were tried (i.e., aggregating all three variables into one, or dropping one or more of the highly correlated variables). However, no significant differences were found in how they affected the predictions. To simplify the interpretation of the explanation downstream, I eliminated the two features with the lower correlation to the target variable, retaining the feature “30–59DaysLate”.
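Both strategies are simple to express with pandas. A minimal sketch, assuming illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "30-59DaysLate": [0, 2, 1],
    "60-89DaysLate": [0, 1, 1],
    "90DaysLate": [0, 1, 0],
    "MonthlyIncome": [5400, 1800, 2500],
})
late_cols = ["30-59DaysLate", "60-89DaysLate", "90DaysLate"]

# Grouping strategy: aggregate the three correlated counts into one feature.
grouped = df.assign(TotalLatePayments=df[late_cols].sum(axis=1)).drop(columns=late_cols)

# Elimination strategy (the one used here): keep "30-59DaysLate", drop the rest.
reduced = df.drop(columns=["60-89DaysLate", "90DaysLate"])
print(list(reduced.columns))  # ['30-59DaysLate', 'MonthlyIncome']
```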
The Random Forest algorithm from scikit-learn was used to fit a model with these features for a prediction of loan approval (accuracy = 0.93). The relative importance of the features is as follows.
Feature Importance for Loan Approval
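The fit-and-inspect step looks roughly like the following. This is a sketch on synthetic data in which, by construction, utilization drives the outcome, so the reported accuracy and importances will not match the real analysis:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
# Columns: Utilization, DebtRatio, MonthlyIncome (synthetic stand-ins).
X = np.column_stack([
    rng.uniform(0.0, 1.5, n),
    rng.uniform(0.0, 2.0, n),
    rng.normal(5000, 1500, n),
])
# Approval (1) when utilization is low, with a little noise at the boundary.
y = (X[:, 0] + 0.1 * rng.normal(size=n) < 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print("accuracy:", round(model.score(X_test, y_test), 2))
print("importances:", model.feature_importances_.round(2))
```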
Feature importance analysis shows that the most important feature for our model is “Utilization”. This feature is defined as the “total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits”; in other words, the ratio of credit balance to credit limit. Its importance is closely followed by debt ratio and monthly income. Let’s plot these features using our modified ICE plot to see what is going on at a more detailed level for our individual instance.
The top six features visualized using our modified ICE plot are as follows. I’m using the pyCEbox package in Python to calculate ICE values, with modifications to the visualization functions as described in part 1. The code for this can be found on GitHub.
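pyCEbox does this calculation for you, but the underlying idea is small enough to sketch by hand: for each row, sweep one feature over a grid while holding every other feature at its observed value, and record the model’s prediction at each grid point; the average of all those curves is the PD curve. Everything below (column names, thresholds, toy model) is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def ice_curves(model, X: pd.DataFrame, feature: str, grid: np.ndarray) -> np.ndarray:
    """One ICE curve per row of X: sweep `feature` over `grid`,
    holding all other features at their observed values."""
    curves = np.empty((len(X), len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[feature] = value
        curves[:, j] = model.predict_proba(X_mod)[:, 1]  # P(approved)
    return curves

# Toy model: approval (1) when utilization is low.
rng = np.random.default_rng(2)
X = pd.DataFrame({"Utilization": rng.uniform(0.0, 1.5, 300),
                  "DebtRatio": rng.uniform(0.0, 2.0, 300)})
y = (X["Utilization"] < 1.0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

grid = np.linspace(0.0, 1.5, 20)
curves = ice_curves(model, X.head(50), "Utilization", grid)
pd_curve = curves.mean(axis=0)  # partial dependence = average of ICE curves
```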
Recall from part 1 that:
Let’s examine these sub-plots in turn. Reminder that a prediction of 1 is good (loan approved) and 0 is bad (loan declined).
Utilization: The ICE plot shows substantial turbulence up to a value of 1.0 on the x-axis, after which there is more volatility at a lower level of probability, between 1.0 and 1.1, before stabilizing. The PD curve reflects this, but shows clearly that there is a drop around a value of 1.0.
The ICE curve for Ron (in blue with round markers) follows the PD curve more or less up to a value of 1.0 on the x-axis, after which there is a precipitous drop to well below the PD curve. Moreover, as shown by the yellow diamond marker, the prediction for Ron is near the bottom of the ICE curve for this feature. Since this is the most important feature, it is an important partial explanation of why Ron’s loan application was declined.
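The highlighting itself is plain matplotlib: draw all ICE curves faintly, the PD curve in bold, the individual’s curve with round markers, and a diamond at the actual feature value. The curves below are synthetic (a logistic drop around 1.0) just to show the plotting pattern, not the real data:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(0.0, 1.5, 20)
# Synthetic ICE curves: P(approved) falling off around utilization 1.0.
curves = 1.0 / (1.0 + np.exp(8.0 * (grid - 1.0))) + rng.normal(0, 0.05, (30, 20))
pd_curve = curves.mean(axis=0)

fig, ax = plt.subplots()
ax.plot(grid, curves.T, color="lightgray", linewidth=0.5)   # all ICE curves
ax.plot(grid, pd_curve, color="black", linewidth=2, label="PD curve")
ax.plot(grid, curves[0], color="blue", marker="o", label="individual ICE curve")
x_actual = 1.2  # the individual's actual utilization (made up)
y_actual = np.interp(x_actual, grid, curves[0])
ax.plot(x_actual, y_actual, marker="D", color="gold", markersize=10,
        linestyle="none", label="actual prediction")
ax.set_xlabel("Utilization")
ax.set_ylabel("P(approved)")
ax.legend()
fig.savefig("ice_utilization.png")
```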
#data-visualization #interpretable-ai #data-analysis