The Problem

The Santander Group is a global banking group, led by Banco Santander S.A., the largest bank in the euro area. It originated in Santander, Cantabria, Spain. Like every bank, it has a retention program that should be applied to unsatisfied customers.

To use this program properly, we need to develop a machine learning model that classifies whether a customer is satisfied or not. Customers classified as unsatisfied should be the target of the retention program.

The retention program costs $10 for each customer, and an effective application (to customers who are truly unsatisfied) returns a profit of $100. In the classification task we can have the following scenarios:

  1. False Positive (FP): the customer is classified as UNSATISFIED but is actually SATISFIED. Cost: $10, Earn: $0;
  2. False Negative (FN): the customer is classified as SATISFIED but is actually UNSATISFIED. Cost: $0, Earn: $0;
  3. True Positive (TP): the customer is classified as UNSATISFIED and is actually UNSATISFIED. Cost: $10, Earn: $100;
  4. True Negative (TN): the customer is classified as SATISFIED and is actually SATISFIED. Cost: $0, Earn: $0.
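The four scenarios can be turned into a single profit figure. The sketch below is a minimal illustration, not the notebook's code: the function names `confusion_counts` and `retention_profit` are hypothetical, labels use 1 for UNSATISFIED and 0 for SATISFIED, and it assumes the net gain of a true positive is $100 earned minus the $10 cost.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN, where 1 = UNSATISFIED and 0 = SATISFIED."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def retention_profit(tp, fp, cost=10.0, gain=100.0):
    """Net profit of the program (assumption: a TP nets gain - cost).

    Every customer flagged as UNSATISFIED costs `cost`;
    only true positives earn `gain`. FN and TN cost and earn nothing.
    """
    return tp * (gain - cost) - fp * cost


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
profit = retention_profit(tp, fp)
print(tp, fp, fn, tn, profit)
```

Note that a false negative has no direct cost in this accounting, but it is a missed opportunity: each one is a customer who could have returned a profit.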

In summary, we want to minimize the rates of FP and FN and maximize the rate of TP. To do so, we will use the AUC (area under the curve) of the ROC (receiver operating characteristic) curve: the AUC measures how well a model separates the two classes across all thresholds, which lets us compare models, and the ROC curve itself then helps us choose the best threshold.
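To make the two roles concrete, here is a minimal pure-Python sketch, assuming the $10 cost and $100 return stated above. The helpers `roc_auc` and `best_threshold_by_profit` are hypothetical names written for this illustration, and the scores are made up. In practice a library routine such as scikit-learn's `roc_auc_score` computes the same AUC quantity.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random UNSATISFIED customer (1)
    gets a higher score than a random SATISFIED one (0); ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_threshold_by_profit(y_true, scores, cost=10.0, gain=100.0):
    """Try each observed score as a threshold (flag if score >= threshold)
    and return (profit, threshold) for the most profitable one."""
    best = (0.0, float("inf"))  # baseline: flag nobody, profit 0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        profit = tp * (gain - cost) - fp * cost
        if profit > best[0]:
            best = (profit, t)
    return best


# Made-up model scores for eight customers (1 = UNSATISFIED).
y_true = [0, 0, 1, 0, 1, 1, 0, 0]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.3, 0.05]

auc = roc_auc(y_true, scores)
profit, threshold = best_threshold_by_profit(y_true, scores)
print(round(auc, 3), profit, threshold)
```

The AUC does not depend on any threshold, so it is suited to ranking candidate models; the threshold is a separate decision, made here by sweeping the cut-off and keeping the one with the highest profit.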

You can check the complete notebook with this solution on my GitHub.

This case was made as part of the prize for winning the Santander Data Masters Competition. I explain more about the competition itself, and about the hard skills I learned and the soft skills I used on my way to winning it, in this article.

Let’s go.


Santander Case — Part A: Classification