How can two critical issues be explored and solved at the same time?
To explore these two problems, we need a few tools and a dataset:
!pip install shap
!pip install catboost
After importing the libraries required for this use case and loading the data, we can see that the dataset is composed of 1470 entries, with 9 categorical features and 25 numerical ones.
import pandas as pd
from imblearn.under_sampling import ClusterCentroids
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from catboost import CatBoostClassifier
import shap
## The following lines should be used only on Google Colab
## to connect to your Google Drive
from google.colab import drive
drive.mount('/content/drive')
The “EmployeeNumber” column is unique for each employee, so we will use it as the index of our Pandas dataframe. It sits at zero-based position 9 in the file (that is, the tenth column), which is why we pass “index_col=9” when reading the CSV file:
df = pd.read_csv("./WA_Fn-UseC_-HR-Employee-Attrition.csv", index_col=9)
df.info()
There are no missing values anywhere in the dataset… this is clearly a synthetic one 😅.
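The checks described so far (a unique index, no missing values, and the split between categorical and numerical columns) can also be verified in code rather than read off the `df.info()` output. The snippet below is a minimal sketch and not part of the original notebook: it uses a tiny made-up CSV held in memory, with hypothetical values that only mimic the layout of the real file, and it selects the index by column name instead of by position, which is an alternative to “index_col=9”.

```python
import io

import pandas as pd

# Tiny made-up sample (hypothetical values) that mimics the layout of the real CSV.
csv_text = """Age,Attrition,Department,EmployeeNumber,MonthlyIncome
30,Yes,Sales,101,4000
45,No,Research,102,6500
38,No,Sales,105,5200
"""

# Selecting the index by name avoids off-by-one mistakes with column positions.
sample = pd.read_csv(io.StringIO(csv_text), index_col="EmployeeNumber")

# The index should identify each employee exactly once.
index_is_unique = sample.index.is_unique

# Total number of missing cells across the whole dataframe.
missing_cells = int(sample.isna().sum().sum())

# Split the columns into categorical (non-numeric) and numerical ones.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
numerical_cols = sample.select_dtypes(include="number").columns.tolist()

print("Unique index:", index_is_unique)
print("Missing cells:", missing_cells)
print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Running the same four checks on the real `df` should confirm the counts quoted earlier, assuming the file matches the description given in this article.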