When we have an imbalanced dataset (say 90% A’s and 10% B’s in the label), we should be careful with the “train/test splitting” step (and also with cross-validation).

There are 3 things to do:

  • Split the data in such a way that your test set has the same class proportions as the whole set. Otherwise, by pure randomness, one of the sets may consist almost entirely of A’s: if that happens to the training set, the model learns to predict only A’s; if it happens to the test set, we never measure how the model handles B’s. With the correct proportions, the model can be trained and evaluated on B’s as well.
  • Do some oversampling of the minority class for fair training (and also undersampling of the majority class if needed). Search for SMOTE and you’ll see how to do it.
  • Don’t measure only accuracy, but also precision and recall. Think about it: a model so bad that it labels every instance as A would still score 90% accuracy, even though it has learned nothing about B’s. That’s why we have other metrics. There are tons of articles about this, too. Do this even if you did the previous step: the previous step is for the sake of fair training, whereas this step is for the sake of fair measurement.
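The three steps above can be sketched with scikit-learn on made-up 90/10 data (the labels and the always-A “model” here are illustrative, not from the original post; for step 2, `SMOTE` lives in the separate imbalanced-learn package and is omitted here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# toy imbalanced labels: 90% A, 10% B
y = np.array(["A"] * 90 + ["B"] * 10)
X = np.arange(100).reshape(-1, 1)

# step 1: stratify=y keeps the 90/10 ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(np.mean(y_test == "A"))  # 0.9 -- same proportion as the whole set

# step 3: a dummy "model" that always predicts A scores 90% accuracy,
# but its recall for B is 0 -- precision/recall expose what accuracy hides
y_pred = np.full_like(y_test, "A")
print(classification_report(y_test, y_pred, zero_division=0))
```

Without `stratify=y`, the 90/10 ratio in the test set would only hold on average, not on every run.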

We’ll be focusing on the first step right now.

Above, we stressed the importance of the test set and explained why. That’s why we’re going to look only at y_test below.

Some terminology:

  • Shuffling: reordering the data so that samples are picked from here and there rather than in contiguous blocks.
  • random_state: a parameter that makes the split reproducible, so we see the same result every time we run the script. It only takes effect when shuffling is enabled.
  • n_splits: how many splits (folds) will be created.
  • test_size: apart from the purpose of cross-validation, how much of the data will be used as the test set. Some of the splitter classes have this parameter, some don’t. When it is not present, the test indices amount to the fraction 1/n_splits of the data.
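A quick sketch of the terminology above, using two splitter classes on made-up data (the ten-element X is illustrative):

```python
from sklearn.model_selection import KFold, ShuffleSplit

X = list(range(10))

# n_splits=5 with no test_size: each test fold is 1/n_splits = 20% of the data;
# random_state only matters because shuffle=True
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in kf.split(X):
    print(len(test_index))  # 2 each time

# ShuffleSplit does take test_size; random_state makes the shuffle repeatable
ss = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_index, test_index in ss.split(X):
    print(sorted(test_index))  # 3 indices, identical across reruns
```

Note that KFold raises an error if you pass random_state without shuffle=True, which is what “requires shuffling” means above.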

Notebook

You can find the notebook where the following code resides here.

Data


There are several similar splitter classes that differ in some ways. We’ll look into some of them now.

Here is the generic function that we’ll use below.

def showSplits(splitter, X, y):
    # for each split, print the class proportions of the test fold
    # (y is a pandas Series) and the test indices themselves
    for i, (train_index, test_index) in enumerate(splitter.split(X, y)):
        print(f"split no: {i}")
        y_train, y_test = y[train_index], y[test_index]
        print(y_test.value_counts(normalize=True), end="\n\n")
        print(test_index, end="\n\n--------------------------------\n\n")
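As a preview of the comparison below, here is the same loop as showSplits run on made-up 90/10 data, contrasting a plain splitter with a stratified one (the toy X and y are illustrative):

```python
import pandas as pd
from sklearn.model_selection import KFold, StratifiedKFold

# toy imbalanced labels as a pandas Series: 18 A's, then 2 B's
y = pd.Series(["A"] * 18 + ["B"] * 2)
X = pd.DataFrame({"feature": range(20)})

for splitter in (KFold(n_splits=2), StratifiedKFold(n_splits=2)):
    print(type(splitter).__name__)
    for train_index, test_index in splitter.split(X, y):
        # plain KFold cuts the data in contiguous blocks, so its first
        # test fold (indices 0-9) contains no B's at all;
        # StratifiedKFold keeps ~90/10 in every test fold
        print(y[test_index].value_counts(normalize=True), end="\n\n")
```

This is exactly the failure mode from the first step above: without stratification, a whole class can vanish from a fold.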

#train-test-split #python #imbalanced-data #stratification #testing

3 Steps In Case Of Imbalanced Data And Look At The Splitter Classes