Introduction

Picking the right machine learning algorithm is a decisive step, since it largely determines the performance of the final model. The most dominant factor in choosing a model is its performance, which is typically estimated with k-fold cross-validation so that the estimate does not depend on a single train/test split.
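
For instance, a model's mean performance can be estimated with k-fold cross-validation as in the minimal sketch below; the dataset, model, and fold count are illustrative assumptions, not the tutorial's final setup:

```python
# Minimal sketch: estimating model performance with k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data stands in for a real dataset here.
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# Each fold yields one accuracy score; the mean summarizes the model's performance.
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```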

The chosen model is usually the one with the higher mean performance. Nevertheless, that advantage can sometimes arise from a statistical fluke. To address this concern, there are several statistical hypothesis-testing approaches for evaluating the difference in mean performance obtained from cross-validation. If the resulting **p-value** is below the chosen significance level, we can reject the null hypothesis that the two algorithms perform the same and conclude that the observed difference is significant.
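
As a simple illustration of that decision rule (not the final procedure used later in this tutorial), one could run a paired t-test on the per-fold scores of two classifiers and compare the resulting p-value to a significance level; the dataset, the two models, and the alpha of 0.05 below are assumptions for demonstration:

```python
# Illustrative sketch: comparing two classifiers' per-fold CV scores with a paired t-test.
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
cv = KFold(n_splits=10, shuffle=True, random_state=1)

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring="accuracy", cv=cv)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, scoring="accuracy", cv=cv)

# Null hypothesis: both models have the same mean performance.
t_stat, p_value = ttest_rel(scores_a, scores_b)
alpha = 0.05
if p_value < alpha:
    print(f"p={p_value:.4f} < {alpha}: reject the null hypothesis; the difference is significant.")
else:
    print(f"p={p_value:.4f} >= {alpha}: fail to reject the null hypothesis.")
```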

I usually include such a step in my pipeline either when developing a new classification model or competing in one of Kaggle’s competitions.


Tutorial Objectives

  1. Understanding the differences between statistical hypothesis tests.
  2. Understanding why model selection based on the mean performance score alone can be misleading.
  3. Why the paired Student's t-test is preferred over the original Student's t-test.
  4. Applying the 5x2-fold cross-validation technique, using the MLxtend library, to compare algorithms based on the p-value (a minimal sketch follows this list).
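
The sketch below shows how the 5x2cv paired t-test can be run with MLxtend's `paired_ttest_5x2cv`; the models, dataset, scoring metric, and random seed are placeholder assumptions rather than the tutorial's final configuration:

```python
# Minimal sketch of the 5x2cv paired t-test with MLxtend.
from mlxtend.evaluate import paired_ttest_5x2cv
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# The procedure repeats 2-fold cross-validation five times and
# returns a t statistic and its p-value for the two estimators.
t_stat, p_value = paired_ttest_5x2cv(
    estimator1=LogisticRegression(max_iter=1000),
    estimator2=DecisionTreeClassifier(random_state=1),
    X=X,
    y=y,
    scoring="accuracy",
    random_seed=1,
)
print(f"t statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
```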

Table of Contents

  1. What does statistical significance testing mean?
  2. Types of commonly used statistical hypothesis tests
  3. Extracting the best two models based on performance
  4. Steps to conduct hypothesis testing on the best two models
  5. Steps to apply the 5x2-fold procedure
  6. Comparing classifier algorithms
  7. Summary
  8. References

#statistics #machine-learning #python #classification-algorithms #hypothesis-testing
