
Picking the right machine learning algorithm is decisive because it determines the performance of the model. The dominant factor in choosing a model is its performance, which is usually estimated with k-fold cross-validation so that every score comes from data the model was not trained on.

The chosen model is usually the one with the higher mean performance. Nevertheless, that advantage sometimes arises from a statistical fluke. To address this concern, there are many statistical hypothesis tests for evaluating the difference in mean performance obtained from cross-validation. If the test's **p-value** is below the chosen significance level, we can reject the null hypothesis that the two algorithms perform the same and conclude that the difference is significant.
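As a minimal sketch of this idea, the snippet below computes a paired t statistic from the per-fold scores of two classifiers evaluated on the same folds. The fold scores are made-up illustrative numbers, not results from this tutorial, and the helper name `paired_t_statistic` is hypothetical. To stay within the standard library, it compares the statistic against the tabulated two-sided critical value for 9 degrees of freedom instead of computing an exact p-value.

```python
import math
import statistics

# Hypothetical per-fold accuracies for two classifiers scored on the SAME
# 10 folds (illustrative numbers only, not results from the article).
scores_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
scores_b = [0.80, 0.78, 0.82, 0.80, 0.81, 0.79, 0.80, 0.79, 0.83, 0.78]


def paired_t_statistic(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


t_stat, dof = paired_t_statistic(scores_a, scores_b)

# Two-sided critical value of Student's t for alpha = 0.05 and 9 degrees
# of freedom, taken from a standard t table.
T_CRIT = 2.262
reject_null = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.3f}, dof = {dof}, reject null: {reject_null}")
```

In practice, `scipy.stats.ttest_rel(scores_a, scores_b)` computes the same statistic together with an exact p-value. Note that scores from ordinary k-fold cross-validation share training data across folds, so this simple test can be optimistic.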

I usually include such a step in my pipeline either when developing a new classification model or competing in one of Kaggle’s competitions.

Tutorial Objectives

  1. Understanding the difference between statistical hypothesis tests.
  2. Seeing why model selection based on the mean performance score alone can be misleading.
  3. Understanding why to use the paired Student’s t-test rather than the original Student’s t-test.
  4. Applying the advanced 5x2cv technique with the MLxtend library to compare algorithms based on the p-value.

Table of contents

  1. What does statistical significance testing mean?
  2. Types of commonly used statistical hypothesis tests
  3. Extract the best two models based on performance
  4. Steps to conduct hypothesis testing on the best two
  5. Steps to apply the 5x2cv procedure
  6. Comparing classifier algorithms
  7. Summary
  8. References

#statistics #machine-learning #python #classification-algorithms #hypothesis-testing

Evaluate ML Classifier Performance using Statistical Hypothesis Testing in Python