Applying Machine Learning To The E.Coli Class Imbalance Dataset

The E.Coli dataset is a very popular dataset to experiment on because it is a multi-classification that has several imbalances. The E.coli dataset is such a difficult dataset to find a solution for that I have not been able to find a lot written about it on the internet.

Jason Brownlee, of masteringmachinelearning.com suggested deleting the rows deleting the rows of the highly imbalance classes, but in my opinion such a practice defeats the purpose of endeavouring to make predictions.

After much exhaustive research I was able to come up with a solution where all eight classes in the dataset were identified and predicted on.

The E.coli dataset is credited to Kenta Nakai and was developed into its current form by Paul Horton and Kenta Nakai in their 1996 paper titled “A Probabilistic Classification System For Predicting The Cellular Localization Sites Of Proteins.” In it, they achieved a classification accuracy of 81%. It is comprised of 336 examples of E.coli proteins and each example is described using seven input variables calculated from the proteins amino acid sequence.

The first thing I did was to load the dataset and to label the columns because they were not labeled in the original dataset:-

#machine-learning #python #sklearn #class-imbalance #data-science

medium.com

Applying Machine Learning To The E.Coli Class Imbalance Dataset