So…data engineering again! Last week I participated in a Kaggle competition on Mechanisms of Action Prediction. (This competition is still ongoing, so try it if you want!) Basically, it asks you to train an algorithm to classify drugs based on their biological activity. I want to share some quite useful (and simple!) techniques for improving accuracy on tabular data that I learned in this competition. Hope it helps!

Disclaimer: the following content is largely based on this amazing notebook. Make sure to check it out for implementation details!

RankGauss

Many machine learning techniques, such as PCA, work best when the underlying data is roughly normally distributed. When a dataset is far from normal, these techniques tend to become less useful. RankGauss, a technique introduced by Michael Jahrer in the Porto Seguro’s Safe Driver Prediction competition, is one solution to this problem. I will not dive into the math behind it (because I am not good at math!), but the purpose of RankGauss is to transform a non-normal distribution into an approximately normal one, and it usually works better than plain standardization. Let us see the effect first!
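If you want a feel for the idea before opening the notebook, here is a minimal sketch of a rank-based Gaussian transform using only the Python standard library. This is my own simplified variant, not necessarily the exact recipe from the original RankGauss write-up or the notebook: the function name `rank_gauss` and the `(rank - 0.5) / n` scaling are assumptions made for illustration. It ranks the values, squeezes the ranks into the open interval (0, 1), and passes them through the inverse normal CDF.

```python
import random
from statistics import NormalDist, mean


def rank_gauss(values):
    """Map values to roughly standard-normal scores based on their ranks.

    Hypothetical helper for illustration only.
    """
    n = len(values)
    if n == 0:
        return []

    # Indices of the values in ascending order.
    order = sorted(range(n), key=lambda i: values[i])

    # Assign 1-based ranks; tied values share the average of their ranks.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # (rank - 0.5) / n stays strictly inside (0, 1), so inv_cdf is finite.
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Toy example: a heavily right-skewed feature (exponential draws).
random.seed(0)
skewed = [random.expovariate(1.0) for _ in range(1000)]
transformed = rank_gauss(skewed)

print("mean before:", mean(skewed))
print("mean after: ", mean(transformed))
```

Because the transform only looks at ranks, it keeps the ordering of the values but throws away the original spacing between them, which is why outliers stop dominating the feature. In practice you would probably reach for a library instead; for example, scikit-learn's `QuantileTransformer` with `output_distribution="normal"` is a closely related, ready-made option.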

#kaggle #data-science #neural-networks #machine-learning #deep-learning

Simple Data Engineering to Improve Your Machine Learning Results