1. Jigsaw Multilingual Toxic Comment Classification Competition

In the Jigsaw competition, cross-validation, pseudo-labeling, and postprocessing played an important role.

  1. Pseudo-labeling: we saw a performance improvement when we used test-set predictions as additional training data; the intuition is that it helps the model learn the test-set distribution. Using all test-set predictions as soft labels worked better than any other variant of pseudo-labeling (e.g., hard labels or confidence-thresholded pseudo-labels). Towards the end of the competition, we discovered this gave a minor but material boost on the LB. Pseudo-labeling is best explained here by Chris, from whom I learned a lot; a minimal sketch follows this list.
  2. Cross-validation: we started with k-fold CV and a hold-out validation set, but once we refined the test predictions and trained on pseudo-labels plus the validation set, the validation metric became noisy to the point where we relied primarily on the public LB score.
  3. Postprocessing: we exploited the history of our submissions by tweaking the test-set predictions. Tracking the per-sample delta of predictions across successful submissions, averaging those deltas, and nudging the predictions in the same direction paved the way for the winning solution; see the second sketch below.
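
A minimal sketch of soft-label pseudo-labeling, assuming PyTorch; the toy tensors and the linear stand-in model are hypothetical (in the competition the features came from multilingual transformers), but the key point holds: `BCEWithLogitsLoss` accepts soft targets in [0, 1], so the raw test probabilities go in unrounded.

```python
import torch
import torch.nn as nn

# Toy stand-ins: labelled train data plus test features with soft labels
# taken from a previous blended submission (probabilities, not 0/1).
X_train, y_train = torch.randn(256, 16), torch.randint(0, 2, (256,)).float()
X_test, p_test = torch.randn(128, 16), torch.rand(128)

# Concatenate hard labels with soft pseudo-labels; no thresholding.
X_pl = torch.cat([X_train, X_test])
y_pl = torch.cat([y_train, p_test])

model = nn.Linear(16, 1)  # stand-in for the real classifier
loss_fn = nn.BCEWithLogitsLoss()  # accepts soft targets in [0, 1]
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(5):
    opt.zero_grad()
    loss = loss_fn(model(X_pl).squeeze(1), y_pl)
    loss.backward()
    opt.step()
```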

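And a sketch of the submission-history nudge from item 3; the step size `alpha` and the synthetic prediction arrays are assumptions, not the values actually used.

```python
import numpy as np

# Toy stand-ins for the prediction column of three successful
# submissions, ordered by public LB score (each improved on the last).
rng = np.random.default_rng(0)
sub_v1 = rng.random(1000)
sub_v2 = np.clip(sub_v1 + rng.normal(0, 0.02, 1000), 0, 1)
sub_v3 = np.clip(sub_v2 + rng.normal(0, 0.02, 1000), 0, 1)

# Average per-sample delta across consecutive successful submissions.
avg_delta = ((sub_v2 - sub_v1) + (sub_v3 - sub_v2)) / 2

# Nudge the latest predictions further in the direction that has been
# paying off on the LB, then keep them valid probabilities.
alpha = 0.3  # assumed step size
nudged = np.clip(sub_v3 + alpha * avg_delta, 0.0, 1.0)
```
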
2. TReNDS Neuroimaging Competition

In this competition, reading the MRI data was a bit tedious, so preprocessing the data into constructed features and postprocessing the predictions played a major role; a short TL;DR closes the section.

  1. Preprocessing: adding a bias to different columns of the test set to bring it closer to the train set. Linear models performed well in this competition, so adding biases was expected to help a lot (at least for the linear models). There are many ways to estimate possible biases: we minimized the Kolmogorov-Smirnov (KS) test statistic between the train and test distributions; see the sketch after this list. To learn more about the KS test statistic, here is a notebook.
  2. Postprocessing: the postprocessing used the same logic as the preprocessing, that is, calculating the KS statistic and finding the best-fitting shift. But its effect was small: only about 5e-5 on both the public and private scores.
  3. TL;DR: incremental PCA for the 3D images; offsets (biases) for the test features, like we did in ION.
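
A minimal sketch of the bias search, assuming scipy and a single feature column; the synthetic data and the grid of candidate shifts are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

# Toy stand-ins for one feature column in train and test; here the
# test column sits at a deliberate offset from the train column.
rng = np.random.default_rng(0)
train_col = rng.normal(0.0, 1.0, 5000)
test_col = rng.normal(0.15, 1.0, 5000)

# Grid-search the additive bias that minimizes the KS statistic
# between the shifted test column and the train column.
candidates = np.linspace(-0.5, 0.5, 201)
stats = [ks_2samp(train_col, test_col + b).statistic for b in candidates]
best_bias = candidates[int(np.argmin(stats))]

test_col_aligned = test_col + best_bias  # aligned column for the model
print(f"best bias: {best_bias:.3f}")
```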

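Finally, a hedged sketch of the incremental PCA from the TL;DR, assuming the 3D volumes are flattened to vectors; the shapes, component count, and batch size are illustrative, not the competition settings.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Illustrative shapes: 200 subjects, each a small 3D volume flattened
# to a vector. The real TReNDS volumes are far larger, which is why
# fitting PCA batch by batch (instead of all at once) matters.
rng = np.random.default_rng(0)
n_subjects, volume_shape = 200, (10, 12, 8)
X = rng.normal(size=(n_subjects, int(np.prod(volume_shape))))

ipca = IncrementalPCA(n_components=20)
batch = 40  # each batch must contain at least n_components samples
for start in range(0, n_subjects, batch):
    ipca.partial_fit(X[start:start + batch])  # never holds all data at once

features = ipca.transform(X)  # (200, 20) compressed image features
```
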
#deep-learning #solutions #neural-networks #artificial-intelligence #kaggle
