Machine learning (ML) systems are complex, and the more complex a system is, the more failure modes there are. Knowing what can go wrong is essential for building robust ML systems. Together, we will explore possible pitfalls that can occur at 5 different maturity levels, using concrete examples.

Level 0 — Problem definition

Level 1 — Your first ML model

Level 2 — Generalization

Level 3 — System-level performance

Level 4 — Performance is not the _outcome_

The last part of this post covers how to avoid these pitfalls. It is better to focus on avoiding pitfalls than on squeezing every last bit of accuracy out of your model.

“It is remarkable how much long-term advantage people like us have gotten by trying to be consistently not stupid, instead of trying to be very intelligent.”

— Charlie Munger

Level 0 — Problem definition

Real-world problems rarely manifest themselves as tractable data science problems. Therefore, the very first step of any ML project is to formulate the problem: to convert a high-level goal into a well-defined data science problem.

The greatest threat at this level is to come up with a problem definition that, when solved, won't actually help anyone. Two examples from this article:

1) Most studies applying deep learning to echocardiogram analysis try to surpass a physician’s ability to predict disease. But predicting normal heart function would actually save cardiologists more time by identifying patients who do not need their expertise.

2) Many studies applying machine learning to viticulture aim to optimize grape yields, but winemakers “want the right levels of sugar and acid, not just lots of big watery berries”.

Level 1 — Your first ML model

This level is about making an ML model work on a test set sitting on your laptop. This is no easy task, but most of the content out there already focuses on this stage, so I will only mention one pitfall:

Pitfall 1.1 — Assuming more data will solve all of your problems

Irrelevant features or low-quality data lower the upper bound on the performance you can reach, and no amount of additional data will raise that bound.

Underfitting due to low model capacity is another case where more data won’t help.
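A minimal sketch of this second case, using synthetic data I made up for illustration: a straight line fitted to a sine wave underfits, and its test error barely moves whether you train on a hundred points or a hundred thousand.

```python
import numpy as np

def underfit_error(n_samples, rng):
    """Test MSE of a low-capacity (linear) model on nonlinear data."""
    # Nonlinear ground truth: y = sin(2*pi*x) plus a little noise
    x = rng.uniform(0, 1, n_samples)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, n_samples)
    # Low-capacity model: a degree-1 polynomial (a straight line)
    coeffs = np.polyfit(x, y, deg=1)
    # Evaluate on a fresh test set drawn from the same distribution
    x_test = rng.uniform(0, 1, 5000)
    y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.1, 5000)
    preds = np.polyval(coeffs, x_test)
    return np.mean((preds - y_test) ** 2)

rng = np.random.default_rng(0)
small = underfit_error(100, rng)
large = underfit_error(100_000, rng)
print(f"MSE with 100 samples:     {small:.3f}")
print(f"MSE with 100,000 samples: {large:.3f}")
```

Both errors stay stuck near the same floor: the gap to good performance here is a capacity problem, not a data problem, so the fix is a more expressive model, not more rows.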


Machine learning pitfalls