Sampling isn’t enough, profile your ML data instead

It’s 2020 and most of us still don’t know when, where, why, or how our models go wrong in production. While we all know that “what can go wrong, will go wrong,” or that “the best-laid plans of mice and [data scientists] often go awry,” complicated models and data pipelines are all too often pushed to production with little attention paid to diagnosing the inevitable unforeseen failures.

In traditional software, logging and instrumentation have been adopted as standard practice to create transparency and make sense of the health of a complex system. When it comes to AI applications, logging is often spotty and incomplete. In this post, we compare and contrast different approaches to ML logging. Finally, we offer an open-source library called WhyLogs that enables data logging and profiling in only a few lines of code.

#data-science #logging #machine-learning #sampling #mlops

towardsdatascience.com
