Editor’s note: The Towards Data Science podcast’s “Climbing the Data Science Ladder” series is hosted by Jeremie Harris. Jeremie helps run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:

It’s cliché to say that data cleaning accounts for 80% of a data scientist’s job, but it’s directionally true.

That’s too bad, because fun things like data exploration, visualization and modelling are the reason most people get into data science. So it’s a good thing that there’s a major push underway in industry to automate data cleaning as much as possible.

One of the leaders of that effort is Ihab Ilyas, a professor at the University of Waterloo and founder of two companies, Tamr and Inductiv, both of which are focused on the early stages of the data science lifecycle: data cleaning and data integration. Ihab knows an awful lot about data cleaning and data engineering, and has some really great insights to share about the future direction of the space — including what work is left for data scientists, once you automate away data cleaning.

Here are some of my biggest takeaways from the conversation:

  • Data cleaning involves a lot of things, one of which is dealing with missing values. Historically, missing values have often been filled in manually by subject matter experts who can make educated guesses about the data, but automated techniques can work well (and usually do better) at scale.
  • These automated strategies can range from fairly naive approaches (e.g. replacing a value with the median or average value of other points in the dataset), to more sophisticated techniques (e.g. using a predictive model to guess at missing values).
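To make the contrast between those two ends of the spectrum concrete, here is a minimal sketch in plain Python. It is not a method from the podcast: the function names, the toy experience/salary data, and the choice of a simple least-squares line as the "predictive model" are all assumptions made for illustration.

```python
from statistics import median


def impute_median(values):
    """Naive approach: replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_linear(xs, ys):
    """Model-based approach: fit a least-squares line on the complete (x, y)
    pairs, then use it to predict each missing y from its x."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]


# Hypothetical toy data: years of experience and salary (in thousands).
# The most senior person's salary is missing.
experience = [1, 2, 3, 4, 10]
salary = [40.0, 50.0, 60.0, 70.0, None]

median_filled = impute_median(salary)
model_filled = impute_linear(experience, salary)

print("median imputation:", median_filled[-1])
print("model imputation: ", model_filled[-1])
```

In this toy case the median fill ignores everything else known about the row, while the fitted line uses the related column to make its guess. Real datasets are rarely this tidy, so treat this only as an illustration of the idea.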

