Editor’s note: The Towards Data Science podcast’s “Climbing the Data Science Ladder” series is hosted by Jeremie Harris. Jeremie helps run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:

There’s been a lot of talk in data science circles about techniques like AutoML, which are dramatically reducing the time it takes for data scientists to train and tune models and create reliable experiments. But that trend towards increased automation, robustness and reliability doesn’t end with machine learning: increasingly, companies are focusing their attention on automating earlier parts of the data lifecycle, including the critical task of data engineering.

Today, many data engineers are unicorns: they not only have to understand the needs of their customers, but also how to work with data and which software engineering tools and best practices to use to set up and monitor their pipelines. Pipeline monitoring in particular is time-consuming and, just as importantly, isn’t a particularly fun thing to do. Luckily, people like Sean Knapp, a former Googler turned founder of the data engineering startup Ascend.io, are leading the charge to make automated data pipeline monitoring a reality.

We had Sean on this latest episode of the Towards Data Science podcast to talk about data engineering: where it’s at, where it’s going, and what data scientists should really know about it to be prepared for the future. Here were my favourite take-homes:

  • A very large amount of a data engineer’s time is spent monitoring data pipelines for performance rather than building new pipelines or tooling. Because of that, and because of the complexity involved in building these pipelines with current technology, data engineering projects can have huge time horizons (many years, in some cases). The result is a lot of frustrated data scientists who have to wait far too long to get access to the data they need today. If that happens often enough, frustrations can reach a fever pitch, leading to what Sean calls a “data mutiny”.

  • One great strategy to avoid a data mutiny is to build cross-functional teams that force data engineers to work alongside data scientists and analysts, so that everyone is aware of the unique challenges that present themselves at each stage of the data lifecycle. As an added bonus, small, cross-functional teams can afford to focus more directly on the business problem they’re trying to tackle, since there’s enough knowledge on the team to understand it fully.

  • If you’re trying to dip your toes into the data engineering water, whether as an aspiring data engineer or as a data scientist preparing for the future, Sean recommends building projects that account for the possibility of regulatory compliance. For example, “right to be forgotten” legislation now allows European citizens to request that their data, and the impact that data has had on any algorithms that consumed it, be withdrawn. So an interesting exercise to work into your next project might be: how do I design a pipeline so that I can trace the influence a particular data point has on it, and erase that influence on demand? (See the sketch after this list.)
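To make that last exercise concrete, here’s a minimal sketch, in plain Python with no external libraries, of one way a pipeline could carry per-record lineage so that a deletion request can be honoured end to end. Everything here (the Record, DerivedRecord, and Pipeline classes, and the forget method) is a hypothetical illustration of the idea, not Ascend.io’s API or Sean’s recommended design.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    source_id: str   # stable ID of the raw input row
    user_id: str     # the data subject this row belongs to
    value: float


@dataclass
class DerivedRecord:
    value: float
    lineage: set = field(default_factory=set)  # source_ids that influenced this value


class Pipeline:
    def __init__(self, raw_records):
        self.raw = list(raw_records)
        self.derived = []

    def run(self):
        """Recompute a per-user average, carrying lineage through the transform."""
        by_user = {}
        for r in self.raw:
            by_user.setdefault(r.user_id, []).append(r)
        self.derived = [
            DerivedRecord(
                value=sum(r.value for r in rows) / len(rows),
                lineage={r.source_id for r in rows},
            )
            for rows in by_user.values()
        ]

    def forget(self, user_id):
        """Erase one user's raw rows, then rebuild anything their data touched."""
        doomed = {r.source_id for r in self.raw if r.user_id == user_id}
        self.raw = [r for r in self.raw if r.user_id != user_id]
        # In a real system the lineage sets would let you recompute only the
        # affected partitions; here we simply drop tainted outputs and rerun.
        self.derived = [d for d in self.derived if not (d.lineage & doomed)]
        self.run()


pipeline = Pipeline([
    Record("row-1", "alice", 10.0),
    Record("row-2", "alice", 20.0),
    Record("row-3", "bob", 30.0),
])
pipeline.run()
pipeline.forget("alice")  # alice's rows and their downstream influence are erased
print(pipeline.derived)   # only bob's aggregate survives
```

The key design choice is that lineage travels with every derived value, so the pipeline always knows which outputs are tainted by a given source row; at production scale you would likely track lineage at the partition level rather than per record to keep the bookkeeping cheap.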

#data-engineering #tds-podcast #data-science #towards-data-science #data-analysis
