About a decade ago, when data science jobs started going mainstream, there was a flood of opportunities in the tech world. However, most companies didn’t know what to make of it. At one of my earlier stints, I repeatedly heard phrases like _we’re doing big data_ and _we’re doing data science_. Because it was advertised that data scientists get big paychecks, data analysts, database administrators and data engineers all wanted to be data scientists, without understanding what it takes to be one.

This is not the age of specialization. One needs to be a generalist who specializes in something, just like in life: one can be a neurosurgeon and still drive a car. It’s conceivable for one person to be both a data engineer and a data scientist, but it’s rare in practice because the combined area of responsibility is too broad. Similarly, you’re highly unlikely to find someone who is a neurosurgeon at night and drives an Uber during the day.

Specialization is for insects — Robert A. Heinlein

Being a data engineer and a data scientist, both in one, also comes with the challenge of diving into the vast ocean of knowledge in both of these data-related fields. A data engineer should be able to do basic data sciency stuff, and a data scientist should be able to do basic data engineering. The same can be said about other fields of software: a data engineer should be able to do basic frontend work, and so on.

Having said that, what distinguishes these fields from one another is not so much the skill set as the thought process.

It doesn’t matter so much what you think, but how you think it — Christopher Hitchens

Plumbers Or Not

One of my managers used to draw an interesting analogy between data engineering and plumbing. Data engineers move data from one place to another. Just as cooking gas or drinking water needs a pipeline to move from the plant to your house, data needs a pipeline to move from one system to another. At the risk of sounding rude and engineer-splaining, I don’t want to push this analogy too far, but it is rather true if you think about it.

Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers, giving meaning to an otherwise static entity — Dave Bianco

Data engineers are plumbers. But they are also more than that. In addition to making sure that data is transported from one place to another, data engineers make sure that the quality of data is good for use.

They also gauge how the data is going to be used and, based on that, decide how to store it, how best to retrieve it, how to process it, and so on. Examples include choosing between traditional relational databases, data warehouses and NoSQL data stores; choosing between columnar and row-oriented data stores; and choosing task schedulers and data processing infrastructure.
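To make the row-oriented versus columnar choice a little more concrete, here is a minimal sketch in plain Python. The order data and the helper names `get_order` and `total_amount` are made up for illustration; real data stores are far more sophisticated, but the access pattern is the point.

```python
# Row-oriented layout: each record is kept together.
# Good when you usually fetch or update whole records.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "ben", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]

# Columnar layout: each field is kept together.
# Good when you usually aggregate over one or two fields.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def get_order(order_id):
    """Fetch one whole record; natural in the row-oriented layout."""
    return next(row for row in rows if row["order_id"] == order_id)


def total_amount():
    """Aggregate one field; only the 'amount' column is touched."""
    return sum(columns["amount"])
```

If most queries look like `get_order`, a row-oriented store tends to fit better; if most look like `total_amount`, a columnar store usually does.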

While a data engineer might be a plumber, a data scientist is the one who accesses the water through the plumbed pipes and makes lemonade.

Read Robert Chang’s three-part introduction to data engineering.

Probabilistic vs. Deterministic Thinking

Let’s come to the main point of difference between a data engineer and a data scientist. Obviously, the job titles are different, the KRAs are different but they can surely overlap. The main quality that distinguishes these two creatures is how they think.

A data engineer thinks in terms of movement, strictness, predictability, cleanliness and resilience, both of the data and of the systems carrying the data.

There’s a striking difference between how these two approach handling data. Movement of data, for example, should be deterministic. If some data is supposed to arrive at one location from another, it should. If a transformation is to be applied to a dataset for cleaning or modification, it should happen. Data engineering, in that sense, should be predictable, dependable and resilient: in a word, deterministic.
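As a rough illustration of what deterministic means for a cleaning step, here is a minimal sketch. The record shape and the `clean_records` function are hypothetical, not taken from any particular pipeline; the idea is only that the same input always yields the same output, and that re-running the step on already clean data changes nothing.

```python
def clean_records(records):
    """Clean a batch of records in a repeatable way.

    - drops rows without an id
    - normalizes emails (trimmed, lowercased)
    - keeps the last row seen for each id
    - returns rows sorted by id, so output order never depends on input order
    """
    cleaned = {}
    for row in records:
        if row.get("id") is None:
            continue
        cleaned[row["id"]] = {
            "id": row["id"],
            "email": (row.get("email") or "").strip().lower(),
        }
    return [cleaned[key] for key in sorted(cleaned)]


raw = [
    {"id": 2, "email": " Bob@Example.com "},
    {"id": 1, "email": "alice@example.com"},
    {"id": None, "email": "ghost@example.com"},
    {"id": 2, "email": "bob@example.com"},
]

first_run = clean_records(raw)
second_run = clean_records(raw)
```

Running the step twice on `raw` gives identical results, and feeding `first_run` back into `clean_records` returns it unchanged, which is what makes a failed job safe to retry.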
