Intro to Data Engineering for Data Scientists

An overview of data infrastructure, a topic frequently asked about during interviews

Over my years of conducting interviews for my employers, I have found that fresh data science graduates often lack a good understanding of the data engineering side of the world. They typically swim in the ocean of machine learning algorithms and pay little attention to the work upstream and downstream of their models. Yet experience working with data engineers is highly valued by many companies, because it is critical to deploying data science models smoothly and efficiently. Understanding the bigger picture matters: it gives you a sense of all the components a company needs in order to solve business problems, and of the role that you and your work play in the organization.

This article will help you grasp the various elements of data infrastructure and also get familiar with some common tools and software that are used in each step.

Let’s get started.


Basic Concepts

Here are some basic terms that you will often hear data engineers use. Let's get these concepts out of the way first:

  • Server: think of servers as “remote computers” that can be accessed from your local computer (the client) via APIs.
  • API: an interface, often exposed as a URL endpoint or provided by a library, through which you can call a piece of software or a model without needing to know its internals.
  • Microservice: an architectural style for the systems companies build. Instead of building one complicated, monolithic application to serve multiple business needs, engineers isolate software functionality into multiple independent modules, each responsible for performing a standalone task. These modules communicate with each other through universally accessible APIs. For a more detailed explanation, check out this link.


Monolithic application: dev teams work on different functionalities of the same software


Microservices: each function is built as a standalone application, and the applications access each other through APIs

  • Docker: Docker containers can be understood as isolated environments where you store the libraries and package up the applications you need to run your code. They are handy for keeping package versions consistent and eliminating the hassle of re-installing packages when running on different computers and servers. Docker containers are spun up from Docker images: if you think of a Docker container as an operating system (like your Windows system, iOS system, or a virtual machine), a Docker image is a snapshot of that system at a point in time. You start from an image when you launch a fresh container, and you can create additional images that capture your work. When you share a Docker image with others, they will be able to run everything exactly the same way as you. For more details on how Docker works, check out this link.
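To make the microservice idea above concrete, here is a minimal sketch of a standalone service that exposes a toy model behind an HTTP API, using only Python's standard library. The endpoint name (`/predict`) and the "model" are hypothetical, purely for illustration:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def toy_model(features):
    # Hypothetical stand-in for a real model: just sums the inputs.
    return sum(features)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # This module has one responsibility: serving predictions at /predict.
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": toy_model(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def serve(port=8000):
    # Other applications never import this code; they just POST
    # {"features": [1, 2, 3]} to http://localhost:<port>/predict.
    return HTTPServer(("", port), PredictHandler)

# To run the service: serve().serve_forever()
```

Another team's application, written in any language, can now call this service over HTTP without knowing anything about its internals, which is exactly what lets each microservice be developed and deployed independently.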

