ML Programming Hacks that every Data Engineer should know

A wide-ranging cheat sheet for Data Scientists and Machine Learning practitioners out there.

What Constitutes a Perfect Data Team?

A guide to understanding who the members of a Data Team are and what key role each of them plays!

WebSockets: Lesser Known Pattern in Data Engineering

Learn how to enable full-duplex, asynchronous data transfer between client and server using WebSockets, an upgrade over HTTP, with working Python code.
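
As a rough illustration of the pattern (not the article's own code), a minimal echo-style server with the third-party websockets library might look like this; the handler name and port are arbitrary, and a single-argument handler assumes websockets 10.1 or newer:

```python
# Minimal WebSocket echo server sketch using the `websockets` library
# (assumed dependency, websockets >= 10.1; install with `pip install websockets`).
import asyncio
import websockets

async def echo(websocket):
    # Full duplex: read messages from the client and push replies back
    # over the same connection as they arrive.
    async for message in websocket:
        await websocket.send(f"server received: {message}")

async def main():
    # Serve on localhost:8765 (arbitrary port for this sketch).
    async with websockets.serve(echo, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```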

Kafka Concepts

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
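
For a concrete feel of the core concepts (topics, producers, consumers), here is a minimal sketch using the third-party kafka-python client; the broker address, topic, and group id are placeholders, not anything from the article:

```python
# Minimal Kafka producer/consumer sketch using kafka-python
# (assumed dependency; broker, topic and group id are placeholders).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user_id": 42, "action": "click"}')
producer.flush()  # block until buffered records are delivered

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    group_id="demo-group",
)
for record in consumer:  # blocks, polling the topic for new records
    print(record.value)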

Airflow in Docker Metrics Reporting

To start, I’ll assume a basic understanding of Airflow functionality and of containerization using Docker and Docker Compose. More resources can be found here for Airflow, here for Docker, and here for Docker Compose.

How to orchestrate Jupyter notebooks?

Orchestrate Jupyter Notebooks in 5 minutes with Bayesnote, a replacement for Airflow aimed at notebook users. The Jupyter notebook is the most popular interactive development environment among data scientists.

A Data Warehouse Implementation on AWS

In this post, I will show you an implementation of a Data Warehouse on AWS based on a case study performed a couple of months ago. This implementation uses AWS S3 as the Data Lake (DL), AWS Glue as the Data Catalog, and AWS Redshift with Redshift Spectrum as the Data Warehouse (DW).
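
As a small illustration of the S3-plus-Glue-catalog half of that architecture (not the case study's actual code), AWS Data Wrangler can land a DataFrame in the Data Lake and register it in the Glue Data Catalog, from where Redshift Spectrum can query it; the bucket, database, and table names below are placeholders and the Glue database is assumed to already exist:

```python
# Sketch: write data to S3 (the Data Lake) and register it in the Glue
# Data Catalog so Redshift Spectrum can query it. Bucket, database and
# table names are placeholders; the Glue database must already exist.
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2], "amount": [19.99, 5.00]})

wr.s3.to_parquet(
    df=df,
    path="s3://my-data-lake/orders/",  # placeholder bucket/prefix
    dataset=True,                      # dataset layout (partition-aware)
    database="analytics",              # Glue Data Catalog database
    table="orders",
)
```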

What is a Data Warehouse: Basic Architecture

A Data Warehouse is a component where your data is centralized, organized, and structured according to your organization's needs. It is used for data analysis and BI processes.

5 Sets Algorithms to Solve Before Your Python Coding Screen

How much do you know about Sets in Python? Challenge yourself with these ‘Easy’ and ‘Medium’ LeetCode problems. In a recent article, I presented and shared solutions for a number of Python algorithms that I have been challenged with in real interviews.
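
For a taste of the kind of problem sets make trivial (a generic example, not one of the article's problems), intersecting two arrays becomes a one-liner once both are converted to sets:

```python
# Classic "intersection of two arrays" pattern: convert to sets and
# intersect, trading nested scanning for average O(n + m) hashing.
def intersection(nums1, nums2):
    return list(set(nums1) & set(nums2))

print(intersection([1, 2, 2, 1], [2, 2]))        # [2]
print(intersection([4, 9, 5], [9, 4, 9, 8, 4]))  # [9, 4] (order not guaranteed)
```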

Scraping a table in a PDF reliably, then testing data quality

In this article I’m going to walk you through how you can scrape a table embedded in a PDF file, test that data using Great Expectations, and then, if it is valid, save the file to S3 on AWS.
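
A compressed sketch of that pipeline, assuming the camelot-py library for table extraction and the classic Great Expectations Dataset API (pre-0.16); the file names, bucket, and validated column are placeholders rather than the article's own:

```python
# Sketch: extract a table from a PDF, validate it, then upload to S3.
# Assumes camelot-py, great_expectations (< 0.16) and boto3 are installed.
import camelot
import great_expectations as ge
import boto3

# 1. Scrape the first table found on page 1 into a pandas DataFrame.
tables = camelot.read_pdf("report.pdf", pages="1")
df = tables[0].df

# 2. Validate the data with Great Expectations (legacy Dataset API).
ge_df = ge.from_pandas(df)
result = ge_df.expect_column_values_to_not_be_null(column=df.columns[0])

# 3. Only persist the file to S3 if the expectation passed.
if result.success:
    df.to_csv("report.csv", index=False)
    boto3.client("s3").upload_file("report.csv", "my-bucket", "validated/report.csv")
```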

Reading Parquet files with AWS Lambda

In this article we will learn how to read Parquet files with AWS Lambda. Planning to use AWS Lambda, I was looking at options for reading Parquet files within Lambda until I stumbled upon AWS Data Wrangler.
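
To show why AWS Data Wrangler fits so well here (a sketch, not the article's handler), reading a Parquet file from S3 inside a Lambda handler reduces to a single call; the bucket and key are placeholders, and the awswrangler package is assumed to be attached via a Lambda layer:

```python
# Sketch of a Lambda handler that reads a Parquet file from S3 with
# AWS Data Wrangler (awswrangler). Bucket/key are placeholders.
import awswrangler as wr

def lambda_handler(event, context):
    df = wr.s3.read_parquet(path="s3://my-bucket/data/file.parquet")
    # Return something small and JSON-serialisable from the handler.
    return {"rows": len(df), "columns": list(df.columns)}
```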

Implementing Glue ETL job with Job Bookmarks

AWS Glue is a fully managed ETL service for loading large datasets from various sources for analytics and data processing with Apache Spark ETL jobs. In this post I will discuss the use of the AWS Glue Job Bookmarks feature in the following architecture.
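
The pieces a bookmark relies on in a Glue script are job.init(), a transformation_ctx on the source read, and job.commit() at the end. A trimmed sketch is below; it only runs inside a Glue job environment, and the database, table, and output path are placeholders rather than anything from the post:

```python
# Trimmed Glue ETL script sketch showing what Job Bookmarks depend on:
# job.init(), transformation_ctx on the source, and job.commit().
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx lets the bookmark remember which files/partitions
# of this source were already processed by previous runs.
source = glue_context.create_dynamic_frame.from_catalog(
    database="analytics",        # placeholder Glue database
    table_name="raw_orders",     # placeholder table
    transformation_ctx="source",
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/orders/"},
    format="parquet",
)

# Persist the bookmark state for the next run.
job.commit()
```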

Fundamentals of Data Architecture to Help Data Scientists Understand Architectural Diagrams Better

Fundamentals of Data Architecture to Help Data Scientists Understand Architectural Diagrams Better.

GCP Serverless Design Pattern: Adhering to rate & concurrency limits with Cloud Tasks

This post aims to shed some light on the use case for Cloud Tasks. Even though I consider myself quite knowledgeable about the many GCP products related to data engineering, I had not come across a use case for Cloud Tasks before.
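
For context (a sketch, not the post's code), enqueueing an HTTP task with the google-cloud-tasks client looks roughly like this; the queue itself carries the rate and concurrency limits, so callers simply keep creating tasks. The project, location, queue name, and target URL are placeholders:

```python
# Sketch: enqueue an HTTP task on a Cloud Tasks queue whose configured
# rate and concurrency limits throttle delivery to the worker URL.
# Requires the google-cloud-tasks package; all names are placeholders.
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://example.com/process",
        "headers": {"Content-Type": "application/json"},
        "body": b'{"record_id": 123}',
    }
}

# The queue, not the caller, enforces the dispatch rate and concurrency.
response = client.create_task(request={"parent": parent, "task": task})
print(response.name)
```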

Introduction to Great Expectations, an Open Source Data Science Tool

This is the first completed webinar in our “Great Expectations 101” series. The goal of this webinar is to show you what it takes to deploy and run Great Expectations successfully.

How to deploy Airflow on AWS: best practices

Deploying Airflow on AWS is quite a challenge for those who don't have DevOps experience, that is, almost everyone who works in data.

Top 10 most used Pandas features in a Data Science Project

There are many tutorials and books about pandas. The pandas library is one of the libraries most used by data scientists and data engineers. In this tutorial, I am going to list the pandas functions I use the most.
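
As a flavour of the kind of features such a list tends to cover (illustrative only, not the author's actual top 10), a few of the usual suspects look like this:

```python
# A few pandas operations that show up in almost every project:
# boolean filtering, groupby aggregation, and merging (joins).
import pandas as pd

orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cat"],
    "amount": [10.0, 25.0, 5.0, 40.0],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cat"],
    "country": ["US", "DE", "FR"],
})

big_orders = orders[orders["amount"] > 9]                   # boolean filtering
per_customer = orders.groupby("customer")["amount"].sum()   # aggregation
enriched = orders.merge(customers, on="customer")           # joining tables

print(per_customer)
print(enriched.head())
```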

Grouping List of Dictionaries By Specific Key(s) in Python

The itertools module in Python provides efficient tools for looping over lists, tuples, and dictionaries. In this tutorial, the itertools.groupby function will be used to group a list of dictionaries by a particular key.
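
The core idea looks like this (a generic sketch; the tutorial's own data will differ). Note that groupby only groups consecutive items, so the list must be sorted by the same key first:

```python
# Group a list of dictionaries by a key with itertools.groupby.
# groupby only merges *consecutive* items, hence the sort by the same key.
from itertools import groupby
from operator import itemgetter

rows = [
    {"team": "data", "name": "ann"},
    {"team": "ml", "name": "bob"},
    {"team": "data", "name": "cat"},
]

key = itemgetter("team")
grouped = {
    team: list(members)
    for team, members in groupby(sorted(rows, key=key), key=key)
}
print(grouped)
# {'data': [{'team': 'data', 'name': 'ann'}, {'team': 'data', 'name': 'cat'}],
#  'ml': [{'team': 'ml', 'name': 'bob'}]}
```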

JavaScript for Data Engineers

The latest Stack Overflow Developer Survey ranked JavaScript as the most popular technology, closely followed by SQL as the third most popular. JavaScript was considered a client-side scripting/front-end language until a number of years ago, when JavaScript-based servers gained widespread attention.

Ensemble Feature Selection in Machine Learning by OptimalFlow

Use OptimalFlow’s autoFS module to implement ensemble feature selection, which greatly simplifies the process. Why use OptimalFlow? You can read its introduction in another story: “An Omni-ensemble Automated Machine Learning — OptimalFlow”.