ML Programming Hacks that every Data Engineer should know. A wide-ranging cheat sheet for Data Scientists & Machine Learning practitioners out there.
What Constitutes a Perfect Data Team? A guide to understanding who the members of a Data Team are and the key role each of them plays!
Learn how to use the API approach to enable full-duplex, asynchronous data transfer between client and server using WebSockets, which is an upgrade from HTTP, with working Python code.
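The "upgrade from HTTP" in that article refers to the WebSocket opening handshake, where the server answers an HTTP `Upgrade` request. A minimal standard-library sketch of the server's side of that computation (the sample key below is the one from RFC 6455, not from the article's code):

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the WebSocket handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(client_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header value the server must
    return to complete the HTTP Upgrade and switch protocols."""
    digest = hashlib.sha1((client_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# Sample Sec-WebSocket-Key from RFC 6455, section 1.3.
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))
# → s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

Once the server returns this header with status `101 Switching Protocols`, the same TCP connection carries framed messages in both directions, which is what makes the transfer full duplex.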
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
To start, I’ll assume basic understanding of Airflow functionality and containerization using Docker and Docker Compose. More resources can be found here for Airflow, here for Docker, and here for Docker Compose.
Orchestrate Jupyter Notebooks in 5 minutes. Replacing Airflow for notebook users, by Bayesnote. The Jupyter notebook is the most popular interactive development environment among data scientists.
In this post, I will show you an implementation of a Data Warehouse on AWS based on a case study performed a couple of months ago. This implementation uses AWS S3 as the Data Lake (DL), AWS Glue as the Data Catalog, and AWS Redshift and Redshift Spectrum as the Data Warehouse (DW).
A Data Warehouse is a component where your data is centralized, organized, and structured according to your organization's needs. It is used for data analysis and BI processes.
How much do you know about Sets in Python? Challenge yourself with these ‘Easy’ and ‘Medium’ LeetCode problems. In a recent article, I presented and shared solutions to a number of Python algorithms that I have been challenged with in real interviews.
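The set operations that most of those interview-style problems lean on can be sketched in a few lines (the `has_duplicates` helper is an illustrative example, not taken from the article):

```python
# Core set algebra that shows up constantly in LeetCode-style problems.
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

print(a & b)  # intersection → {3, 4}
print(a | b)  # union → {1, 2, 3, 4, 5, 6}
print(a - b)  # difference → {1, 2}
print(a ^ b)  # symmetric difference → {1, 2, 5, 6}

# Classic pattern: O(1) membership checks make duplicate
# detection linear instead of quadratic.
def has_duplicates(items):
    seen = set()
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False

print(has_duplicates([1, 2, 3, 2]))  # → True
```

The design point is that `in` on a set is a hash lookup, so the whole scan stays O(n), whereas the same check against a list would be O(n²).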
In this article I’m going to walk you through how you can scrape a table embedded in a PDF file, unit test that data using Great Expectations, and then, if the data is valid, save the file to S3 on AWS.
In this article we will learn how to read Parquet files with AWS Lambda. Planning to use AWS Lambda, I was looking at options for reading Parquet files within Lambda until I stumbled upon AWS Data Wrangler.
AWS Glue is a fully managed ETL service for loading large datasets from various sources for analytics and data processing with Apache Spark ETL jobs. In this post I will discuss the use of the AWS Glue Job Bookmarks feature in the following architecture.
Fundamentals of Data Architecture to Help Data Scientists Understand Architectural Diagrams Better.
This post aims to shed some light on the use case for Cloud Tasks. Even though I consider myself quite knowledgeable about the multiple GCP products related to data engineering, I had not come across a use case for Cloud Tasks before.
Introduction to Great Expectations, an Open Source Data Science Tool. This is the first webinar in our “Great Expectations 101” series. The goal of this webinar is to show you what it takes to deploy and run Great Expectations successfully.
Deploying Airflow on AWS is quite a challenge for those who don't have DevOps experience, that is, almost everyone who works in data.
Top 10 Most Used Pandas Features in a Data Science Project. There are many pandas tutorials on the internet and in books. The pandas library is one of the libraries most used by data scientists and data engineers. In this tutorial, I am going to list the pandas functions I use the most.
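To give a flavor of the kind of functions such a top-10 list covers, here is a minimal sketch of two staples, boolean filtering and `groupby` aggregation, assuming pandas is installed (the toy `city`/`sales` frame is invented for illustration):

```python
import pandas as pd

# A tiny frame to exercise two of the most common pandas calls.
df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA"],
    "sales": [100, 150, 80, 120],
})

# Boolean-mask filtering: keep only the rows with sales above 100.
big = df[df["sales"] > 100]
print(len(big))  # → 2

# groupby + sum: total sales per city.
totals = df.groupby("city")["sales"].sum()
print(totals["NYC"])  # → 250
print(totals["LA"])   # → 200
```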
Grouping a List of Dictionaries by Specific Key(s) in Python. The itertools module in Python provides an efficient way to loop over lists, tuples, and dictionaries. In this tutorial, the itertools.groupby function will be applied to group a list of dictionaries by a particular key.
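A minimal sketch of that technique (the `dept`/`name` records are made up for illustration):

```python
from itertools import groupby
from operator import itemgetter

records = [
    {"dept": "eng", "name": "Ana"},
    {"dept": "sales", "name": "Bo"},
    {"dept": "eng", "name": "Cy"},
]

# groupby only merges *consecutive* items sharing a key,
# so the list must be sorted by that same key first.
key = itemgetter("dept")
grouped = {
    dept: [r["name"] for r in rows]
    for dept, rows in groupby(sorted(records, key=key), key=key)
}
print(grouped)  # → {'eng': ['Ana', 'Cy'], 'sales': ['Bo']}
```

The sort-before-groupby step is the detail that trips people up: without it, the two `eng` records above would land in two separate groups.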
Use OptimalFlow’s autoFS module to implement ensemble feature selection, which greatly simplifies the process. Why use OptimalFlow? You can read another story introducing it: “An Omni-ensemble Automated Machine Learning — OptimalFlow”.