Deep Dive Into Join Execution in Apache Spark

This post is dedicated to every aspect of join execution in Apache Spark.

Processing Engines for Big Data

This article is based on my previous article, “Big Data Pipeline Recipe,” where I gave a quick overview of all aspects of the Big Data world.

Introduction: data-driven workflows in Microsoft Azure Data Factory.

Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the movement of data between supported data stores. It also lets you process data by using compute services in other regions or in an on-premises environment.

Open-Source Data Integration and ETL in 2020

In this article, we want to analyze the first point: the landscape of open-source data integration technologies.

Data Engineering — How to Build an ETL Pipeline Using SSIS in Visual Studio 2019

In this article, I will discuss how this can be done using Visual Studio 2019. You can also just clone the GitHub project and use it as your SSIS starter project.

Keep your data clean with data testing

In this article, we will focus on Python code and use the great-expectations package for testing. We will concentrate on Pandas DataFrames, but tests for PySpark and other tools are also supported by great-expectations.
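As a plain-pandas sketch of the kind of checks the article formalizes with great-expectations (the column names here are hypothetical, chosen only for illustration):

```python
# Hand-rolled data tests on a Pandas DataFrame; great-expectations wraps
# checks like these as declarative expectations, e.g.
# expect_column_values_to_be_not_null or expect_column_values_to_be_between.
import pandas as pd

df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "age": [34, 29],
})

# Fail loudly if the data violates our assumptions about it.
assert df["email"].notna().all(), "email must not contain nulls"
assert df["age"].between(0, 120).all(), "age must be in a plausible range"
```

The value of a package like great-expectations over raw assertions is that expectations are declarative, documented, and reusable across Pandas, PySpark, and SQL backends.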

Open-Source vs. Commercial Software: How To Better Solve Data Integration

A breakdown of a dbt Slack debate on the state of open-source alternatives to Fivetran and whether an OSS approach is more relevant than commercial software. In this article, we discuss the second point and go over the arguments made by each side; the first point will come in another article.

Big Data Engineering — Flowman up and running

See the open-source, Spark-based ETL tool Flowman in action right on your machine.

PySpark ETL from MySQL and MongoDB to Cassandra

In Apache Spark/PySpark we work with lazy abstractions: the actual processing happens only when we want to materialize the result of an operation. To connect to different databases and file systems, we mostly use ready-made libraries.
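The lazy-evaluation idea can be sketched in plain Python. This is a toy illustration, not the real PySpark API: transformations only record a plan, and nothing runs until an action materializes the result.

```python
# Toy illustration of Spark-style lazy evaluation -- not actual PySpark.
class LazyDataset:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded operations, not yet executed

    def map(self, fn):                # a "transformation": returns a new plan
        return LazyDataset(self.data, self.ops + [("map", fn)])

    def filter(self, pred):           # also a transformation
        return LazyDataset(self.data, self.ops + [("filter", pred)])

    def collect(self):                # an "action": triggers the computation
        rows = self.data
        for kind, fn in self.ops:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has run yet; only collect() executes the recorded plan.
result = ds.collect()  # [20, 30, 40]
```

In real PySpark, `map` and `filter` likewise build up a plan, and actions such as `collect()` or `count()` trigger execution, which lets the engine optimize the whole pipeline before running it.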

AWS Glue Vs. EMR: Differentiating two of the best ETL platforms

Amazon Web Services provides two service options capable of performing ETL: Glue and Elastic MapReduce (EMR). If they both do a similar job, why would you choose one over the other? This article details some fundamental differences between the two.

What is Persistent ETL and Why Does it Matter?

If you’ve made it to this blog, you’ve probably heard the term “persistent” thrown around with ETL and are curious about what the two really mean together.

An introduction to SSIS Data Lineage concepts

This article introduces various SSIS data lineage concepts through practical demonstrations.

Understanding Data Lineage in ETL

This article explains what data lineage is within an ETL pipeline and how to implement it.

Google Announces a New, More Services-Based Architecture Called Runner V2

Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines on the Google Cloud Platform (GCP). In a recent blog post, Google announced Runner v2, a new, more services-based architecture for Dataflow that will include multi-language support for all of its language SDKs.

How to Set up a COVID-19 Workflow and Dashboard Using the Google Cloud

Perspectives of a beginner coder. Recommended prerequisites: an understanding of Python (pandas and working with DataFrames), Jupyter Notebook, SQL table operations, and GitHub.

Data engineering in 2020

In this post, I will talk about the evolution of data engineering and what skills “traditional” data developers might need to learn today (Hint: it is not Hadoop).

How to write ETL operations in Python

Clean and transform raw data into an ingestible format using Python. In this article, you’ll learn how to work with Excel/CSV files in a Python environment to clean and transform raw data into a more ingestible format. This is typically useful for data integration.
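A minimal sketch of such a clean-and-transform step, using only the standard library (the column names and normalization rules here are hypothetical examples, not taken from the article):

```python
# Read messy CSV data, strip stray whitespace, and normalize casing.
import csv
import io

# Stand-in for an uploaded CSV file with inconsistent formatting.
raw = io.StringIO("name,city\n Alice ,new york\nBob,  LONDON \n")

cleaned = []
for row in csv.DictReader(raw):
    cleaned.append({
        "name": row["name"].strip(),            # trim surrounding spaces
        "city": row["city"].strip().title(),    # normalize casing
    })
# cleaned == [{"name": "Alice", "city": "New York"},
#             {"name": "Bob", "city": "London"}]
```

For larger files or Excel workbooks, the same pattern is usually expressed with pandas (`read_csv`/`read_excel` plus vectorized string methods), but the load-clean-transform shape is identical.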

Uploading SQL data into Azure Blob Storage using SSIS

In this article, you will learn how to configure the Azure Blob Upload task of SQL Server Integration Services to upload the output of a SQL query, stored in an Excel file, to an Azure Blob Storage container.

Table Design Best Practices for ETL

Not so long ago, the approach to table design in source systems (application databases) used to be: we don’t care about ETL. Figure it out; we’ll concentrate on building the application.

Amazon Announces the General Availability of AWS Glue 2.0

AWS Glue is a fully-managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. With AWS Glue, customers don’t have to provision or manage any resources, and only pay for resources when the service is running.