Data Lake Change Data Capture (CDC) using Amazon Database Migration Service

Easily capture changes made to your database over time and land them in a data lake using Amazon Database Migration Service (DMS).

Over the past 10 years I have spent in the big data and analytics world, I have come to realize that capturing and processing change data sets is a challenging area. Through all these years I have seen how CDC has evolved. Let me take you through the journey:

Year 2011–2013: For many, Hadoop was the major data analytics platform. Typically, Sqoop was used to transfer data from a given database to HDFS. This worked pretty well for full table loads. Sqoop's incremental mode could capture inserts as well.

But CDC is not only about inserts. Where are my updates and deletes?
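To make the gap concrete, here is a minimal sketch of an append-style incremental pull, assuming the import selects rows whose key is greater than the last value already loaded. It is a plain Python simulation, not Sqoop itself; the table contents and the `incremental_append` helper are hypothetical.

```python
def incremental_append(source_rows, last_value):
    """Return copies of rows whose id is greater than last_value."""
    return [dict(row) for row in source_rows if row["id"] > last_value]


# Hypothetical source table and the initial full load into the target.
source = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
target = incremental_append(source, last_value=0)

# The source then sees one insert, one update, and one delete.
source.append({"id": 3, "name": "carol"})      # insert
source[0] = {"id": 1, "name": "alicia"}        # update of an existing row
source = [r for r in source if r["id"] != 2]   # delete

# The next incremental run only looks past the highest id already loaded.
last_value = max(row["id"] for row in target)
new_rows = incremental_append(source, last_value)
target.extend(new_rows)
```

After the second run the target has picked up the new row, but it still holds the old name for id 1 and still contains the deleted row with id 2.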

Year 2016: We created a strategy to capture updates and deletes using triggers on the database tables that wrote each change to a shadow table. Once the changed data was captured, we would use Sqoop to transfer it over to HDFS. The method required database modifications, so a lot of our clients objected to it.
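The trigger approach can be sketched roughly as follows. This is an illustration only, using an in-memory SQLite database from the Python standard library rather than the databases we actually worked with; the `customers` and `customers_shadow` tables, the trigger names, and the `op` codes are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

-- Shadow table: one row per captured change.
CREATE TABLE customers_shadow (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op        TEXT NOT NULL,      -- 'U' for update, 'D' for delete
    id        INTEGER NOT NULL,
    name      TEXT
);

-- Record the new values whenever a row is updated.
CREATE TRIGGER customers_after_update AFTER UPDATE ON customers
BEGIN
    INSERT INTO customers_shadow (op, id, name) VALUES ('U', NEW.id, NEW.name);
END;

-- Record the old values whenever a row is deleted.
CREATE TRIGGER customers_after_delete AFTER DELETE ON customers
BEGIN
    INSERT INTO customers_shadow (op, id, name) VALUES ('D', OLD.id, OLD.name);
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'alice'), (2, 'bob')")
conn.execute("UPDATE customers SET name = 'alicia' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

# The shadow table now holds the changes, ready to be exported in order.
changes = conn.execute(
    "SELECT op, id, name FROM customers_shadow ORDER BY change_id"
).fetchall()
```

The sketch also shows why clients pushed back: the extra table and the triggers live inside the source database and run on every write.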

Year 2015–2016: A new open source project called Debezium was gaining traction. For several years thereafter we used this CDC tool very effectively. Initially, Debezium supported only a limited number of databases, but that was enough to cover most of our use cases.

Debezium reads the database binary log and extracts the changes from it. It publishes each change as a JSON document to Kafka.

Image by author: record before and after image
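As a rough sketch of how a consumer might use such a record, the snippet below applies change events to an in-memory copy of a table. It assumes each event has already been unwrapped to a payload carrying `before`, `after`, and an `op` code (`c` create, `r` snapshot read, `u` update, `d` delete); the `apply_change_event` helper, the `id` key, and the sample rows are hypothetical.

```python
import json


def apply_change_event(table, event_json, key="id"):
    """Apply one change event (a JSON string) to a dict keyed by primary key."""
    event = json.loads(event_json)
    op = event["op"]
    if op in ("c", "r", "u"):
        # Inserts, snapshot reads, and updates carry the new row in "after".
        after = event["after"]
        table[after[key]] = after
    elif op == "d":
        # Deletes carry the removed row in "before".
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return table


# Hypothetical current state and two incoming events.
table = {1: {"id": 1, "name": "alice"}, 2: {"id": 2, "name": "bob"}}
events = [
    json.dumps({"op": "u",
                "before": {"id": 1, "name": "alice"},
                "after": {"id": 1, "name": "alicia"}}),
    json.dumps({"op": "d",
                "before": {"id": 2, "name": "bob"},
                "after": None}),
]
for event_json in events:
    apply_change_event(table, event_json)
```

Because every event carries both images, updates and deletes can be replayed downstream without touching the source schema.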

Year 2016–Now: For AWS cloud deployments we typically use Amazon Database Migration Service (DMS). DMS can read change data sets from on-premises servers or RDS and publish them to many destinations, including S3, Redshift, Kafka and Elasticsearch.
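A DMS replication task is told which tables to capture through a JSON table-mapping document. The snippet below is a minimal sketch of building one selection rule in Python; the `build_table_mappings` helper, the `sample_db` schema name, and the task identifier and ARN placeholders in the comments are assumptions for illustration, not values from a real setup.

```python
import json


def build_table_mappings(schema_name, table_name="%"):
    """Build a table-mapping JSON string with one 'include' selection rule."""
    rules = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "1",
                "object-locator": {
                    "schema-name": schema_name,
                    "table-name": table_name,  # "%" matches every table
                },
                "rule-action": "include",
            }
        ]
    }
    return json.dumps(rules)


table_mappings = build_table_mappings("sample_db")

# With boto3 installed and AWS credentials configured, the mappings could be
# passed to DMS roughly like this (every ARN below is a placeholder):
#
# import boto3
# boto3.client("dms").create_replication_task(
#     ReplicationTaskIdentifier="sample-cdc-task",
#     SourceEndpointArn="<source-endpoint-arn>",
#     TargetEndpointArn="<s3-target-endpoint-arn>",
#     ReplicationInstanceArn="<replication-instance-arn>",
#     MigrationType="full-load-and-cdc",
#     TableMappings=table_mappings,
# )
```

The `full-load-and-cdc` migration type asks DMS to copy the existing rows first and then keep streaming ongoing changes to the target.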

Let me show you how to create a sample CDC pipeline. We will start by creating an RDS database on AWS, then create a sample database, and finally set up Amazon DMS to perform change data capture to S3.

Let's start by downloading a sample data file
