In *Kafka Connect on Kubernetes, the easy way!*, I demonstrated [Kafka Connect](https://kafka.apache.org/documentation/#connect) on Kubernetes using [Strimzi](http://strimzi.io/) along with the File source and sink connectors. This blog will showcase how to build a simple data pipeline with MongoDB and Kafka using the MongoDB Kafka connectors, which will be deployed on Kubernetes with Strimzi.

I will be using the following Azure services:

  • Azure Event Hubs (as the Kafka broker)
  • Azure Cosmos DB (with the MongoDB API enabled)
  • Azure Kubernetes Service (to run Kafka Connect and the connectors)

Please note that there are no hard dependencies on these components; the solution should work with alternatives as well.

In this tutorial, the Kafka Connect components are deployed to Kubernetes, but the approach is applicable to any Kafka Connect deployment.

What’s covered?

  • MongoDB Kafka Connector and Strimzi overview
  • Azure-specific components (optional): Azure Event Hubs, Azure Cosmos DB, and Azure Kubernetes Service
  • Setting up and operating the Source and Sink connectors
  • Testing the end-to-end scenario

Overview

Here is an overview of the different components:

I have used a deliberately simple example in order to focus on the plumbing and the moving parts.

MongoDB Kafka Connector(s)

The MongoDB Kafka Connect integration provides two connectors: Source and Sink.

  • Source connector: pulls data from a MongoDB collection (that acts as a source) and writes it to a Kafka topic
  • Sink connector: processes the data in Kafka topic(s) and persists it to another MongoDB collection (that acts as a sink)

These connectors can be used independently as well, but in this blog we will use them together to stitch the end-to-end solution.
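To make this concrete, here is a minimal sketch of what a declaratively managed source connector can look like with Strimzi's `KafkaConnector` custom resource. The cluster label, connection URI, database, collection, and topic prefix below are placeholder assumptions, and the `apiVersion` depends on your Strimzi release:

```bash
# Sketch: declare the MongoDB source connector as a Strimzi KafkaConnector resource.
# All names and the connection URI below are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: mongodb-source-connector
  labels:
    # must match the name of the KafkaConnect cluster resource
    strimzi.io/cluster: my-connect-cluster
spec:
  class: com.mongodb.kafka.connect.MongoSourceConnector
  tasksMax: 1
  config:
    connection.uri: "mongodb://<user>:<password>@<host>:<port>/?ssl=true"
    database: my_source_db
    collection: my_source_collection
    topic.prefix: mongo
    publish.full.document.only: "true"
EOF
```

The sink connector is declared the same way, using `com.mongodb.kafka.connect.MongoSinkConnector` as the class, with `topics` plus the destination `database` and `collection` in its config.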

Strimzi overview

Strimzi simplifies the process of running Apache Kafka in a Kubernetes cluster by providing container images and Operators for running Kafka on Kubernetes. It is a part of the Cloud Native Computing Foundation as a [Sandbox](https://www.cncf.io/sandbox-projects/) project (at the time of writing).

Strimzi Operators are fundamental to the project. These Operators are purpose-built with specialist operational knowledge to effectively manage Kafka. Operators simplify the process of:

  • Deploying and running Kafka clusters and components
  • Configuring and securing access to Kafka
  • Upgrading and managing Kafka
  • Managing topics and users
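For example, once the Strimzi Cluster Operator is installed (covered below), spinning up a Kafka Connect cluster is just a matter of applying a custom resource. Here is a minimal sketch, where the name, replica count, and bootstrap address are assumptions; in practice you would also configure authentication and an image containing the MongoDB connector JARs:

```bash
# Sketch: deploy a Kafka Connect cluster through the Strimzi operator
cat <<'EOF' | kubectl apply -f -
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect-cluster
  annotations:
    # lets connectors be managed declaratively via KafkaConnector resources
    strimzi.io/use-connector-resources: "true"
spec:
  replicas: 1
  # placeholder; for Azure Event Hubs this is <namespace>.servicebus.windows.net:9093
  bootstrapServers: my-kafka-bootstrap:9092
EOF
```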

Prerequisites

kubectl - https://kubernetes.io/docs/tasks/tools/install-kubectl/

If you choose to use Azure Event Hubs, Azure Kubernetes Service or Azure Cosmos DB you will need a Microsoft Azure account. Go ahead and sign up for a free one!

Azure CLI or Azure Cloud Shell - you can either choose to install the Azure CLI if you don’t have it already (should be quick!) or just use the Azure Cloud Shell from your browser.

Helm

I will be using Helm to install Strimzi. Here is the documentation to install Helm itself - https://helm.sh/docs/intro/install/
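The installation itself is just a couple of commands; here is a sketch assuming a `kafka` namespace (the release and namespace names are arbitrary):

```bash
# add the Strimzi chart repository and install the Cluster Operator
helm repo add strimzi https://strimzi.io/charts/
helm repo update
kubectl create namespace kafka
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator --namespace kafka

# confirm that the operator pod is running
kubectl get pods --namespace kafka
```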

Let’s start by setting up the required Azure services. (If you’re not using Azure, skip this section, but please ensure you have the details for your Kafka cluster, i.e. broker URLs and authentication credentials, if applicable.)

Azure Cosmos DB

You need to create an Azure Cosmos DB account with the MongoDB API support enabled, along with a Database and a Collection. You can follow the steps in the Azure portal to set this up.
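If you prefer the command line, here is an equivalent sketch using the Azure CLI; the resource group, account, database, and collection names are placeholders (Cosmos DB account names must be globally unique):

```bash
# create a resource group and a Cosmos DB account with the MongoDB API
az group create --name my-resource-group --location eastus
az cosmosdb create --name my-cosmos-account \
    --resource-group my-resource-group --kind MongoDB

# create a database and a collection inside it
az cosmosdb mongodb database create --account-name my-cosmos-account \
    --resource-group my-resource-group --name my_database
az cosmosdb mongodb collection create --account-name my-cosmos-account \
    --resource-group my-resource-group --database-name my_database --name my_collection
```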
