With the massive adoption of Apache Kafka, enterprises are looking for ways to replicate data across different sites. Kafka has its own internal replication and self-healing mechanisms, but these are confined to the local cluster and cannot tolerate the failure of an entire site. The solution is the “Mirror Maker” feature: with it, your local Kafka cluster can be replicated asynchronously to an external/central Kafka cluster located at a whole different site, in order to preserve your data pipelines, log collection, and metrics gathering processes.
The “Mirror Maker” connects two clusters, consuming from one (the source) and producing to the other (the target). Topics are replicated as logical entities, together with everything they hold, into the target cluster, where an application can consume the transferred data. Mirror Maker is horizontally scalable, meaning it can be scaled out whenever it becomes the bottleneck.
In this article, we will use the AMQ Streams operator to deploy Kafka on a stretched OpenShift cluster (where the nodes are spread across different sites), and we’ll mirror all the messages written to the source cluster into the target cluster using the “Mirror Maker” feature. In addition, we’ll use OCS RBD to persist the Kafka logDirs, demonstrating that OCS is topology agnostic and can serve nodes from different zones in the same cluster.
Finally, we’ll trace the response time of the whole pipeline using Jaeger, so we can see the latency contributed by each component in the replication pipeline.
Game On!
Let’s start by creating a new project for this demo:
$ oc new-project amq-streams
After we have the project set up, let’s install the AMQ Streams operator in the amq-streams project, and the Jaeger operator watching all of the cluster’s namespaces:
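Once the Jaeger operator is running, a Jaeger instance is created from a custom resource. A minimal sketch, assuming the all-in-one deployment strategy for a demo setup (the instance name here is an illustrative assumption, not taken from this article):

```yaml
# Hypothetical Jaeger instance using the all-in-one strategy.
# Suitable for a demo; production setups would typically use a
# persistent storage backend instead of in-memory storage.
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-jaeger
spec:
  strategy: allInOne
```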
Now that we have our operators installed, we can start creating the custom resources that will deploy our environment. First, let’s create our two clusters, where the europe-cluster is the source cluster and the us-cluster is the target one. Each cluster will use OCS RBD to persist its written data.
$ oc create -f - <<EOF
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: europe-cluster
spec:
  kafka:
    version: 2.4.0
    replicas: 3
    listeners:
      plain: {}
      tls: {}
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      log.message.format.version: "2.4"
    storage:
      type: persistent-claim
      size: 20Gi
      deleteClaim: true
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      deleteClaim: true
  entityOperator:
    topicOperator: {}
    userOperator: {}
---
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: us-cluster
spec:
  kafka:
    version: 2.4.0
    replicas: 3
    listeners:
      plain: {}
      tls: {}
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      log.message.format.version: "2.4"
    storage:
      type: persistent-claim
      size: 20Gi
      deleteClaim: true
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      deleteClaim: true
  entityOperator:
    topicOperator: {}
    userOperator: {}
EOF
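Since we enabled the Topic Operator in both clusters, topics on the source cluster can also be managed declaratively. As a hedged illustration (the topic name, partition count, and replica count below are assumptions, not values from this article):

```yaml
# Hypothetical topic on the source cluster, managed by the Topic Operator.
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
  name: my-topic
  labels:
    strimzi.io/cluster: europe-cluster   # binds the topic to the source cluster
spec:
  partitions: 3
  replicas: 3
```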
Now let’s verify that our clusters were indeed created and that they claimed the requested storage from our OCS cluster:
$ oc get pods
NAME                                                  READY   STATUS    RESTARTS   AGE
amq-streams-cluster-operator-v1.5.0-f9dc58f75-bqbm8   1/1     Running   0          3m23s
europe-cluster-entity-operator-5b5f7d44f7-57dbj       3/3     Running   0          37s
europe-cluster-kafka-0                                2/2     Running   0          87s
europe-cluster-kafka-1                                2/2     Running   0          87s
europe-cluster-kafka-2                                2/2     Running   0          87s
europe-cluster-zookeeper-0                            1/1     Running   0          2m29s
europe-cluster-zookeeper-1                            1/1     Running   0          2m29s
europe-cluster-zookeeper-2                            1/1     Running   0          2m29s
us-cluster-entity-operator-84fbbf445f-k5kjz           3/3     Running   0          35s
us-cluster-kafka-0                                    2/2     Running   0          95s
us-cluster-kafka-1                                    2/2     Running   0          95s
us-cluster-kafka-2                                    2/2     Running   0          95s
us-cluster-zookeeper-0                                1/1     Running   0          2m29s
us-cluster-zookeeper-1                                1/1     Running   0          2m29s
us-cluster-zookeeper-2                                1/1     Running   0          2m29s
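With both clusters up, the Mirror Maker itself is also deployed through a custom resource, consuming from europe-cluster and producing to us-cluster. A minimal sketch, assuming the plain (non-TLS) listeners and mirroring every topic; the resource name, consumer group id, and replica count are illustrative assumptions:

```yaml
# Hypothetical KafkaMirrorMaker: consumes from the source cluster
# and produces the same messages into the target cluster.
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaMirrorMaker
metadata:
  name: mirror-maker
spec:
  version: 2.4.0
  replicas: 1
  consumer:
    bootstrapServers: europe-cluster-kafka-bootstrap:9092   # source cluster
    groupId: mirror-maker-group
  producer:
    bootstrapServers: us-cluster-kafka-bootstrap:9092       # target cluster
  whitelist: ".*"   # mirror all topics
```

The replicas field is what makes Mirror Maker horizontally scalable: increasing it spreads the mirroring work across more consumer instances in the same group.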