Why you should consider Pulsar over Kafka with examples.

Introduction

Recently, I’ve been looking at Pulsarand how it compares to Kafka. A quick search will show you that there is a current “war” between the two most famous open source messaging systems.

As a Kafka user, I do struggle with some of the issues with Kafka and I’m very exited about Pulsar. So finally, I managed to have some time to play around with it and I did quite a lot of research. In this article, I will focus on the Pulsar advantages and give you some reasons why you should consider it over Kafka. But let’s be clear, in terms of production usage, support, community, documentation, etc; Kafka clearly surpasses Pulsar, and I would only consider Pulsar if most of the advantages discussed in this article hold true to your use case. Let’s begin!

Kafka in a Nutshell

Kafka is the king of messaging systems. Created by LinkedIn in 2011, it has spread widely thanks to the support of Confluent who has released to the open source community many new features and add-ons such as Schema Registry for schema evolution, Kafka Connect for easy streaming from other data sources such as databases to Kafka, Kafka Streams for distributed stream processing, and most recently KSQL for performing SQL-like querying over Kafka topics and much more. It has also many connectors to many system, check Confluent Platform for more details.

Kafka is fast, easy to setup, extremely popular and can be used for a wide range or use cases. While Apache Kafka has always been friendly from a developer’s point of view, it has been something of a mixed bag operationally. So, lets’ review some the pain points of Kafka.

Image for post

Kafka example. Source: https://talks.rmoff.net/pZC6Za/slides

Kafka pain points

  • Scaling Kafka is tricky, this is due to the coupled architecture where brokers also store data. Spinning off another broker means it has to replicate the topic partitions and replicas, which is time consuming.
  • No native multi-tenancy with complete isolation of tenants .
  • Storage can become quite expensive, and although you can store data for a long period of time, it is rarely used because of the cost implications.
  • It is possible to loose messages in case replicas are out of sync.
  • You must plan and calculate number of brokers, topics, partitions and replicas ahead of time (that fits planned future usage growth) to avoid scaling problems, this is extremely difficult.
  • Working with offsets could be complicated if you just need a messaging system.
  • Cluster re-balancing can impact the performance of connected producers and consumers.
  • MirrorMaker Geo replication mechanism is problematic. Companies such Uber have created their own solution to overcome these issues.

As you can see, most of the problems are related to the operational aspects. Although, it is relative easy to setup up, Kafka is difficult to manage and tune. Also, it is not quite as flexible and resilient as it could be.

Pulsar in a Nutshell

Pulsar was created by Yahoo in 2013 and donated to the Apache foundation in 2016. Pulsar is now an Apache top level project. Yahoo, Verizon, Twitter among other companies use it in production to process millions of messages. It has many features and it is very flexible. It claims to be faster than Kafka and hence cheaper to run. It aims to solve most of the pain points of Kafka making it easier to scale.

Pulsar is very flexible; it can act as a distributed log like Kafka or a pure messaging system like RabbitMQ. It has multiple types of subscriptions, several delivery guarantees, retention policies and several ways to deal with schema evolution. It also has a big list of features…

Image for post

Pulsar architecture: https://pulsar.apache.org/docs/en/concepts-architecture-overview/

Pulsar Features

  • Multi-tenancy is built in, different teams can use the same cluster and be isolated. This solves many administration headaches. It supports isolation, authentication, authorization and quotas.
  • Multi tier architecture: Pulsar stores all the topic data in a specialized data layer powered by Apache BookKeeperas data ledgers. Separation of storage and messaging solves many issues with scaling, rebalancing and maintaining the cluster. It also improves reliability and makes almost impossible to loose data. Also, when reading the data, you can connect directly to Bookeeper without affecting the real time ingestion. For example, you can use Presto to execute SQL queries on your topics, similar to KSQL but with the peace of mind that this will not affect real time data processing.
  • Virtual topics. Because of the n-tier architecture, there is no limitation on the number of topics, topics and its storage are decoupled. You can also create non persistent topics.
  • N-tier storage. One problem of Kafka, is that storage can become expensive. So it is rarely used to store “cold” data and messages are often deleted.Apache Pulsar, through Tiered Storage, can automatically move older data to Amazon S3, or any other deep storage system; and still present a transparent view back to the client; the client can read from the start of time just as if all of the messages were present in the log.
  • Pulsar Functions. Easy to deploy, lightweight compute process, developer-friendly APIs, no need to run your own stream processing engine like Kafka.
  • Security: It has a built in proxy, multi tenant security, plug-able authentication and much more.
  • Fast Re-balancing. Partitions are split into segments that are easy to re balance.
  • Server side de-duplication and dead lettering. No need to do this in the client, also de duplication can be done during compaction.
  • Built in Schema registry. Supports multiple strategies and it is very easy to use.
  • Geo replication and built in discovery. It is very easy to replicate your cluster to multiple regions.
  • Integrated load balancer and Prometheus metrics.
  • Multiple integrations: Kafka, RabbitMQ and much more.
  • Support for many programming languages such GoLang, Java, Scala, Node, Python…
  • Clients do not need to be aware of shards and data partition, this is done transparently on the server side.

Image for post

List of features: https://pulsar.apache.org/

As you can see, Pulsar is has lots of interesting features.

#microservices #kafka #aws #pulsar #big-data

Pulsar Advantages Over Kafka
5.55 GEEK