My team, Expedia Group™ Commerce Data, needed to join events arriving on two (and, in the future, more) Kafka topics to provide a real-time stream view of our bookings. This is a pretty standard requirement, but our team was not very experienced with Kafka Streams, and we had a few wrinkles that made an “out of the box” Kafka Streams join less attractive than dropping down to the Processor API.
What we needed, in a nutshell, was to join those event streams into a single, real-time view of each booking.
There are two approaches to writing a Kafka Streams application: the high-level DSL and the lower-level Processor API.
Developers prefer the DSL for most situations because it simplifies common use cases and lets you accomplish a lot with very little code. But you sacrifice some control when using the DSL. There’s a certain amount of magic going on under the covers, hidden by the KStream and KTable abstractions. And the out-of-the-box joins available between these abstractions may not fit all use cases.
The most common way I see the DSL characterized is as “expressive,” which just means “hides lots of stuff from you.” Sometimes explicit is better. And for some (like me), the “raw” Processor API just seems to fit my brain better than the DSL abstractions.
Processor API

Most documentation I found around Kafka Streams leans towards using the DSL (Confluent docs state “it is recommended for most users”), but the Processor API has a much simpler interface than the DSL in many respects. You still build a stream topology, but only out of three kinds of nodes: Source nodes (to read from Kafka topics), Sink nodes (to write to Kafka topics), and Processor nodes (to do stuff to Kafka events flowing through your topology). Plus, the DSL is built on top of the Processor API, so if it’s good enough for the DSL, it should be good enough for our humble project (in fact, as a Confluent engineer says, “the DSL compiles down to the Processor API”).
Processor nodes have to implement Processor, which has a process method you override that takes the key and the value of the event traversing your Kafka Streams topology. Processors also have access to a ProcessorContext object, which contains useful information about the current event being processed (like which topic and partition it was consumed from) and a forward method that is used to send the event to a downstream node in your topology.
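A minimal processor along those lines might look like the following sketch, assuming the classic (pre-3.0) Processor API with its process(key, value) signature; the class name PassThroughProcessor and the String key/value types are invented for illustration.

```java
import org.apache.kafka.streams.processor.AbstractProcessor;

// A minimal classic-API processor that inspects the current event's
// metadata and forwards the event unchanged to downstream nodes.
public class PassThroughProcessor extends AbstractProcessor<String, String> {

    @Override
    public void process(String key, String value) {
        // The ProcessorContext (via context()) exposes metadata about the
        // event being processed, e.g. the topic and partition it came from.
        String sourceTopic = context().topic();
        int sourcePartition = context().partition();

        // forward() sends the (possibly transformed) event on to the
        // downstream node(s) wired to this processor in the topology.
        context().forward(key, value);
    }
}
```

In a topology, a processor like this would be registered with Topology.addProcessor, which names the node and wires it to its parent source nodes.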
To illustrate the difference, here’s a comparison of doing a repartition on a stream in the DSL and the Processor API.
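As a rough sketch of that comparison (assuming the classic Kafka 2.x Processor API; the topic names bookings and bookings-rekeyed, and the booking-id key extraction, are invented for illustration):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.AbstractProcessor;

public class RepartitionComparison {

    // DSL version: rekey the stream and let Kafka Streams create and
    // read back an internal repartition topic behind the scenes.
    public static Topology dslTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("bookings")
               .selectKey((key, value) -> extractBookingId(value))
               .repartition();
        return builder.build();
    }

    // Processor API version: every step is explicit -- a source, a
    // rekeying processor, a sink writing to a named intermediate topic,
    // and a second source reading the rekeyed events back.
    public static Topology processorApiTopology() {
        Topology topology = new Topology();
        topology.addSource("Source", "bookings")
                .addProcessor("Rekey", RekeyProcessor::new, "Source")
                .addSink("Sink", "bookings-rekeyed", "Rekey")
                .addSource("RekeyedSource", "bookings-rekeyed");
        return topology;
    }

    // Classic-API processor that forwards each event under its new key.
    private static class RekeyProcessor extends AbstractProcessor<String, String> {
        @Override
        public void process(String key, String value) {
            context().forward(extractBookingId(value), value);
        }
    }

    private static String extractBookingId(String value) {
        return value; // placeholder key-extraction logic
    }
}
```

The DSL version is shorter, but the Processor API version makes the intermediate topic and the rekeying step visible in the topology, which is exactly the kind of explicitness discussed above.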
#kafka-streams #kafka #streaming #data-science