My team, Expedia Group™ Commerce Data, needed to join events coming in on two (and more in the future) Kafka topics to provide a realtime stream view of our bookings. This is a pretty standard requirement, but our team was not very experienced with Kafka Streams, and we had a few wrinkles that made going with an “out of the box” Kafka Streams join less attractive than dropping down to the Processor API.

What we needed, in a nutshell, was to:

  • Join two or more events,
  • Repartition one event to extract the proper join key,
  • Report on unjoined events,
  • Possibly purge orphaned events to a dead letter topic,
  • Configurable no set join window (for expiration of unjoined events),
  • Oh, and with Kafka Streams newbies at the helm.

Processor API vs DSL

There are two approaches to writing a Kafka Streams application:

Developers prefer the DSL for most situations because it simplifies some common use cases and lets you accomplish a lot with very little code. But you sacrifice some control when using the DSL. There’s a certain amount of magic going on under the covers that’s hidden by the KStream and KTable abstractions. And the out-of-the-box joins available between these abstractions may not fit all use cases.

The most common way I see the DSL characterized is as “expressive,” which just means “hides lots of stuff from you.” Sometimes explicit is better. And for some (like me), the “raw” Processor API just seems to fit my brain better than the DSL abstractions.

Don’t fear the Processor API

Most documentation I found around Kafka Streams leans towards using the DSL (Confluent docs state “it is recommended for most users”), but the Processor API has a much simpler interface than the DSL in many respects. You still build a stream topology, but you only use Source nodes (to read from Kafka topics), Sink nodes (to write to Kafka topics), and Processor nodes (to do stuff to Kafka events flowing through your topology). Plus the DSL is built on top of the Processor API, so if it’s good enough for the DSL, it should be good enough for our humble project (in fact, as a Confluent engineer says, “the DSL compiles down to the Processor API”).

Processor nodes have to implementProcessor, which has a process method you override which takes the key and the value of the event that is traversing your Kafka Streams topology. Processors also have access to aProcessorContext object which contains useful information on the current event being processed (like what topic & partition it was consumed from) and a forward method that is used to send the event to a downstream node in your topology.

To illustrate the difference, here’s a comparison of doing a repartition on a stream in the DSL and the Processor API.

#kafka-streams #kafka #streaming #data-science

Stateful Joins with the Kafka Streams Processor API
3.70 GEEK