A few months ago, we started a new project: a microservice that had to handle many concurrent connections. We predicted that the application would spend much of its time waiting for multiple parallel I/O operations, so a non-blocking approach looked like the ideal architecture. After a short investigation, we decided to use Spring WebFlux as the main framework, because it is built on a non-blocking stack and promises great scalability and flexibility. The stream abstraction looked really convenient to use, much more so than traditional non-blocking code based on Java NIO or Netty. That’s why we attempted to implement the service with Spring WebFlux. However, once our development work began, trouble soon followed.


New mindset

If you’re accustomed to writing code in a traditional, sequential way, you’ll probably feel a little lost. That was our impression when we started the project. Sometimes, simple things seemed difficult. Most surprising of all, even experienced Java developers had to reinvent the wheel. It was a new programming paradigm for them, and it wasn’t easy to switch to after many years of programming in a different way. The truth is that reactive streams simplify some tasks a lot, but make others much more complicated. Moreover, code written as a sequence of operators, like map or flatMap, is not as easy to understand as a traditional, line-by-line sequence of blocking method calls. You cannot simply extract an arbitrary piece of code into a private method to make it more readable; you have to extract a sequence of operators on the stream instead.
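To make that last point concrete, here is a minimal sketch of what "extracting a sequence of operators" looks like with Reactor. The class, method, and values are our own invented example; the key idea is that the reusable piece is a function over the whole stream, plugged in with transform(...), not a plain private method you can call line by line:

```java
import java.util.List;
import java.util.function.Function;

import reactor.core.publisher.Flux;

public class PipelineExtraction {

    // The extracted "piece of code": not a blocking helper method,
    // but a reusable sequence of operators applied to the whole stream.
    static Function<Flux<Integer>, Flux<Integer>> doubleAndKeepLarge() {
        return flux -> flux
                .map(i -> i * 2)       // transform each element
                .filter(i -> i > 2);   // drop small results
    }

    public static void main(String[] args) {
        List<Integer> result = Flux.range(1, 3)
                .transform(doubleAndKeepLarge()) // plug the extracted operators in
                .collectList()
                .block(); // blocking here only for the demo
        System.out.println(result); // [4, 6]
    }
}
```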

My observation is that reactive streams are less painful for open-minded developers. Although the learning curve is definitely steep, I think it’s worth investing the time.

Logging contextual information

One of our technical requirements was traceability of conversations between services. The Correlation ID pattern was exactly what we needed. In a nutshell: each incoming message has a unique text value attached, a conversation identifier generated somewhere outside of our service. In our case, we have two types of incoming message:

  • An HTTP request.
  • A message consumed from an SQS (Amazon Simple Queue Service) queue.

To provide traceability, we have to include the Correlation ID in each log line printed by code executing in the context of a given conversation. If every service taking part in the conversation prints the Correlation ID, it’s very easy to find all logs related to it. A log aggregator helps a lot, because it collects all of them in a single place and provides a convenient search capability.

The most commonly used logging libraries implement the Mapped Diagnostic Context (MDC) feature, which fitted our needs perfectly.

In the servlet world, we would write a filter that handles each incoming HTTP request. The filter would put the Correlation ID into the MDC before the request handlers (i.e. Spring controllers) start, and clean up the map when the request processing is finished. It would work perfectly, but bear in mind that MDC uses a Thread Affinity pattern (implemented with the ThreadLocal class in Java) underneath, holding contextual information associated with a single thread. This means it can only be used when a single thread handles and processes the request from beginning to end. Unfortunately, Spring WebFlux does not work that way: it’s very likely that more than one thread takes part in processing a single request. So what can we do? Thankfully, the authors of the Reactor library implemented another mechanism to hold contextual information, the Context. It can be used to transfer the Correlation ID between the stream’s operators. To link it with the well-known MDC of logging libraries, we used the solution described here: https://www.novatec-gmbh.de/en/blog/how-can-the-mdc-context-be-used-in-the-reactive-spring-applications/. If you’re not a fan of the Kotlin language, check out a similar implementation in Java: https://github.com/archie-swif/webflux-mdc. We implemented this idea and it works perfectly! We just had to create a custom WebFilter that reads the Correlation ID from the HTTP request and sets it in the reactive stream’s Context. We spotted one drawback: stack traces became much longer, due to implicit calls to MdcContextLifter.
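At the heart of that solution is the Reactor Context. Stripped of the Spring WebFilter and MDC plumbing, the mechanism looks roughly like the sketch below (the key name correlationId, the class name, and the sample values are our own; in the real service a WebFilter would call contextWrite with the ID taken from an HTTP header):

```java
import reactor.core.publisher.Mono;
import reactor.util.context.Context;

public class CorrelationIdDemo {

    // Reads the Correlation ID from the subscriber Context instead of a
    // ThreadLocal, so it survives thread switches between operators.
    static Mono<String> logLine(String message) {
        return Mono.deferContextual(ctx ->
                Mono.just("[" + ctx.getOrDefault("correlationId", "none") + "] " + message));
    }

    public static void main(String[] args) {
        String line = logLine("payment accepted")
                // The Context is attached at subscription time and flows
                // upstream through the whole operator chain.
                .contextWrite(Context.of("correlationId", "abc-123"))
                .block();
        System.out.println(line); // [abc-123] payment accepted
    }
}
```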

Very long stack trace

The beauty of reactive streams is that they form a chain of declarations telling you what happens to each element emitted by the publisher. This abstraction allows for a clear implementation of complex logic. If the code works as expected, everything looks shiny. You feel proud that you’ve solved a non-trivial problem with just a few lines of code. The happiness lasts until someone sends you a stack trace from production. The vast majority of the lines you see start with reactor.core.publisher (plus others from the package containing the MdcContextLifter class, if you implemented the workaround for contextual logging mentioned in the previous section). They don’t tell you anything, because they come from packages unrelated to your code. I would say they are useless.

Moreover, I recall a very serious problem we encountered. Users complained that the application occasionally returned errors, but we couldn’t figure out why. The logs didn’t contain anything interesting, although we knew that the application catches every conceivable error and logs it at error level with a stack trace. So what was the problem? We used Papertrail as the log aggregator. The forwarder, which pushes application logs to Papertrail, has a limit on the maximum size of a log entry, set to 100,000 bytes. That sounds like more than enough, but it wasn’t in our case. Our stack traces were much bigger and full of irrelevant information, like the aforementioned method calls in the reactor.core.publisher package. If the Papertrail forwarder sees a message longer than the limit, it simply skips it. That’s why the most critical log entries were just ignored. So what did we do? We implemented an extension to Logback which drops lines from packages starting with reactor.core.publisher and trims the stack trace to 90,000 characters.
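The filtering itself is plain string processing. The sketch below shows just the core logic under our assumptions (the real extension plugs into Logback, e.g. as a custom throwable converter; the class name, limit constant, and sample stack trace here are our own):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class StackTraceTrimmer {

    static final int MAX_LENGTH = 90_000;

    // Drops frames from reactor.core.publisher and caps the total size,
    // so the log forwarder no longer skips oversized entries.
    static String trim(String stackTrace) {
        String filtered = Arrays.stream(stackTrace.split("\n"))
                .filter(line -> !line.trim().startsWith("at reactor.core.publisher"))
                .collect(Collectors.joining("\n"));
        return filtered.length() <= MAX_LENGTH
                ? filtered
                : filtered.substring(0, MAX_LENGTH);
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "java.lang.IllegalStateException: boom",
                "\tat com.example.PaymentService.charge(PaymentService.java:42)",
                "\tat reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)",
                "\tat com.example.Handler.handle(Handler.java:17)");
        System.out.println(trim(sample)); // reactor frame is gone, app frames remain
    }
}
```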

My conclusion is that some extra effort is required to build a similar capability yourself, especially if you don’t want to drown in irrelevant stack frames and are serious about reducing the risk of losing critical log entries.

#software-development #reactive-programming #java #backend #spring

Unobvious traps of Spring WebFlux