On one particularly sunny day, not so long ago, we received a notice from some developers that there were a few “ERR_REDIS_NOT_CONNECTED” messages in the logging system — several times per day Redis would reset the connection from one of the services.

As a network engineer, my goal is then to understand: is this a network problem or something else? Until I’m able to point to what’s causing the issue, I can’t rule out that the network is to blame.

To get you up to speed, this is our infrastructure setup: we use OpenStack as our cloud platform, and we’re migrating some services to our own installation of Kubernetes.


TL;DR

It wasn’t a network problem

At first glance, it seemed likely that the problem lay in the networks inside our environments, which is exactly why a network engineer was asked to troubleshoot it.

Step 1: Try to analyze the network data

The analysis turned up no interesting logs, no alerts, no anomalies, and no packet drops. CPU load on the network devices was low, buffers were empty, and the physical networks were far from over-utilized. Microbursts can happen, but TCP usually handles them.

Step 2: Look for network problems on end devices

Nothing interesting turned up here either: no drops, not much traffic, few sessions in general, no logs on the Kubernetes side, and no errors in Calico (our Kubernetes CNI).
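As an illustration of the kind of end-host check described in this step, here is a minimal sketch that reads per-interface error and drop counters from `/proc/net/dev`. It assumes a Linux host; the function names are hypothetical and this is not the tooling we actually used.

```python
from pathlib import Path


def parse_net_dev(text: str) -> dict[str, dict[str, int]]:
    """Parse /proc/net/dev content into per-interface error/drop counters."""
    counters = {}
    # The first two lines are column headers; each remaining line is
    # "iface: <8 receive counters> <8 transmit counters>".
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        fields = [int(v) for v in values.split()]
        if len(fields) < 16:
            continue
        counters[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
        }
    return counters


def interfaces_with_problems(counters: dict[str, dict[str, int]]) -> list[str]:
    """Return names of interfaces whose error or drop counters are non-zero."""
    return sorted(name for name, c in counters.items() if any(c.values()))


if __name__ == "__main__":
    proc = Path("/proc/net/dev")
    if proc.exists():
        print(interfaces_with_problems(parse_net_dev(proc.read_text())))
```

An empty list corresponds to the "no drops" result we saw. Because these counters are cumulative since boot, comparing two readings taken some time apart says more than a single snapshot.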

Step 3: Inspect Redis configuration

This inspection mostly looked for config lines that could affect existing sessions: timeouts, keepalives, session expiration, and so on. We didn’t find anything interesting here either, only the same standard config as our other clusters, which weren’t experiencing such problems. Other services outside Kubernetes didn’t have similar issues either, so Redis most likely wasn’t to blame. Moving along…
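To make the config inspection concrete, here is a small sketch that pulls the connection-lifetime directives out of `redis.conf`-style text. The directives `timeout` and `tcp-keepalive` are real Redis settings. The helper name and the sample values below are illustrative assumptions, not our actual configuration.

```python
# Directives that influence how long an idle client connection may live.
SESSION_DIRECTIVES = ("timeout", "tcp-keepalive")


def session_directives(conf_text: str) -> dict[str, str]:
    """Extract connection-lifetime directives from redis.conf-style text."""
    found = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() in SESSION_DIRECTIVES:
            found[parts[0].lower()] = parts[1].strip()
    return found


# Illustrative sample only; these values are not taken from our clusters.
SAMPLE_CONF = """
# Close the connection after a client is idle for N seconds (0 to disable)
timeout 0
tcp-keepalive 300
maxclients 10000
"""

if __name__ == "__main__":
    print(session_directives(SAMPLE_CONF))
```

In Redis, `timeout 0` means idle clients are never disconnected by the server, and `tcp-keepalive` is the interval in seconds at which the server sends keepalive probes to clients. Running the same extraction against each cluster's config makes it easy to spot any that differ.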

Step 4: Check TCP communication

We checked the TCP keepalives on both sides, just to make sure that an idle session wouldn’t be dropped on its own. We also checked the number of sessions from each pod and on the Redis side: one established session per pod, and everything looked ideal.
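For readers who want to see what "checking keepalives" means at the socket level, the sketch below enables TCP keepalive on a client socket and reads the settings back. It is only an assumption-laden illustration: the function names and timer values are hypothetical, and the per-socket timer options (`TCP_KEEPIDLE`, `TCP_KEEPINTVL`, `TCP_KEEPCNT`) are skipped on platforms that don't expose them under those names.

```python
import socket

# Per-socket keepalive timer options; availability depends on the platform.
TIMER_OPTIONS = ("TCP_KEEPIDLE", "TCP_KEEPINTVL", "TCP_KEEPCNT")


def enable_keepalive(sock: socket.socket, idle: int = 60, interval: int = 10, count: int = 5) -> None:
    """Turn on TCP keepalive and, where the platform allows, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in zip(TIMER_OPTIONS, (idle, interval, count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def keepalive_settings(sock: socket.socket) -> dict:
    """Read back the keepalive state of a socket as a plain dict."""
    settings = {"enabled": sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0}
    for name in TIMER_OPTIONS:
        opt = getattr(socket, name, None)
        if opt is not None:
            settings[name] = sock.getsockopt(socket.IPPROTO_TCP, opt)
    return settings


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print(keepalive_settings(s))
```

With keepalive enabled, an idle but healthy connection exchanges periodic probes. That keeps state alive in any stateful devices on the path and lets each side detect a dead peer, rather than the session silently disappearing.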


Solving a “Simple” Network Logging Error