Kubernetes is an extremely dynamic system. When operating infrastructure in a K8s cluster, we always assume that any pod (or even a node!) might be deleted at any moment. To improve resilience, we test the system using various chaos engineering approaches. For example, we randomly kill Kubernetes nodes to see whether our applications are ready for pod restarts.
For an application to work correctly in such a dynamic Kubernetes environment, it has to follow some basic rules: create PodDisruptionBudgets, run several replicas of the application simultaneously, correctly configure podAffinity, nodeAffinity, and so on.
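To illustrate the first of those rules, a minimal PodDisruptionBudget might look like this (the `myapp` label, resource name, and `minAvailable` value are assumptions made up for the example):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb            # hypothetical name for this example
spec:
  minAvailable: 1            # keep at least one pod running during voluntary disruptions
  selector:
    matchLabels:
      app: myapp             # must match the pod labels of the target Deployment
```

With this in place, a voluntary disruption such as a node drain will refuse to evict the last remaining pod of the application.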
However, despite these obvious rules, we cannot make all our customers apply them at all times. In real life, we often face various difficulties and peculiarities, such as a customer's application that runs as a single replica. It had been running like that for months with hardly any redeployments. The developers carried out these rare redeployments all by themselves, without involving werf (a tool that can do it automatically). At the same time, the registry was configured to automatically delete all old images. The day you need to restart such a pod, you end up in a disaster: the image it runs on may no longer exist.
Recently, we encountered exactly such a case. A routine rescheduling operation in the Kubernetes cluster caused an hour-long downtime while we were looking for someone who could rebuild the application. Frustrated by the situation, we decided to create k8s-image-availability-exporter. The idea is to automate the necessary checks and prevent situations like the one above, regardless of compliance with organizational policies and the presence of other "random" factors.
Its general algorithm is as follows:

1. Periodically (the interval is defined via the --check-period option), we take the next batch of images that haven't been checked for the longest time out of the priority queue.
2. We check whether each of them is still available in the container registry.