Building a basic logging solution for Kubernetes can be as easy as running a couple of commands. However, to support large-scale Kubernetes clusters, the logging solution itself needs to be scalable and reliable.
In my previous blog, I gave an overview of my Kubernetes monitoring and logging solution. At that time, I used a basic setup for logging: logs collected by Fluentd on the Kubernetes nodes are sent directly to an Elasticsearch and Kibana cluster for search and visualization, with both Fluentd and Elasticsearch running on the Kubernetes cluster itself. This is an easy setup that works for small clusters. Once we move to large production clusters, it runs into challenges such as: 1) Fluentd may drop logs (data loss!) if Elasticsearch is down or cannot keep up with indexing the incoming logs. 2) Log input and output are tightly coupled and therefore difficult to manage. 3) Logs are only stored in Elasticsearch, which makes it difficult to extend to other tools, such as Apache Spark for general log processing and analytics.
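To make the first challenge concrete, here is a minimal toy simulation in Python. It is only a sketch, not Fluentd's actual buffer implementation: the function `ship_logs`, the `buffer_limit` parameter, and the drop-oldest policy are all assumptions chosen for illustration. It models a forwarder that holds logs in a small bounded buffer while the downstream store is unavailable.

```python
from collections import deque


def ship_logs(events, buffer_limit, sink_up):
    """Toy model of a log forwarder with a bounded in-memory buffer.

    events:       list of log lines, one arriving per tick
    buffer_limit: max number of lines held while the sink is unavailable
    sink_up:      function (tick -> bool), True when the sink accepts writes
    Returns (delivered, dropped). Hypothetical helper, for illustration only.
    """
    buffer = deque()
    delivered, dropped = [], []
    for tick, event in enumerate(events):
        buffer.append(event)
        if sink_up(tick):
            # Sink is healthy: flush everything that is buffered.
            while buffer:
                delivered.append(buffer.popleft())
        elif len(buffer) > buffer_limit:
            # Sink is down and the buffer is full: the oldest line is lost.
            dropped.append(buffer.popleft())
    return delivered, dropped


events = [f"log-{i}" for i in range(10)]


def outage(tick):
    # Assumed outage window: the sink is down for ticks 2 through 7.
    return tick < 2 or tick > 7


# Small buffer: some lines are lost during the outage.
delivered, dropped = ship_logs(events, buffer_limit=3, sink_up=outage)
print("small buffer dropped:", dropped)

# Buffer large enough to absorb the whole outage: nothing is lost.
delivered_big, dropped_big = ship_logs(events, buffer_limit=10, sink_up=outage)
print("large buffer dropped:", dropped_big)
```

With the small buffer, the lines that arrive early in the outage are discarded before the sink recovers. With a buffer large enough to absorb the outage, every line is eventually delivered, which is the role a durable queue between the collector and the store can play.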
In this blog, I will describe how I addressed these challenges by building a scalable and reliable Kubernetes logging solution with tools such as Fluentd, Kafka, Elasticsearch, Spark and Trino. I will also highlight the role that fast object storage such as FlashBlade S3 plays in this solution. The final architecture is shown below.