Part One: ClickHouse Failures, by Marcel Birkner. ... Fixing The ClickHouse Node Failure On Distributed Systems - A How-To Guide.
Some weeks are more challenging for an SRE than others. Some days you have more time to work on projects; on other days you need to deal with production problems.
Last week was one of those weeks for me.
We had data node failures, had to do root-cause analysis, fix issues and find ways to prevent the same problems from occurring again. It all started with failures in different datastore clusters: ClickHouse, Kafka, Elasticsearch and Cassandra. One thing all of these datastores have in common is that they run in large clusters with data replication enabled across availability zones. Therefore, a single failing node does not impact our customers. Our system is built to handle these kinds of things. Nevertheless, it is critical to resolve these problems so that they do not escalate.
For each component in our architecture we have a set of service level objectives defined as events that are automatically updated with each release. One of these events alerted us via OpsGenie that distributed inserts for a ClickHouse node were piling up.
A little bit of background on ClickHouse. We run several ClickHouse clusters in our regions. Each cluster consists of up to ten shards with two nodes per shard for data replication. We use distributed tables to store the data. Every time a component wants to write data to a distributed table, it sends the data to an arbitrary ClickHouse node. This node then takes care of forwarding the data to the other nodes. Before forwarding, ClickHouse stores the data temporarily on disk, so if the target node cannot be reached, the pending inserts accumulate on the node that received them.
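To make the write path above more concrete, here is a minimal toy model in Python. It is not ClickHouse's implementation: the class name, the CRC32-based sharding rule and the in-memory queues (standing in for the temporary files on disk) are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict, deque


class ToyDistributedNode:
    """Toy model of a node that accepts writes for a distributed table.

    Hypothetical illustration only; names and sharding rule are assumptions.
    """

    def __init__(self, shard_count):
        self.shard_count = shard_count
        # Per-shard spool standing in for the temporary on-disk data.
        self.spool = defaultdict(deque)

    def shard_for(self, key):
        # Assumed sharding rule: stable hash of the key modulo shard count.
        return zlib.crc32(key.encode("utf-8")) % self.shard_count

    def insert(self, key, row):
        # Accept the write and spool it locally before forwarding.
        self.spool[self.shard_for(key)].append(row)

    def flush(self, send):
        """Forward spooled rows; send(shard, row) returns True on success."""
        for shard, queue in self.spool.items():
            # Stop at the first failure so row order per shard is preserved.
            while queue and send(shard, queue[0]):
                queue.popleft()

    def backlog(self):
        # Rows still waiting to be forwarded, per shard (non-empty only).
        return {shard: len(q) for shard, q in self.spool.items() if q}
```

If the `send` callback keeps failing for one shard, `backlog()` keeps growing for that shard while the others drain. That is roughly the "distributed inserts piling up" condition the alert described above reported.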