Implementing SRE practices and culture can be challenging. Fortunately, there are a variety of tools for each aspect of SRE: monitoring, SLOs and error budgeting, incident management, incident retrospectives, alerting, chaos engineering, and more. In this blog post, we’ll talk about what to look for in SRE tools and how they’ll help you on your journey to reliability excellence.
At the heart of all SRE decision-making is data. Without tracking latency, availability, and other reliability metrics throughout your system, you’ll have no way of knowing where to invest your development efforts. Monitoring tools such as AppDynamics, Datadog, Grafana, and Prometheus can help you collect this data and display it in ways that are easy to act on.
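To make the idea of reliability metrics concrete, here is a minimal sketch of how availability and a latency percentile could be derived from raw request records. The `Request` type, its field names, and the sample values are hypothetical, and the nearest-rank percentile is only one common definition; the monitoring tools named above may compute percentiles differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request: how long it took and whether it succeeded."""
    latency_ms: float
    ok: bool


def availability(requests):
    """Fraction of requests that succeeded (0.0 to 1.0)."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(1 for r in requests if r.ok) / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency, with pct in the range (0, 100]."""
    if not requests:
        raise ValueError("need at least one request")
    ordered = sorted(r.latency_ms for r in requests)
    # Nearest-rank: take the value at position ceil(pct/100 * n), counting from 1.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Made-up sample data for illustration only.
sample = [
    Request(120.0, True),
    Request(100.0, True),
    Request(900.0, False),
    Request(150.0, True),
]
print(f"availability: {availability(sample):.2%}")
print(f"p50 latency:  {latency_percentile(sample, 50)} ms")
```

In practice a monitoring tool would do this aggregation for you across far more data; the point is only that both numbers come from the same underlying stream of request records.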
Monitoring can be broken down into four main categories:
To get a full picture of your service, you’ll want to incorporate elements of all four of these categories. Most monitoring tools will provide options for multiple categories. Look for ones that integrate well with your existing tool stack, as you’ll need the monitoring tool to be able to gather and interpret data directly from your existing sources.
Try to find tools that can generate visualizations and reports that your team will find useful. For example, if you’re trying to see which services generate the most network traffic, look for a tool that can create pie charts of overall network usage.
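As a rough illustration of the data behind such a chart, the sketch below totals bytes per service and converts the totals into shares of overall traffic. The function name, the `(service, bytes)` sample format, and the service names are assumptions made for this example, not the output format of any particular monitoring tool.

```python
from collections import defaultdict


def traffic_share(samples):
    """Return {service: fraction of total bytes}, largest share first.

    `samples` is an iterable of (service_name, byte_count) pairs,
    a hypothetical format chosen for this sketch.
    """
    totals = defaultdict(int)
    for service, byte_count in samples:
        totals[service] += byte_count

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}

    # Sort by bytes descending so the heaviest services come first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {service: count / grand_total for service, count in ranked}


# Made-up sample data for illustration only.
samples = [("api", 600), ("db", 300), ("api", 100)]
for service, share in traffic_share(samples).items():
    print(f"{service}: {share:.0%}")
```

Each fraction corresponds to one slice of the pie chart, so a tool that can export or query per-service byte counts gives you everything this kind of report needs.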