The Netflix Engineering team recently blogged about Telltale, a monitoring and alerting tool that uses a variety of data sources to learn the typical health of an application. Telltale monitors the health of over 100 production-facing Netflix applications and also serves as an intelligent incident management tool.
Because metrics are central to understanding application health, Telltale shows only the relevant data for an application. It also surfaces important events, such as nearby deployments and regional traffic evacuations, which are essential context for an application's overall health. To convey the health of an application "at a glance," different colors and numbers indicate severity.
The "heart of Telltale" is the application health model, which captures signals from different sources. The view of the application is built from the types of these signals. The model's sources include the open-sourced Mantis, Netflix's failover architecture Project Nimble, the Netflix Streaming Supply Chain, and alerts from the alerting system.
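As a rough illustration of how signals from several sources might roll up into a single "at a glance" health value, here is a minimal Python sketch. It is not Telltale's actual model: the `Severity` levels, the `Signal` type, the `application_health` function, and the worst-signal-wins rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    """Hypothetical severity levels; a UI could map each to a color."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class Signal:
    """One observation about an application from a single source."""
    source: str          # e.g. "alerts" or "deployments" (illustrative names)
    severity: Severity
    detail: str = ""


def application_health(signals: Iterable[Signal]) -> Severity:
    """Assumed rule: the application is as unhealthy as its worst signal."""
    return max((s.severity for s in signals), default=Severity.OK)


signals = [
    Signal("alerts", Severity.WARNING, "error rate above baseline"),
    Signal("deployments", Severity.OK, "no recent deployment"),
]
overall = application_health(signals)
```

Taking the maximum severity is only one possible aggregation rule. A real health model would likely weigh signals by type and recency.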
Telltale's monitoring is based on different algorithms: statistical, rule-based, and machine learning. The alerts sent out from the system do not need constant tuning. Telltale's alerts are also context-aware and notify teams via Slack, email, or PagerDuty. Incident updates are posted in Slack message threads, which keeps communication about the application's current state in one place.
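To make the idea of mixing detection algorithms concrete, the following Python sketch combines a simple statistical check (a z-score against recent history) with a rule-based check (a fixed ceiling) and flags a metric when enough detectors agree. This is a sketch under stated assumptions, not Telltale's implementation: the function names, thresholds, and voting scheme are all hypothetical.

```python
from statistics import mean, stdev
from typing import Callable, Sequence

# A detector takes recent history and the current value, and returns True on anomaly.
Detector = Callable[[Sequence[float], float], bool]


def zscore_detector(threshold: float = 3.0) -> Detector:
    """Statistical check: flag values far from the historical mean."""
    def detect(history: Sequence[float], current: float) -> bool:
        if len(history) < 2:
            return False  # not enough data to estimate spread
        sigma = stdev(history)
        if sigma == 0:
            return current != history[0]
        return abs(current - mean(history)) / sigma > threshold
    return detect


def ceiling_detector(limit: float) -> Detector:
    """Rule-based check: flag values above a fixed limit."""
    def detect(history: Sequence[float], current: float) -> bool:
        return current > limit
    return detect


def is_unhealthy(history: Sequence[float], current: float,
                 detectors: Sequence[Detector], min_votes: int = 1) -> bool:
    """Assumed voting scheme: unhealthy when at least min_votes detectors fire."""
    votes = sum(1 for d in detectors if d(history, current))
    return votes >= min_votes


latency_ms = [100.0, 102.0, 98.0, 101.0, 99.0]   # illustrative history
detectors = [zscore_detector(3.0), ceiling_detector(120.0)]
spike = is_unhealthy(latency_ms, 150.0, detectors)
normal = is_unhealthy(latency_ms, 101.0, detectors)
```

Requiring agreement between detectors (a higher `min_votes`) trades sensitivity for fewer false alarms, which is one way a system could reduce the need for per-alert threshold tuning.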
To provide better context when raising an incident alert, Telltale highlights possible causes. For post-incident review, the Application Incident Summary shows all recent issues and total downtime, creating an archive of incidents.