Netflix Presents Telltale, an Application Health Monitoring Tool

The Netflix Engineering team recently blogged about Telltale, a monitoring and alerting tool that uses a variety of data sources to learn the typical health of an application. Telltale monitors the health of over 100 production-facing Netflix applications and also serves as an intelligent incident management tool.

Because metrics are central to understanding application health, Telltale shows only the relevant data for an application. It also surfaces important events, such as nearby deployments and regional traffic evacuations, which matter for an application's overall health. To convey the health of an application "at a glance," different colors and numbers indicate severity.

Source: https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba

The "heart of Telltale" is the application health model, which captures signals from different sources. The view of the application is built according to the types of these signals. Sources for the model include the open-sourced Mantis, the Netflix failover architecture Project Nimble, the Netflix Streaming Supply Chain, and alerts from the alerting system.
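As a rough illustration of how a health model might fold signals from several sources into one view, here is a minimal Python sketch. It is not Telltale's implementation: the `Signal` and `Severity` types, the source labels, and the "worst signal wins" rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; the article only says colors and numbers are used.
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class Signal:
    source: str         # hypothetical source label, e.g. "metrics" or "alerting"
    name: str           # what was measured
    severity: Severity  # how bad this signal currently looks


def application_health(signals):
    """Return the worst severity seen and the signals responsible for it.

    Assumption: overall health is the worst individual signal.
    """
    if not signals:
        return Severity.HEALTHY, []
    worst = max(s.severity for s in signals)
    if worst == Severity.HEALTHY:
        return worst, []
    return worst, [s for s in signals if s.severity == worst]


signals = [
    Signal("metrics", "error_rate", Severity.WARNING),
    Signal("alerting", "latency_alert", Severity.CRITICAL),
    Signal("streaming", "throughput", Severity.HEALTHY),
]
level, culprits = application_health(signals)
print(level.name, [s.name for s in culprits])
```

Returning the responsible signals alongside the overall level is one simple way to show only the relevant data for an unhealthy application.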

Telltale's monitoring combines different kinds of algorithms: statistical, rule-based, and machine learning. The alerts sent out from the system do not need constant tuning. Telltale's alerts are also context-aware and are delivered to teams via Slack, email, or PagerDuty. Incident updates are posted in Slack message threads, which keeps everyone informed about the application's current state.
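To make the idea of mixing detection approaches concrete, the sketch below pairs a simple statistical check (a z-score against recent history) with a rule-based hard limit. The function names, the threshold of 3 standard deviations, and the sample values are assumptions for illustration; the article does not describe the actual algorithms Telltale uses.

```python
from statistics import mean, stdev


def statistical_anomaly(history, current, z_threshold=3.0):
    """Flag a value that sits more than z_threshold standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is unusual.
        return current != history[0]
    return abs(current - mean(history)) / sigma > z_threshold


def rule_violation(current, hard_limit):
    """Rule-based check: the value must never exceed a fixed limit."""
    return current > hard_limit


def is_unhealthy(history, current, hard_limit):
    """Combine both checks; either one is enough to mark the metric unhealthy."""
    return rule_violation(current, hard_limit) or statistical_anomaly(history, current)


latency_ms = [100, 102, 98, 101, 99]  # made-up recent samples
print(is_unhealthy(latency_ms, 101, hard_limit=500))
print(is_unhealthy(latency_ms, 130, hard_limit=500))
```

Because the statistical check adapts to the metric's own history, the fixed limit can stay loose, which is one plausible way a combined approach reduces manual threshold tuning.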

To provide better context when raising an incident alert, Telltale highlights possible causes. For post-incident review, the Application Incident Summary shows all recent issues and the total downtime, thereby creating an archive of incidents.
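A summary that reports total downtime has to decide how to treat incidents that overlap in time. The following sketch, with a hypothetical `Incident` record and made-up timestamps, merges overlapping windows before summing so the same minutes are not counted twice. This is one reasonable approach, not a description of how Telltale computes its figure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; field names are assumptions.
    title: str
    started: datetime
    resolved: datetime


def total_downtime(incidents):
    """Sum incident durations, merging overlapping windows to avoid double counting."""
    windows = sorted((i.started, i.resolved) for i in incidents)
    total = timedelta()
    current_start = current_end = None
    for start, end in windows:
        if current_end is None or start > current_end:
            # A gap: close the previous window and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the open window if this incident ends later.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


day = datetime(2020, 9, 1)
incidents = [
    Incident("elevated errors", day.replace(hour=10), day.replace(hour=10, minute=30)),
    Incident("latency spike", day.replace(hour=10, minute=20), day.replace(hour=10, minute=45)),
    Incident("failed deploy", day.replace(hour=14), day.replace(hour=14, minute=10)),
]
print(total_downtime(incidents))
```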
