In this short clip, our Software Engineer Rastislav Szabo gives a brief overview of how a typical monitoring, logging & alerting stack in a Kubernetes cluster looks like.
Many enterprises and SaaS companies depend on a variety of external API integrations in order to build an awesome customer experience. Some integrations may outsource certain business functionality such as handling payments or search to companies like Stripe and Algolia. You may have integrated other partners which expand the functionality of your product offering, For example, if you want to add real-time alerts to an analytics tool, you might want to integrate the PagerDuty and Slack APIs into your application.
If you’re like most companies though, you’ll soon realize you’re integrating hundreds of different vendors and partners into your app. Any one of them could have performance or functional issues impacting your customer experience. Worst yet, the reliability of an integration may be less visible than your own APIs and backend. If the login functionality is broken, you’ll have many customers complaining they cannot log into your website. However, if your Slack integration is broken, only the customers who added Slack to their account will be impacted. On top of that, since the integration is asynchronous, your customers may not realize the integration is broken until after a few days when they haven’t received any alerts for some time.
How do you ensure your API integrations are reliable and high performing? After all, if you’re selling a feature real-time alerting, you’re alerts better well be real-time and have at least once guaranteed delivery. Dropping alerts because your Slack or PagerDuty integration is unacceptable from a customer experience perspective.
Specific API integrations that have an exceedingly high latency could be a signal that your integration is about to fail. Maybe your pagination scheme is incorrect or the vendor has not indexed your data in the best way for you to efficiently query.
Average latency only tells you half the story. An API that consistently takes one second to complete is usually better than an API with high variance. For example if an API only takes 30 milliseconds on average, but 1 out of 10 API calls take up to five seconds, then you have high variance in your customer experience. This is makes it much harder to track down bugs and harder to handle in your customer experience. This is why 90th percentile and 95th percentiles are important to look at.
Reliability is a key metric to monitor especially since your integrating APIs that you don’t have control over. What percent of API calls are failing? In order to track reliability, you should have a rigid definition on what constitutes a failure.
While any API call that has a response status code in the 4xx or 5xx family may be considered an error, you might have specific business cases where the API appears to successfully complete yet the API call should still be considered a failure. For example, a data API integration that returns no matches or no content consistently could be considered failing even though the status code is always 200 OK. Another API could be returning bogus or incomplete data. Data validation is critical for measuring where the data returned is correct and up to date.
Not every API provider and integration partner follows suggested status code mapping
While reliability is specific to errors and functional correctness, availability and uptime is a pure infrastructure metric that measures how often a service has an outage, even if temporary. Availability is usually measured as a percentage of uptime per year or number of 9’s.
AVAILABILITY %DOWNTIME PER YEARDOWNTIME PER MONTHDOWNTIME PER WEEKDOWNTIME PER DAY90% (“one nine”)36.53 days73.05 hours16.80 hours2.40 hours99% (“two nines”)3.65 days7.31 hours1.68 hours14.40 minutes99.9% (“three nines”)8.77 hours43.83 minutes10.08 minutes1.44 minutes99.99% (“four nines”)52.60 minutes4.38 minutes1.01 minutes8.64 seconds99.999% (“five nines”)5.26 minutes26.30 seconds6.05 seconds864.00 milliseconds99.9999% (“six nines”)31.56 seconds2.63 seconds604.80 milliseconds86.40 milliseconds99.99999% (“seven nines”)3.16 seconds262.98 milliseconds60.48 milliseconds8.64 milliseconds99.999999% (“eight nines”)315.58 milliseconds26.30 milliseconds6.05 milliseconds864.00 microseconds99.9999999% (“nine nines”)31.56 milliseconds2.63 milliseconds604.80 microseconds86.40 microseconds
Many API providers are priced on API usage. Even if the API is free, they most likely have some sort of rate limiting implemented on the API to ensure bad actors are not starving out good clients. This means tracking your API usage with each integration partner is critical to understand when your current usage is close to the plan limits or their rate limits.
It’s recommended to tie usage back to your end-users even if the API integration is quite downstream from your customer experience. This enables measuring the direct ROI of specific integrations and finding trends. For example, let’s say your product is a CRM, and you are paying Clearbit $199 dollars a month to enrich up to 2,500 companies. That is a direct cost you have and is tied to your customer’s usage. If you have a free tier and they are using the most of your Clearbit quota, you may want to reconsider your pricing strategy. Potentially, Clearbit enrichment should be on the paid tiers only to reduce your own cost.
Monitoring API integrations seems like the correct remedy to stay on top of these issues. However, traditional Application Performance Monitoring (APM) tools like New Relic and AppDynamics focus more on monitoring the health of your own websites and infrastructure. This includes infrastructure metrics like memory usage and requests per minute along with application level health such as appdex scores and latency. Of course, if you’re consuming an API that’s running in someone else’s infrastructure, you can’t just ask your third-party providers to install an APM agent that you have access to. This means you need a way to monitor the third-party APIs indirectly or via some other instrumentation methodology.
#monitoring #api integration #api monitoring #monitoring and alerting #monitoring strategies #monitoring tools #api integrations #monitoring microservices
In this article, I want to showcase how one can get an email notification as an alert when “Something went wrong” using Log Analytics query to evaluate resources and Logs every set frequency (fire an alert based on the results). This article covers the basic features of Application Insight Log Alert.
Alerts proactively notify you when issues found with your infrastructure or application using your monitoring data in Azure Monitor. They allow you to identify and address issues before the users of your system notice them.
1. Request rates, response times, and failure rates: Find out which pages are most popular, at what times of day, and where your users are. See which pages perform best. If your response times and failure rates go high when there are more requests, then perhaps you have a resourcing problem.
2. Dependency rates, response times, and failure rates: Find out whether external services are slowing you down.
3. Exceptions: Analyze the aggregated statistics, or pick specific instances and drill into the stack trace and related requests. Both server and browser exceptions are reported.
4. Page views and load performance: Reported by your users’ browsers. AJAX calls from web pages - rates, response times, and failure rates. User and session count.
5. Performance counters from your Windows or Linux server machines, such as CPU, memory, and network usage. Host diagnostics from Docker or Azure.
6. Diagnostic trace logs from your app: so that you can correlate trace events with requests.
7. Custom events and metrics that you write yourself in the client or server code, to track business events such as items sold or games won.
It sends all logs and telemetry data to central database.
We are not monitoring the monitors all the time that is why we can configure Application Insights to send us alerts whenever something went wrong like ‘404 Errors’, ‘500 Errors’, ‘’ etc.
#azure #azure monitor #log alert
The project is located on GitHub: https://github.com/sevdokimov/log-viewer
There are many tools in the world for analysis logs, but most of my colleagues used the simple text editor or “less” command in the terminal. Log analysis tools can be divided into two groups:
The main disadvantage of #1 is the time of downloading logs from a remote machine. If the log weighs about 1G, it’s not usable to download it.
Log aggregator is the right solution for serious production environments, but they require additional resources for storing index and additional configuration to collect logs. In some cases, using a log aggregator is overkill.
I got an idea of how to make a log viewer that has some advantages of log aggregators, but actually, it’s a pure viewer. It doesn’t require additional storage for an index, doesn’t download log files to the local machine, but allows viewing logs on remote servers with nice features like filtering, search, merging events from several log files to one view.
The idea is to run a tiny Web UI on a server that provides access to log files located on the server. LogViewer doesn’t load an entire file to the memory. It loads only the part that the user is watching. This approach allows displaying huge log files without significant memory consumption. If we need monitoring logs on more than one node, LogViewer must be run on all nodes. Each LogViewer instance can connect to other instances and request data located on remote nodes. So each LogViewer can show events from the entire cluster merged into one view by date.
It’s important to make the tool as easy to use as possible. That’s why LogViewer can detect log format automatically, there is no required configuration, and log representation is close to a text editor representation.
#spring boot #log analytics #debugging tools #log viewer #a new tool for monitoring logs
Most enterprises already have a reliable logging and monitoring system in place, so why should you worry about it in the context of Kubernetes? Well, traditional logging and monitoring tools are designed for stable infrastructure and application deployments. Cloud native environments, on the other hand, are highly dynamic. The IT world has changed and so must your toolkit.
A key challenge is that traditional systems rely on anchors like IP or machine addresses, but in a cloud native setting, these factors are continuously changing. Containers are spun up and down, and redeployed on different VMs. The same applies to VMs which are redeployed on different node pools and even in different segments. To keep track of all this, you need a new breed of logging and monitoring system, one that is cloud native.
You do have a few options which we’ll categorize into four groups:
While managed solutions are easier and faster to set up, reducing time to market, you are also giving a third-party access to your logs. If you’re dealing with sensitive data, this is not an option. If data sensitivity isn’t an issue and you select a managed solution, you’ll need to integrate it with your app and infrastructure lifecycle practices. Meaning, as soon as a cluster is created, logging and monitoring must be automatically triggered. Otherwise, it’ll be a manual process and, as we all know, manual processes are error-prone.
In case you have a custom-built or pre-existing log collection and monitoring framework and are introducing a container orchestration platform into your technology stack, you need to consider and plan for the integration of these technologies. Make sure they really are compatible.
Cloud-hosted logging and monitoring are really convenient and cost-effective. If all your apps are running in a single cloud and you have no intention to change that in the near future, this is likely your best option. But beware, if you do decide to move to a different cloud, you won’t be able to migrate your system. A lot of work and effort will be involved.
We generally recommend the fourth category as it provides the flexibility to switch vendors or environments without having to reinvest into a new logging and monitoring solution. Some vendors, Kublr included, leverage popular open source projects such as Prometheus, Grafana, and Elasticsearch, which you can continue using even if you switch platforms. If provided by a vendor, it will already be pre-configured (e.g. the ability to scrape metrics, query, summarize, define custom dashboards and reports, etc.), speeding up your adoption of the cloud native stack.
The real value of open source tools such as Prometheus, Grafana, and Elasticsearch comes at a higher level, though. When building your own application alerts, dashboard, and charts, you can easily migrate them from one environment to another without losing all the work you’ve already done. There is a real opportunity here not to lock you in, and we believe you should take it.
#kubernetes #logging & monitoring system #self-managed logging and monitoring
In a production environment, a downtime of even a few microseconds is intolerable. Debugging such issues is time-critical. Proper logging and monitoring of infrastructure help in debugging such scenarios. It also helps in optimizing cost and other resources proactively. It also helps to detect any impending issue which may arise in the near future. There are various logging and monitoring solutions available in the market.
In this post, we will walk through the steps to deploy Grafana Loki in a Kubernetes environment for log monitoring and alerting. This is due to its seamless compatibility with Prometheus; a widely used software for collecting metrics. Grafana Loki consists of three components Promtail, Loki, and, Grafana (PLG) which we will see in brief before proceeding to the deployment. This article provides a better insight into the architectural differences of PLG and other primary logging and monitoring stack like Elasticsearch-FluentD-Kibana (EFK).
#grafana #loki #monitoring #alerting