If you’ve migrated from a monolith to a microservices architecture, you’ve probably experienced this sentiment:
Modern systems are far more complex to monitor.
Microservices combined with containerized deployment results in highly dynamic systems with many moving parts across multiple layers.
These systems emit massive amounts of high-dimensional telemetry data from hardware and the operating system, through Docker and Kubernetes, all the way to application and business performance metrics.
Many have come to realize that the commonly prescribed Graphite+StatsD monitoring stack is no longer sufficient to cover their bases. Why is that? And what is the current approach to monitoring?
In this post, I will look at the characteristics of modern systems and the new challenges they raise for monitoring. I will also discuss common solutions for monitoring, from the days of Graphite and StatsD to the currently dominant Prometheus. I’ll review the benefits of Prometheus as well as its limitations, and how open source time-series databases can help overcome some of these limitations.

The challenges of monitoring modern systems
Let’s have a look at the changes in software systems and the challenges they introduce in system monitoring.
1. Open source and cloud services
Open source and SaaS increased the selection of readily available libraries, tools, and frameworks for building software systems, including web servers, databases, message brokers, and queues. That turned system architecture into a composite of multiple third-party frameworks. Moreover, many frameworks are now clustered solutions with their own suite of moving parts (just think Hadoop or Kafka, compared to traditional MySQL).
#kubernetes #opensource #timeseries #microservices
Many enterprises and SaaS companies depend on a variety of external API integrations to build a great customer experience. Some integrations outsource business functionality such as payments or search to companies like Stripe and Algolia. Others expand the functionality of your product offering: for example, if you want to add real-time alerts to an analytics tool, you might integrate the PagerDuty and Slack APIs into your application.
If you’re like most companies, though, you’ll soon realize you’re integrating hundreds of different vendors and partners into your app. Any one of them could have performance or functional issues that impact your customer experience. Worse yet, the reliability of an integration is often less visible than that of your own APIs and backend. If login is broken, you’ll have many customers complaining they cannot log into your website. However, if your Slack integration is broken, only the customers who added Slack to their account will be impacted. On top of that, since the integration is asynchronous, your customers may not realize it is broken until days later, when they notice they haven’t received any alerts for some time.
How do you ensure your API integrations are reliable and performant? After all, if you’re selling a feature like real-time alerting, your alerts had better be real-time and offer at-least-once delivery guarantees. Dropping alerts because your Slack or PagerDuty integration failed is unacceptable from a customer experience perspective.
A specific API integration with exceedingly high latency can be a signal that the integration is about to fail. Maybe your pagination scheme is incorrect, or the vendor has not indexed your data in a way that lets you query it efficiently.
Average latency only tells you half the story. An API that consistently takes one second to complete is usually better than one with high variance. For example, if an API takes 30 milliseconds on average but 1 out of 10 calls takes up to five seconds, your customer experience has high variance; that makes bugs much harder to track down and the slow cases harder to handle in your customer experience. This is why the 90th and 95th percentiles are important to look at.
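As a minimal sketch of what this means in practice, here is how you might compute those percentiles from raw latency samples in plain Python; the sample data is made up to show how a single outlier skews the average:

```python
def percentile(samples, pct):
    """Return the pct-th percentile (0-100) of a list of latency samples,
    using an approximate nearest-rank lookup on the sorted data."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

# Hypothetical latency samples in milliseconds: mostly fast, one slow outlier
latencies_ms = [30, 28, 35, 31, 29, 33, 30, 27, 32, 5000]

print("avg:", sum(latencies_ms) / len(latencies_ms))  # 527.5 -- skewed by the outlier
print("p90:", percentile(latencies_ms, 90))           # 35 -- the typical experience
print("p95:", percentile(latencies_ms, 95))           # 5000 -- the tail your users feel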
Reliability is a key metric to monitor, especially since you’re integrating APIs you don’t control. What percentage of API calls are failing? To track reliability, you need a precise definition of what constitutes a failure.
While any API call with a response status code in the 4xx or 5xx family may be considered an error, you might have business cases where the API appears to complete successfully yet the call should still be counted as a failure. For example, a data API integration that consistently returns no matches or no content could be considered failing even though the status code is always 200 OK. Another API could be returning bogus or incomplete data. Data validation is critical for measuring whether the data returned is correct and up to date.
Keep in mind that not every API provider and integration partner follows the suggested status code mappings.
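To make that definition concrete, here is a hedged sketch of a failure classifier. The empty-payload rule and the field names are illustrative assumptions, not a standard; what counts as "no content" is domain-specific:

```python
def classify_api_call(status_code, payload):
    """Classify an API response as 'ok' or 'failure' using both the
    status code and business-level checks on the returned payload."""
    # Transport-level failures: anything in the 4xx or 5xx families
    if 400 <= status_code < 600:
        return "failure"
    # Business-level failures: a 200 OK that carries no usable data
    # (illustrative rule -- tune it to your own domain)
    if status_code == 200 and not payload:
        return "failure"
    return "ok"

calls = [(200, {"matches": ["acme"]}), (200, {}), (503, None)]
failures = sum(classify_api_call(s, p) == "failure" for s, p in calls)
print(f"error rate: {failures / len(calls):.0%}")  # 67% in this toy sample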
While reliability is about errors and functional correctness, availability and uptime are pure infrastructure metrics that measure how often a service has an outage, even a temporary one. Availability is usually expressed as a percentage of uptime per year, or as a number of nines.
| Availability % | Downtime per year | Downtime per month | Downtime per week | Downtime per day |
| --- | --- | --- | --- | --- |
| 90% (“one nine”) | 36.53 days | 73.05 hours | 16.80 hours | 2.40 hours |
| 99% (“two nines”) | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes |
| 99.9% (“three nines”) | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
| 99.99% (“four nines”) | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| 99.999% (“five nines”) | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
| 99.9999% (“six nines”) | 31.56 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
| 99.99999% (“seven nines”) | 3.16 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
| 99.999999% (“eight nines”) | 315.58 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
| 99.9999999% (“nine nines”) | 31.56 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |
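The table values fall out of simple arithmetic: the downtime budget for a period is the period length multiplied by (1 − availability). A quick sketch, using a 365.25-day year as the table does:

```python
PERIODS = {"year": 365.25 * 24 * 3600, "month": 365.25 * 24 * 3600 / 12,
           "week": 7 * 24 * 3600, "day": 24 * 3600}

def downtime_budget(availability):
    """Allowed downtime in seconds for each period at a given availability."""
    return {name: secs * (1 - availability) for name, secs in PERIODS.items()}

for a in (0.99, 0.999, 0.9999):
    print(a, {k: f"{v:.2f}s" for k, v in downtime_budget(a).items()})
# e.g. 99.9% over a year: 31,557,600s * 0.001 = 31,557.6s, i.e. ~8.77 hours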
Many API providers price on usage. Even if the API is free, it most likely has some form of rate limiting to ensure bad actors are not starving out good clients. This means tracking your API usage with each integration partner is critical to understanding when your current usage is close to your plan limits or their rate limits.
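One lightweight way to track this is to read the rate-limit headers many providers return on every response. This sketch assumes GitHub-style `X-RateLimit-*` header names; the exact names vary by provider, and the endpoint is hypothetical:

```python
import requests  # third-party HTTP client: pip install requests

def check_rate_limit(resp, warn_ratio=0.1):
    """Warn when less than warn_ratio of the rate-limit budget remains.
    Header names are provider-specific; X-RateLimit-* is just a common style."""
    limit = resp.headers.get("X-RateLimit-Limit")
    remaining = resp.headers.get("X-RateLimit-Remaining")
    if limit is None or remaining is None:
        return  # this provider does not expose rate-limit headers
    if int(remaining) < int(limit) * warn_ratio:
        print(f"warning: only {remaining}/{limit} calls left in this window")

resp = requests.get("https://api.example.com/v1/companies")  # hypothetical endpoint
check_rate_limit(resp)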
It’s recommended to tie usage back to your end users, even if the API integration is quite downstream from your customer experience. This lets you measure the direct ROI of specific integrations and spot trends. For example, say your product is a CRM and you are paying Clearbit $199 a month to enrich up to 2,500 companies; that’s a direct cost of roughly eight cents per enrichment, tied to your customers’ usage. If free-tier customers are consuming most of your Clearbit quota, you may want to reconsider your pricing strategy; perhaps Clearbit enrichment should be available on paid tiers only, to reduce your own cost.
Monitoring API integrations seems like the obvious remedy for staying on top of these issues. However, traditional Application Performance Monitoring (APM) tools like New Relic and AppDynamics focus on the health of your own websites and infrastructure: infrastructure metrics like memory usage and requests per minute, along with application-level health such as Apdex scores and latency. Of course, if you’re consuming an API that runs on someone else’s infrastructure, you can’t just ask your third-party providers to install an APM agent you have access to. This means you need to monitor third-party APIs indirectly, via some other instrumentation methodology.
#monitoring #api integration #api monitoring #monitoring and alerting #monitoring strategies #monitoring tools #api integrations #monitoring microservices
On Wednesday, March 11, 2020, I conducted the webinar titled “Monitoring & Orchestrating Your Microservices Landscape using Workflow Automation”. Not only was I overwhelmed by the number of attendees, but we also got a huge list of interesting questions before and especially during the webinar. Some of them were answered, but a lot of them were not. I want to answer all open questions in this series of seven blog posts. Today I am posting the final two in the series.
Note that we have also started to experiment with the Camunda Question Corner and are discussing making it more frequent, so keep an eye on our community for more opportunities to ask anything (especially as in-person events are canceled for some time).
Part 1: BPMN & modeling related questions (6 answers)
Part 2: Architecture related questions (12)
Part 3: Stack & technology questions (6)
Part 4: Camunda product-related questions (5)
Part 5: Camunda Optimize specific questions (3)
Part 6: Questions about best practices (5)
This is quite a complex question, as it depends on the exact architecture and technology you want to use.
Example 1: You use Camunda embedded as a library, probably via the Spring Boot starter. In this case, your business data can live in the same database as the workflow context, so you can join one ACID transaction and everything will be strongly consistent.
Example 2: You leverage Camunda Cloud and code your service in Node.js, storing data in some database. Now you have no shared transaction: you start living in the eventually consistent world and need to rely on “at-least-once” semantics. This is not a problem per se, but it at least requires some thinking about the situations that can arise. I should probably write a separate piece about that, but I have used this picture in the past to explain the problem (and this very basic blog post might help as well):
So you can end up with money charged on the credit card but the workflow not knowing about it. In this case you leverage the retry capabilities and will be fine soon (= eventually).
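A minimal sketch of that pattern: retry until the charge is confirmed (at-least-once), and pass an idempotency key so that a retry after a lost response does not charge the card twice. The `charge()` call and the key scheme here are hypothetical stand-ins for your payment provider’s API, not a specific Camunda or provider interface:

```python
import time
import uuid

def charge_with_retries(payment_api, amount_cents, max_attempts=5):
    """At-least-once semantics: keep retrying until the provider confirms.
    The idempotency key makes retries safe -- the provider deduplicates them."""
    idempotency_key = str(uuid.uuid4())  # fixed across all attempts of this charge
    for attempt in range(1, max_attempts + 1):
        try:
            # Hypothetical provider call; real payment APIs accept a similar
            # idempotency key parameter for exactly this reason.
            return payment_api.charge(amount_cents, idempotency_key=idempotency_key)
        except ConnectionError:
            # We don't know whether the charge went through -- retry with the
            # SAME key so a duplicate attempt is recognized and ignored.
            time.sleep(2 ** attempt)  # exponential backoff between attempts
    raise RuntimeError("charge still unconfirmed; escalate to an operator")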
#microservices #monitoring-microservices #microservices-or #workflow-automation #process-automation #bpmn #workflow #developers-workflow
The shift towards microservices and modular applications makes testing more important and more challenging at the same time. You have to make sure that the microservices running in containers perform well and as intended, but you can no longer rely on conventional testing strategies to get the job done.
This is where new testing approaches are needed. Testing microservices applications requires the right approach, a suitable set of tools, and close attention to detail. This article will guide you through the process of testing your microservices and discuss the challenges you will have to overcome along the way. Let’s get started, shall we?
Traditionally, testing a monolith application meant configuring a test environment and setting up all of the application components in a way that matched the production environment. It took time to set up the testing environment, and there were a lot of complexities around the process.
Testing also requires the application to run in full. It is not possible to test monolith apps on a per-component basis, mainly because there is usually a shared codebase that ties everything together, and the app is designed to run as a whole to work properly.
Microservices running in containers offer one particular advantage: universal compatibility. You don’t have to exactly match the testing environment to the deployment architecture, and in some situations you can get away with testing individual components rather than the full app.
Of course, you will have to embrace the new cloud-native approach across the pipeline. Rather than creating critical dependencies between microservices, you need to treat each one as a semi-independent module.
The only monolithic or centralized portion of the application is the database, but this too is an easy challenge to overcome. As long as you have a persistent database running in your test environment, you can perform tests at any time.
Keep in mind that there are additional things to focus on when testing microservices.
Test containers are the method of choice for many developers. Unlike monolith apps, which let you rely on stubs and mocks for testing, microservices are typically tested in test containers. Many CI/CD pipelines actually integrate production microservices as part of the testing process.
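For example, with the testcontainers-python library you can spin up a throwaway Postgres instance for a single test and have it torn down automatically. This is a sketch; the table and query are made up for illustration:

```python
# pip install testcontainers sqlalchemy psycopg2-binary
import sqlalchemy
from testcontainers.postgres import PostgresContainer

def test_orders_table_roundtrip():
    # Starts a real Postgres in a container; removed when the block exits
    with PostgresContainer("postgres:16") as pg:
        engine = sqlalchemy.create_engine(pg.get_connection_url())
        with engine.begin() as conn:
            conn.execute(sqlalchemy.text("CREATE TABLE orders (id int, total int)"))
            conn.execute(sqlalchemy.text("INSERT INTO orders VALUES (1, 42)"))
            total = conn.execute(sqlalchemy.text("SELECT total FROM orders")).scalar()
        assert total == 42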
As mentioned before, there are many ways to test microservices effectively, but the one approach developers now rely on is contract testing. Loosely coupled microservices can be tested effectively and efficiently with contract testing, mainly because this approach focuses on contracts; in other words, on how components or microservices communicate with each other.
Syntax and semantics define how components communicate with each other. By defining both in a standardized way, and testing microservices on their ability to generate the right message formats and meet behavioral expectations, you can rest assured that the microservices will behave as intended when deployed.
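As an illustration, here is a hand-rolled consumer-side contract check: the consumer pins down the message format it depends on, and the provider’s responses are verified against it. Real projects often use a dedicated tool such as Pact for this; the schema below is invented for the example:

```python
# A consumer-driven contract: the fields and types this consumer relies on
USER_CONTRACT = {"id": int, "email": str, "active": bool}

def satisfies_contract(response_body, contract=USER_CONTRACT):
    """Check that the provider's response contains every field the
    consumer depends on, with the expected type."""
    return all(
        field in response_body and isinstance(response_body[field], expected)
        for field, expected in contract.items()
    )

# Provider-side verification against a canned (or live) response
assert satisfies_contract({"id": 7, "email": "a@b.co", "active": True})
assert not satisfies_contract({"id": "7", "email": "a@b.co"})  # wrong type, missing field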
#testing #software testing #test automation #microservice architecture #microservice #test #software test automation #microservice best practices #microservice deployment #microservice components
We have been building software applications for many years using various tools, technologies, architectural patterns, and best practices. It is evident that many software applications become large, complex monoliths over time for various reasons. A monolithic application is like a large ball of spaghetti with criss-cross dependencies among its constituent modules. Monoliths become increasingly complex to develop, deploy, and maintain, constraining the agility and competitive advantage of development teams. And let us not underestimate the challenge of clearing the technical debt monoliths accumulate, as changing part of a monolith’s code can have a cascading impact that destabilizes working software in production.
Over the years, architectural patterns such as Service Oriented Architecture (SOA) and Microservices have emerged as alternatives to Monoliths.
SOA was arguably the first architectural pattern aimed at solving typical monolith issues by breaking a large, complex software application down into sub-systems, or “services”. All these services communicate over a common enterprise service bus (ESB). However, these sub-systems or services are actually mid-sized monoliths, as they share the same database. Also, more and more service-aware logic gets added to the ESB, which becomes a single point of failure.
The microservices architectural pattern has gathered steam thanks to large-scale adoption by companies like Amazon, Netflix, SoundCloud, and Spotify. It breaks a large software application down into a number of loosely coupled microservices. Each microservice is responsible for a specific, discrete task, can have its own database, and can communicate with other microservices through Application Programming Interfaces (APIs) to solve a large, complex business problem. Each microservice can be developed, deployed, and maintained independently, as long as it honors the well-defined set of APIs, called a contract, through which it communicates with other microservices.
#microservice architecture #microservice #scaling #thought leadership #microservices build #microservice
This write-up is not about how to implement microservices. It’s more about what we need to consider when designing a microservices architecture and when microservices are in use. But first, we need to broadly agree on what microservices are.
What are Microservices?
It’s 2020, so most of us can define what a Microservice is, even if we have not used them. It can go something like:
And if we get a bit into domain/team:
Problems start when:
A lot of us started with a model like the following:
#microservices #microservices-visibility #microservices-design