In software systems, we often want to spin up temporary services or jobs that can be terminated as soon as they have performed a specific task. For example, if we want to send out user notifications every morning, we don’t want that service running all day long. Instead, a service can spin up at a specific time, perform the task, and shut down, saving resources and cost.

As with long-running services, these ephemeral jobs (in Kubernetes terms, Batch Jobs or CronJobs) also need to be monitored, both to gather critical metrics and for alerting when something goes wrong.

This article describes how we can monitor these short-lived jobs by explaining:

What the problem with Prometheus is
What PushGateway is and why it is necessary
A demo using a sample application

Prometheus Architecture:

A short introduction to Prometheus, in case you are not already familiar with it:

Prometheus is an open-source, metrics-based monitoring system that records multidimensional time-series data. It supports powerful querying, dashboards, and alerting, and offers several integrations to connect to a variety of systems.

How does it work?

Prometheus uses a pull model (also called scraping) to collect metrics: the Prometheus server reaches out to the specified services by calling their configured HTTP endpoints and pulls the metrics from them.

For example, the following configuration in the prometheus.yml file tells the Prometheus server to fetch metrics from the specified target every 5 seconds:

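scrape_configs:
  - job_name: 'auth_server'
    scrape_interval: 5s
    # The path is set separately from the target; /metrics is the default.
    metrics_path: /metrics
    static_configs:
      # Targets take host:port only, without a URL path.
      - targets: ['auth.server.com:8080']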
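
For the scrape to succeed, the target has to serve its current metric values over HTTP in the Prometheus text exposition format. The following is a minimal sketch of such a target using only the Python standard library. The metric name `auth_requests_total`, the helpers `render_metrics` and `serve`, and the in-memory counter are hypothetical and chosen for illustration; a real service would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter; a real service would update it as it works.
METRICS = {"auth_requests_total": 42}


def render_metrics(metrics):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the scrape path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` keeps the process alive so that every scrape finds something to read, which is exactly the assumption that short-lived jobs break.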

Problem?

Scraping works well for long-running services, since they stay available long enough for the Prometheus server to make a request and collect their metrics. This architecture also brings a few other benefits, such as health checks and the absence of bottlenecks.

But for short-lived services such as K8s Batch/Cron Jobs, by the time Prometheus decides to collect metrics, those pods might have long since terminated.
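
To make the timing problem concrete, here is a small sketch that lists which scrapes would land inside a job's lifetime. It assumes that scrapes fire at exact multiples of the interval starting at t = 0, which is a simplification, and the function name `scrapes_during` is hypothetical.

```python
import math


def scrapes_during(job_start, job_duration, scrape_interval):
    """Return the scrape instants (in seconds) that fall within the job's lifetime.

    Assumes scrapes happen at 0, scrape_interval, 2 * scrape_interval, ...
    """
    job_end = job_start + job_duration
    # First scrape instant at or after the job starts.
    t = math.ceil(job_start / scrape_interval) * scrape_interval
    instants = []
    while t < job_end:
        instants.append(t)
        t += scrape_interval
    return instants


# A job that starts at t=1s and runs for 2s is never seen with a 5s interval.
print(scrapes_during(1, 2, 5))   # []
# A service that lives for 12s is scraped twice.
print(scrapes_during(1, 12, 5))  # [5, 10]
```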

For these kinds of use cases, a push model is necessary so that those services can publish their metrics when desired, for example during shutdown.
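
As a rough sketch of what such a push could look like, the snippet below sends one metric to a Pushgateway at the end of a job, using only the Python standard library. The gateway address, the job name `notification_job`, the metric name `job_duration_seconds`, and both helper names are assumptions for illustration. The Pushgateway accepts metrics in the text exposition format via HTTP PUT or POST on `/metrics/job/<job_name>`.

```python
import time
import urllib.request


def build_push_request(gateway_url, job_name, metrics_text):
    """Build a PUT request that replaces all metrics pushed for this job.

    Assumes job_name contains only URL-safe characters.
    """
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job_name}"
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push_on_shutdown(gateway_url, job_name, started_at):
    """Push the job's duration just before the process exits."""
    duration = time.time() - started_at
    body = (
        "# TYPE job_duration_seconds gauge\n"
        f"job_duration_seconds {duration:.3f}\n"
    )
    request = build_push_request(gateway_url, job_name, body)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A job would record `started_at = time.time()` when it begins and call `push_on_shutdown("http://localhost:9091", "notification_job", started_at)` as its last step, assuming a Pushgateway is listening on its default port 9091.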

#serverless #devops #prometheus #programming #software-engineering #pushgateway

itnext.io

Ephemeral Jobs Monitoring Using Prometheus PushGateway