Streamlining System Monitoring and Alerting with Prometheus

benedev

Benny Yang

Posted on March 14, 2024


Prometheus is a pivotal player in the monitoring and alerting landscape.
Its core strength is efficiently collecting and querying time-series data in real time, which makes it indispensable for monitoring system health and performance metrics and for alerting on anomalies.
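
For a quick taste of that query side: once a Prometheus server is up (port 9090 by default), any PromQL expression can be evaluated through its HTTP API. Here's a small sketch using curl; the metric name is the device temperature metric this post works with later:

# instant query: current value of every device_temperature_celsius series
curl 'http://localhost:9090/api/v1/query?query=device_temperature_celsius'

# the same metric filtered and compared against a threshold, alert-rule style
curl 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=device_temperature_celsius{device="laser/101"} >= 30'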


Prometheus Alert Rules

Prometheus lets us define alert rules, based on PromQL expressions, in dedicated .yml files and load them dynamically via the rule_files section of the configuration file. Wildcard paths are supported, so multiple rule files can be loaded at once.

# prometheus.yml
# my global config
global:
  scrape_interval: 60s # Set the scrape interval to every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).


# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - '/etc/prometheus/rules/*.yml'

# A scrape configuration containing exactly one endpoint to scrape:
scrape_configs:


With the above configuration, rules are evaluated every 15 seconds. Let's move on to the rule files:

groups:
  - name: b2ac9e71-589e-494c-af38-81760d4eeab9_..._rules
    rules:
      - alert: temp_high_warm_laser_101 # alert names must be valid metric names, so no "/"
        expr: |
          device_temperature_celsius{device="laser/101"} >= 30 and
          device_temperature_celsius{device="laser/101"} < 80
        for: 1m
        labels:
          severity: temp_high_warm
          deviceId: test-device
          type: temperature
          operatingRange: 25-30
          value: "{{ $value }}"
        annotations:
          summary: Device temperature is between 30°C and 80°C
          description: "Device temperature is between 30°C and 80°C"


Let's break down each field one by one!

  • groups: A collection of rules. Groups help organize rules by purpose or by the services they monitor.
  • rules: The individual alerting or recording rules within a group. Each rule specifies the conditions under which an alert fires or a metric is recorded.
  • alert: The name of the alert.
  • expr: The PromQL expression that defines the condition triggering the alert.
  • for: How long the expr condition must stay true before the alert fires to Alertmanager.
  • labels: Key-value pairs attached to the alert, used to categorize alerts or add metadata such as severity level, alert type, and the metric value at the moment of firing.
  • annotations: Descriptive information attached to the alert, such as a summary and description, that can carry additional details or instructions.
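
Besides alerting rules, the same groups can also hold recording rules, which precompute a PromQL expression into a new series on every evaluation. A minimal sketch; the rule name below is illustrative, not from this post's setup:

groups:
  - name: device_temperature_recording_rules
    rules:
      # precompute the 5-minute average temperature per device
      - record: device:temperature_celsius:avg_5m
        expr: avg_over_time(device_temperature_celsius[5m])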

Once the rule files are written, place them in the directory specified in prometheus.yml, then start the Prometheus server using Docker Compose:

# docker-compose.yml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ~/prometheus/rules:/etc/prometheus/rules
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-remote-write-receiver'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.retention.size=4GB'
    ports:
      - 9090:9090
    networks:
      - prom-net
    restart: always

Now visit the Prometheus dashboard on port 9090; if the rules were loaded correctly, they should appear under the Alerts tab.
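
If they don't appear, the rule files can be validated with promtool, which ships inside the Prometheus image (the file name below is a placeholder for your own rule file):

# validate rule syntax against the path mounted in docker-compose.yml
docker compose exec prometheus promtool check rules /etc/prometheus/rules/my_rules.yml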

Remember that whenever a rule is added or modified, we have to actively reload the Prometheus server. The configuration reload is triggered either by sending a SIGHUP to the Prometheus process or by sending an HTTP POST request to the /-/reload endpoint (which is why the --web.enable-lifecycle flag is enabled).
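
Concretely, either of the following triggers the reload, assuming the service is named prometheus as in the compose file above:

# reload via the lifecycle endpoint (enabled by --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# or send SIGHUP to the Prometheus process inside the container
docker compose kill --signal=SIGHUP prometheus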



Alertmanager & Pushgateway

After successfully configuring our alert rule, it's time to trigger an actual alert from it. That's where Prometheus Alertmanager and Pushgateway come in.

Alertmanager:

Alertmanager handles the next steps for alerts: deduplicating, grouping, and routing them to the correct receivers, such as email, Slack, or webhooks.

Pushgateway:

Pushgateway offers a way to expose Prometheus metrics from batch or ephemeral jobs that cannot be scraped. It acts as an intermediary: those jobs push their metrics to the Pushgateway, which Prometheus then scrapes.
It also lets us push custom metrics by curling its endpoint, which is useful for testing every kind of threshold scenario before Prometheus can scrape the real data.


Configuration & Test Steps

Step 1: Configure alertmanager.yml & prometheus.yml

# alertmanager.yml
global:
  resolve_timeout: 5m # after this long without updates, an alert is declared resolved

route: # defines the routing logic for each alert
  group_by: ['alertname', 'instance']
  group_wait: 10s # how long to wait before sending the first notification for a new group of alerts
  group_interval: 10s # how long to wait before notifying about new alerts added to an already-notified group
  repeat_interval: 1m # how long to wait before re-sending notifications for a group that is still firing
  receiver: 'webhook-receiver' # default receiver, defined in the receivers section below
  routes:
    - match: # route alerts carrying this label to the receiver below
        severity: 'disk_low_crit'
      receiver: 'webhook-receiver'

receivers:
  - name: 'webhook-receiver'
    webhook_configs: # the webhook server endpoint
      - url: 'http://10.0.0.1:3000'
        send_resolved: true

# prometheus.yml: add the Alertmanager configuration
# my global config
global:
  scrape_interval: 60s # Set the scrape interval to every 1 minute.
  evaluation_interval: 60s # Evaluate rules every 1 minute (also the default).
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - '/etc/prometheus/rules/*.yml'

# A scrape configuration: scrape Prometheus itself, plus the Pushgateway
# so that the metrics we push actually reach Prometheus.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'pushgateway'
    honor_labels: true # keep the job/instance labels supplied by the pushing client
    static_configs:
      - targets: ['pushgateway:9091']

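Both files can be sanity-checked before the containers are (re)started; promtool and amtool ship with the Prometheus and Alertmanager images, and the commands below assume the binaries are also available on the host:

# validate the Prometheus config, including the alerting section and rule files
promtool check config prometheus.yml

# validate the Alertmanager config
amtool check-config alertmanager.yml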

Step 2: Bring up both instances with Docker Compose, and recreate the Prometheus instance so it picks up the new configuration

# docker-compose.yml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ~/prometheus/rules:/etc/prometheus/rules
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-remote-write-receiver'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.retention.size=4GB'
    ports:
      - 9090:9090
    networks:
      - prom-net
    restart: always

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - '9093:9093'
    networks:
      - prom-net

  pushgateway:
    image: prom/pushgateway
    ports:
      - "9091:9091"
    networks:
      - prom-net
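
Optionally, the routing can be smoke-tested before any metrics exist by POSTing a synthetic alert straight to Alertmanager's v2 API; the label values here are illustrative:

# fire a synthetic alert directly at Alertmanager to verify webhook delivery
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "synthetic_test", "severity": "disk_low_crit"}}]'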

Step 3: Push a custom metric to the Pushgateway by running the command below; here we're pushing a temperature metric with value 75, which lands inside the 30-80°C window of the alert rule we defined

echo "device_temperature_celsius{device=\"laser/101\", SN=\"18400138\", deviceId=\"3764b164-01e7-4940-b9e9-9cf26604534a\", instance=\"localhost:59900\", job=\"sysctl\"} 75" | curl --data-binary @- http://localhost:9091/metrics/job/temperature_monitoring

Step 4: Check the Pushgateway dashboard on port 9091 to confirm the metric was pushed

localhost:9091

Step 5: Check that the alert turned active on the Prometheus dashboard (9090). The pushed value meets the rule's threshold, so the alert goes pending and, once the condition has held for the for: 1m duration, starts firing

localhost:9090

Step 6: Check the Alertmanager dashboard (9093) to make sure it received the alert

localhost:9093
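
The same check also works from the command line through Alertmanager's API:

# list the alerts Alertmanager currently holds
curl http://localhost:9093/api/v2/alerts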


References:

Prometheus documentation
Prometheus GitBook

Thanks for reading!!!
