Proactive Kubernetes Monitoring with Alerting
mikeyGlitz
Posted on June 15, 2022
Last year I wrote about the importance of service monitoring and observability.
While service monitoring is important, it requires active involvement from application maintainers. When supporting applications and services that serve real users, it's important to become aware of problems as soon as they happen.
The longer your service is down, experiencing errors or performance issues, or otherwise buggy, the more negative an experience you provide for your users.
Alerting provides application developers a passive mechanism for becoming aware of problems involving their applications.
Instead of being informed by users that a problem is occurring in an application, developers receive automated alarms which inform them a problem is occurring.
Kubernetes Alerting
Alerting isn't natively provided in Kubernetes. Fortunately, there are a few open source applications which can satisfy the alerting needs. As mentioned in my previous post, Linkerd provides many monitoring tools out of the box, but some customizations need to be made in order to provide alerting.
- Prometheus collects metrics (memory usage, CPU usage, network usage, etc.) and evaluates the alerting rules written against them.
- Alertmanager receives the alerts that Prometheus fires, then groups them and routes them to receivers such as email.
- Grafana is a visualization application which displays graphs and charts based on metrics data. Grafana also provides alerting.
- Prometheus Operator is a project which expresses Prometheus services, rules, and alarms as Kubernetes custom resources, defined through Custom Resource Definitions (CRDs). It is a Kubernetes-idiomatic way of declaring Prometheus services.
Alerting Channels
Before going through the process of setting up alarms, it is important to decide how you want to be notified.
For this tutorial, I'll be setting up email alerts.
Simple Mail Transfer Protocol (SMTP) is the protocol which allows applications, in this case our monitoring services, to send emails to the people supporting the applications. By default, most Internet Service Providers (ISPs), email providers, and cloud providers block outbound traffic on port 25, the standard SMTP port.
Setting up an SMTP Relay
An SMTP relay is a service which can be used as a proxy for SMTP traffic when you want to send emails outside your Local Area Network (LAN). I researched a few SMTP relays and ultimately decided on Dynu.
Dynu won me over by providing easy-to-use documentation on how to set up their SMTP relay service. At $9/year, the service is also affordable.
After opening an account with Dynu, my next step was to deploy the SMTP service onto Kubernetes.
Since I started working with Kubernetes, I've discovered the convenience of working with Helm charts. Artifact Hub is now my go-to for finding quick and easy Helm charts to install onto Kubernetes.
I decided to deploy SMTP using the bokysan postfix chart.
I started out with these initial values.yml for the Helm chart:
config:
  general:
    TZ: America/New_York
    LOG_FORMAT: json
    RELAYHOST: "{{ relay_host }}"
    ALLOWED_SENDER_DOMAINS: "cluster.local {{ domain_name }}"
secret:
  RELAYHOST_USERNAME: "{{ relay_username }}"
  RELAYHOST_PASSWORD: "{{ relay_password }}"
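I installed the Helm release using Ansible, but you can install it with the following helm commands. The release is named sender and lives in the mail-sender namespace, which is what produces the sender-mail.mail-sender host name used later in this post:
helm repo add docker-postfix https://bokysan.github.io/docker-postfix/
helm install --values values.yml --namespace=mail-sender --create-namespace sender docker-postfix/mail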
Forking the Postfix Base Image
Prometheus Alertmanager is written in Go. A limitation of the SMTP integration in Alertmanager is that TLS is required for remote connections.
I attempted to use the certs.create value from the Helm chart values, but the resulting certificate was not accepted by Alertmanager.
As a result, I had to create a new image from bokysan's Postfix image.
The new image checks whether the following files are mounted in /mnt/certs:
- tls.crt
- tls.key
- ca.crt
If the files are present, they are copied into /etc/ssl and the Postfix configuration at /etc/postfix/main.cf is updated to enable the SSL settings.
This setup is useful if you use cert-manager to issue certificates as Kubernetes Secrets.
⚠️ When using the Postfix image, please note that the signing algorithm for your certificates needs to be RSA
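As a rough sketch of how such a certificate could be requested from cert-manager, the manifest below asks for an RSA key and writes the result to the mail-tls Secret that the chart values mount. The issuer name mail-issuer and the DNS names are assumptions made for illustration; substitute the issuer and service names from your own cluster.

```yaml
# Hypothetical example: requests an RSA certificate for the Postfix relay.
# cert-manager writes tls.crt, tls.key, and ca.crt into the target Secret.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: mail-tls
  namespace: mail-sender        # assumed namespace of the Postfix release
spec:
  secretName: mail-tls          # matches the secretName mounted at /mnt/certs
  privateKey:
    algorithm: RSA              # the Postfix image expects an RSA key
    size: 2048
  dnsNames:
    - sender-mail.mail-sender   # assumed service DNS name
    - sender-mail.mail-sender.svc.cluster.local
  issuerRef:
    name: mail-issuer           # hypothetical Issuer, not from the original post
    kind: Issuer
```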
I'm still able to use Bokysan's Helm chart with the following values:
image:
  repository: mikeyglitz/postfix
  tag: latest
extraVolumes:
  - name: tls-cert
    secret:
      secretName: mail-tls
extraVolumeMounts:
  - name: tls-cert
    mountPath: /mnt/certs
    readOnly: true
config:
  general:
    TZ: America/New_York
    LOG_FORMAT: json
    RELAYHOST: "{{ relay_host }}"
    ALLOWED_SENDER_DOMAINS: "cluster.local {{ domain_name }}"
secret:
  RELAYHOST_USERNAME: "{{ relay_username }}"
  RELAYHOST_PASSWORD: "{{ relay_password }}"
Installing Prometheus Operator
Prometheus Operator is a suite of applications leveraging the Operator Pattern for managing Prometheus applications in a way that is idiomatic for Kubernetes.
Configurations, Prometheus instances, and Alertmanager instances are managed using Custom Resource Definitions.
Before we can create the Prometheus stack, we need to create a namespace where the Prometheus resources will live:
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  annotations:
    linkerd.io/inject: enabled
Installation of the Prometheus Operator is handled using the Prometheus Operator Helm Chart.
The Prometheus Operator Helm Chart installs the entire Prometheus Stack:
- Prometheus
- Alertmanager
- Grafana
- Prometheus-Operator
- Prometheus NodeExporter
The values.yml for the chart is quite extensive and will take up a lot of space in the subsequent sections. I've broken down the values that go into the Helm chart so that they are easier to follow.
With Linkerd, additional customizations need to be made to ensure that the Linkerd dashboards work by installing the correct Prometheus rules. This can be accomplished with the values.yml sections that follow.
Prometheus Operator Values
Since I'm using Linkerd as my service mesh and have annotated the monitoring namespace, where I will be installing the Helm chart, with linkerd.io/inject, I need to set the Prometheus Operator values so that the admission webhook patch job is excluded from Linkerd injection.
# Configure Prometheus -- we need to skip Linkerd injection or
# the operator will not install
prometheusOperator:
  admissionWebhooks:
    patch:
      podAnnotations:
        linkerd.io/inject: disabled
    certManager:
      enabled: "true"
      issuerRef:
        name: monitoring-issuer
        kind: Issuer
This configuration also uses cert-manager to inject certificates into the Prometheus Operator admission webhook. The certificates issued by the monitoring-issuer are signed by our root CA.
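For readers who have not created such an issuer yet, a minimal sketch of a CA-backed cert-manager Issuer might look like the following. The Secret name root-ca-key-pair is an assumption; it stands for a Secret in the monitoring namespace that holds the root CA's tls.crt and tls.key.

```yaml
# Hypothetical example: a namespaced Issuer that signs with an existing CA.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: monitoring-issuer       # name referenced by issuerRef in the values
  namespace: monitoring
spec:
  ca:
    secretName: root-ca-key-pair  # assumed Secret containing the CA key pair
```

A namespaced Issuer can only issue certificates inside its own namespace, which is why it sits alongside the Prometheus stack in monitoring.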
Prometheus Values
The default Prometheus configuration must be modified to include custom scrape jobs so that Prometheus collects the metrics that Linkerd exposes. Without this configuration, the Linkerd dashboard will not function properly.
prometheus:
  prometheusSpec:
    evaluationInterval: 10s
    scrapeInterval: 10s
    scrapeTimeout: 10s
    resources:
      requests:
        memory: 4Gi
    additionalScrapeConfigs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'grafana'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: ['monitoring']
        relabel_configs:
          - source_labels:
              - __meta_kubernetes_pod_container_name
            action: keep
            regex: ^grafana$
      # Required for: https://grafana.com/grafana/dashboards/315
      - job_name: 'kubernetes-nodes-cadvisor'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
        metric_relabel_configs:
          - source_labels: [__name__]
            regex: '(container|machine)_(cpu|memory|network|fs)_(.+)'
            action: keep
          - source_labels: [__name__]
            regex: 'container_memory_failures_total' # unneeded large metric
            action: drop
      - job_name: 'linkerd-controller'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - 'linkerd'
                - 'monitoring'
        relabel_configs:
          - source_labels:
              - __meta_kubernetes_pod_container_port_name
            action: keep
            regex: admin-http
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: component
      - job_name: 'linkerd-service-mirror'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels:
              - __meta_kubernetes_pod_label_linkerd_io_control_plane_component
              - __meta_kubernetes_pod_container_port_name
            action: keep
            regex: linkerd-service-mirror;admin-http$
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: component
      - job_name: 'linkerd-proxy'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels:
              - __meta_kubernetes_pod_container_name
              - __meta_kubernetes_pod_container_port_name
              - __meta_kubernetes_pod_label_linkerd_io_control_plane_ns
            action: keep
            regex: ^linkerd-proxy;linkerd-admin;linkerd$
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          # special case k8s' "job" label, to not interfere with prometheus' "job"
          # label
          # __meta_kubernetes_pod_label_linkerd_io_proxy_job=foo =>
          # k8s_job=foo
          - source_labels: [__meta_kubernetes_pod_label_linkerd_io_proxy_job]
            action: replace
            target_label: k8s_job
          # drop __meta_kubernetes_pod_label_linkerd_io_proxy_job
          - action: labeldrop
            regex: __meta_kubernetes_pod_label_linkerd_io_proxy_job
          # __meta_kubernetes_pod_label_linkerd_io_proxy_deployment=foo =>
          # deployment=foo
          - action: labelmap
            regex: __meta_kubernetes_pod_label_linkerd_io_proxy_(.+)
          # drop all labels that we just made copies of in the previous labelmap
          - action: labeldrop
            regex: __meta_kubernetes_pod_label_linkerd_io_proxy_(.+)
          # __meta_kubernetes_pod_label_linkerd_io_foo=bar =>
          # foo=bar
          - action: labelmap
            regex: __meta_kubernetes_pod_label_linkerd_io_(.+)
          # Copy all pod labels to tmp labels
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
            replacement: __tmp_pod_label_$1
          # Take `linkerd_io_` prefixed labels and copy them without the prefix
          - action: labelmap
            regex: __tmp_pod_label_linkerd_io_(.+)
            replacement: __tmp_pod_label_$1
          # Drop the `linkerd_io_` originals
          - action: labeldrop
            regex: __tmp_pod_label_linkerd_io_(.+)
          # Copy tmp labels into real labels
          - action: labelmap
            regex: __tmp_pod_label_(.+)
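Since Prometheus Operator models scrape targets as custom resources, a simple job such as linkerd-controller could in principle also be written as a PodMonitor instead of a raw scrape config. The sketch below is not what I deployed, and it does not reproduce the component relabeling. It assumes the control plane pods carry the linkerd.io/control-plane-component label and that the chart's default selectors pick up resources labeled with the Helm release name (metrics here).

```yaml
# Hypothetical alternative to the 'linkerd-controller' scrape job.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: linkerd-controller
  namespace: monitoring
  labels:
    release: metrics            # assumed to match the chart's default selector
spec:
  namespaceSelector:
    matchNames:
      - linkerd
  selector:
    matchExpressions:
      # Select any pod that has the control plane component label
      - key: linkerd.io/control-plane-component
        operator: Exists
  podMetricsEndpoints:
    - port: admin-http          # same port name the raw job keeps
```

The linkerd-proxy job relies on a long relabeling chain, so keeping it in additionalScrapeConfigs as shown above is the lower-risk choice.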
Grafana Options
ℹ️ At the time of writing, using an external instance of Grafana is only supported by the upcoming Linkerd 2.12 release. Linkerd 2.12 can only be found on the edge channel, not stable. Subsequent sections cover how to adjust the Linkerd install to support an external Grafana instance in more detail.
Grafana needs to be updated to pre-install the Linkerd dashboards.
As per the Monitoring and Observability post, Loki also needs to be added as a data source for Grafana.
To support Grafana's alerting functionality, Alertmanager can be added as a data source so that Alertmanager rules appear in Grafana.
ℹ️ At the time of writing, the Alertmanager data source is only available as an alpha Grafana plugin. Alpha plugins can only be enabled in the Grafana configuration via the plugins.enable_alpha option in grafana.ini.
# Grafana options -- pre-install Linkerd dashboards
# and configure datasources
grafana:
  grafana.ini:
    server:
      root_url: '%(protocol)s://%(domain)s:/grafana/'
    auth:
      disable_login_form: false
    auth.anonymous:
      enabled: true
      org_role: Editor
    auth.basic:
      enabled: true
    analytics:
      check_for_updates: false
    panels:
      disable_sanitize_html: true
    log:
      mode: console
    log.console:
      format: text
      level: info
    plugins:
      enable_alpha: true
    smtp:
      enabled: true
      from_address: grafana@<domain>
      host: sender-mail.mail-sender:587
      skip_verify: true
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      # Logging dashboard - https://grafana.com/grafana/dashboards/7752
      logging:
        gnetId: 7752
        revision: 5
        datasource: prometheus
      # all these charts are hosted at https://grafana.com/grafana/dashboards/{id}
      top-line:
        gnetId: 15474
        revision: 3
        datasource: prometheus
      health:
        gnetId: 15486
        revision: 2
        datasource: prometheus
      kubernetes:
        gnetId: 15479
        revision: 2
        datasource: prometheus
      namespace:
        gnetId: 15478
        revision: 2
        datasource: prometheus
      deployment:
        gnetId: 15475
        revision: 5
        datasource: prometheus
      pod:
        gnetId: 15477
        revision: 2
        datasource: prometheus
      service:
        gnetId: 15480
        revision: 2
        datasource: prometheus
      route:
        gnetId: 15481
        revision: 2
        datasource: prometheus
      authority:
        gnetId: 15482
        revision: 2
        datasource: prometheus
      cronjob:
        gnetId: 15483
        revision: 2
        datasource: prometheus
      job:
        gnetId: 15487
        revision: 2
        datasource: prometheus
      daemonset:
        gnetId: 15484
        revision: 2
        datasource: prometheus
      replicaset:
        gnetId: 15491
        revision: 2
        datasource: prometheus
      statefulset:
        gnetId: 15493
        revision: 2
        datasource: prometheus
      replicationcontroller:
        gnetId: 15492
        revision: 2
        datasource: prometheus
      prometheus:
        gnetId: 15489
        revision: 2
        datasource: prometheus
      prometheus-benchmark:
        gnetId: 15490
        revision: 2
        datasource: prometheus
      multicluster:
        gnetId: 15488
        revision: 2
        datasource: prometheus
  additionalDataSources:
    - name: alertmanager
      type: alertmanager
      url: http://metrics-kube-prometheus-st-alertmanager:9093
      access: proxy
      orgId: 1
      jsonData:
        implementation: prometheus
    - name: loki
      type: loki
      access: proxy
      default: false
      editable: true
      url: http://loki:3100
      maximumLines: "300"
      orgId: 1
      jsonData:
        manageAlerts: true
        alertmanagerUid: alertmanager
Configuring Alertmanager
The final piece of the puzzle is configuring Alertmanager.
Alertmanager manages Prometheus alerts and is responsible for forwarding messages to the various alert receivers. The majority of the values below were taken from the default configuration that the Helm chart creates for Alertmanager.
The notable difference is the root route. The root route cannot have matchers, so it matches every alert. If we attached the email receiver to the root route, we would be notified constantly, and the steady stream of notifications could cause spam filters to block our Alertmanager emails.
Under the root route, we create a nested routes field. The routes field is set up so that we receive an email only for alerts that satisfy its matchers criteria. In this case, an email is sent whenever an alert with a severity of critical fires.
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 15s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
      routes:
        - group_by: ['alertname']
          group_wait: 15s
          group_interval: 10s
          repeat_interval: 12h
          matchers:
            - severity="critical"
          receiver: email
    receivers:
      - name: 'null'
      - name: email
        email_configs:
          - to: <recipient@mail.com>
            from: alertmanager@haus.net
            smarthost: sender-mail.mail-sender:587
            require_tls: true
            tls_config:
              insecure_skip_verify: true
    templates:
      - '/etc/alertmanager/config/*.tmpl'
  alertmanagerSpec:
    externalUrl: https://monitoring.haus.net/alarms
    logFormat: json
    alertmanagerConfigNamespaceSelector:
      matchLabels:
        alertmanagerconfig: enabled
    alertmanagerConfigSelector:
      matchLabels:
        role: alertmanager-config
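The two selectors at the end of these values tell the operator which AlertmanagerConfig resources to merge into the running configuration: the resource must carry the role: alertmanager-config label and live in a namespace labeled alertmanagerconfig: enabled. As a hedged sketch, a per-namespace email route could look like the following. The namespace name, resource name, and recipient address are placeholders I made up for illustration.

```yaml
# Hypothetical example: a namespace opted in to AlertmanagerConfig discovery.
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    alertmanagerconfig: enabled     # matches alertmanagerConfigNamespaceSelector
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: critical-email              # hypothetical name
  namespace: my-namespace
  labels:
    role: alertmanager-config       # matches alertmanagerConfigSelector
spec:
  route:
    receiver: email
    groupBy: ['alertname']
    matchers:
      - name: severity
        value: critical
  receivers:
    - name: email
      emailConfigs:
        - to: recipient@example.com # placeholder address
          from: alertmanager@haus.net
          smarthost: sender-mail.mail-sender:587
          requireTLS: true
          tlsConfig:
            insecureSkipVerify: true
```

By default the operator scopes routes defined this way to alerts that originate from the resource's own namespace, so this complements the cluster-wide route in the Helm values without replacing it.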
Installing the chart
The chart may be installed with the following command:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install --values values.yml --namespace monitoring metrics prometheus-community/kube-prometheus-stack
Configuring Linkerd
In my previous post, I leveraged kustomize to install Grafana with an additional data source.
This time around I'll be installing the linkerd-viz Helm chart.
First, I add the edge Helm repository. The current stable release of Linkerd does not support bringing your own Grafana instance to Linkerd.
source: https://linkerd.io/2.11/tasks/grafana/
These notes apply only to recent Linkerd Edge releases and the upcoming Linkerd 2.12 stable release, which have stripped off the embedded Grafana instance, recommending users to install it separately as explained below.
helm repo add linkerd-edge https://helm.linkerd.io/edge
Based on the values.yml from the Helm chart, I can update the Grafana settings with the grafanaUrl parameter. I also set up Jaeger so that I can visualize application tracing.
values.yml
jaegerUrl: jaeger.linkerd-jaeger:16686
prometheusUrl: http://metrics-kube-prometheus-st-prometheus.monitoring:9090
grafanaUrl: metrics-grafana.monitoring:80
# Since we're bringing our own Prometheus and Grafana instances,
# we have to disable the embedded Prometheus and Grafana instances
prometheus:
  enabled: false
grafana:
  enabled: false
Install the linkerd-viz helm chart with the following command:
helm install --values values.yml --namespace linkerd-viz --create-namespace linkerd-viz linkerd-edge/linkerd-viz
Once the install completes, you will be able to access Grafana through Linkerd.
Alerts and Monitoring with Logging Operator
I use the Banzaicloud Logging Operator to provide idiomatic configurable logging from applications in my Kubernetes cluster to Grafana Loki. Logging Operator can provide Service Monitors and alarms for Logging instances with the following configuration:
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: my-logger
  namespace: my-namespace
spec:
  fluentd:
    metrics:
      serviceMonitor: true
      prometheusRules: true
  fluentbit:
    metrics:
      serviceMonitor: true
      prometheusRules: true
The serviceMonitor flag sets up a Prometheus ServiceMonitor.
The prometheusRules flag sets up the Prometheus rules for alerting on certain thresholds.
The default alerting rules trigger alerts when:
- Prometheus cannot access the Fluentd node
- Fluentd buffers are quickly filling up
- Traffic to Fluentd is increasing at a high rate
- The number of Fluent Bit or Fluentd errors or retries is high
- Fluentd buffers are over 90% full
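Rules like these are ordinary PrometheusRule resources, so you can add your own in the same way. The following is a minimal sketch of a hand-written rule, not one of the rules the Logging Operator generates. It assumes the scrape targets live in my-namespace and that the chart's default rule selector matches the release: metrics label. The severity: critical label is what sends the alert down the email route configured earlier.

```yaml
# Hypothetical example: fire a critical alert when a scrape target is down.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: logging-target-down       # hypothetical name
  namespace: monitoring
  labels:
    release: metrics              # assumed to match the chart's rule selector
spec:
  groups:
    - name: logging.rules
      rules:
        - alert: LoggingTargetDown
          # `up` is 0 whenever Prometheus fails to scrape a target
          expr: up{namespace="my-namespace"} == 0
          for: 5m
          labels:
            severity: critical    # matched by the Alertmanager email route
          annotations:
            summary: "Prometheus cannot scrape {{ $labels.pod }}"
```

The for: 5m clause keeps a single failed scrape from paging anyone; the alert only fires once the target has been down for five minutes.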
Next Steps
Once everything has been set up, we should have a foundation for receiving alarms from our Kubernetes cluster.
From this point, we could enhance the alarms by creating additional rules, such as Fluentd rules with the Logging Operator.
Perhaps you would like additional Alertmanager configurations to send alarms to messaging integrations such as Slack.
In my next post, I'll demonstrate how to set up an Alertmanager configuration to work with Discord/Guilded webhooks.