Kubernetes Service Monitoring and Observability

mikeyglitz

mikeyGlitz

Posted on January 3, 2021

A common problem that I've run across throughout my career is supporting applications which have gone to production. When supporting a service in production, it is important to be able to identify where things went wrong and how to fix them before customers/end users are impacted.
Application logs are usually the first place I check when I'm notified of a production issue. Seems simple enough, right? Open a file, look for the words "exception" or "error", and backtrack from there. The only problem is that this approach to troubleshooting becomes unsustainable in today's environments, where a service can be composed of decentralized, distributed, constituent applications.
How would you go about monitoring multiple services (e.g. a microservice architecture) without adding cognitive load and complicating the troubleshooting process?

Monitoring and Kubernetes

Although there are multiple ways of performing monitoring, depending on which platform you're using and what tools are available for it (e.g. GCP, Azure, AWS), I'm going to focus on how to perform monitoring on a Kubernetes cluster.
Fortunately, when working in the context of a Kubernetes cluster, there are a fair number of open source projects which can aid in providing observability and monitoring for services.
For ease of use, I'll be using Linkerd. Linkerd is a service mesh and is fairly easy to set up. Additionally, with Linkerd, you get Prometheus and Grafana included for FREE!

What is a Service Mesh?

In the context of a Kubernetes cluster, a service mesh is a collection of applications, services, and custom resources which provide observability, scalability, and resiliency for applications in your cluster. Linkerd specifically uses a sidecar container called Linkerd Proxy, which proxies calls to and from your services and exposes metrics about that traffic. These metrics are collected by Prometheus, a tool for storing and querying metrics and for defining alerts. Grafana then queries Prometheus and visualizes the metrics as charts and graphs, which makes them easier to digest.
Additionally, Grafana Labs maintains Loki, a log aggregation system which plugs into Grafana as a data source, allowing you to search through logs with relative ease.

The flow for application metrics will resemble the following:
[Figure: Metrics Flow]

Additionally, an application log flow will look like the following:
[Figure: Application Logging Flow]

Setting up Linkerd

Assuming you have your Kubernetes cluster all set up, setting up Linkerd will be relatively straightforward. Linkerd provides a command-line interface (CLI) tool which makes managing Linkerd pretty easy.

Install the CLI and deploy Linkerd to your cluster with the following commands:

curl -sL https://run.linkerd.io/install | sh
# Add Linkerd to path
export PATH=$PATH:$HOME/.linkerd2/bin
# Deploy Linkerd to your cluster
linkerd install | kubectl apply -f -

From here, you'll be able to access Grafana from the Linkerd dashboard:

# Open a browser window to the linkerd dashboard
linkerd dashboard
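
If the dashboard doesn't come up, it can help to confirm that the control plane is healthy first. The sketch below wraps `linkerd check`, which ships with the CLI. The `verify_linkerd` function name is my own, and it assumes `kubectl` is already pointed at the target cluster.

```shell
#!/bin/sh
# verify_linkerd: hypothetical helper (not part of Linkerd) that runs the
# CLI's built-in health checks and reports a missing CLI clearly.
verify_linkerd() {
  if ! command -v linkerd >/dev/null 2>&1; then
    echo "linkerd CLI not found on PATH" >&2
    return 1
  fi
  # `linkerd check` exits non-zero when a control-plane check fails
  linkerd check
}

# Usage: verify_linkerd && linkerd dashboard
```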

Kubernetes resources can be added to the Linkerd mesh by annotating either a Namespace or the pod template of a Deployment:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    linkerd.io/inject: enabled
---
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
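
For a namespace that already exists, the same annotation can be applied from the command line instead of editing manifests. This is a minimal sketch: the function names are hypothetical, and the `files` namespace and `owncloud` Deployment in the usage comment are assumed names borrowed from the Flow example later in this article.

```shell
#!/bin/sh
# mesh_namespace: hypothetical helper that opts an existing namespace into
# proxy injection by setting the linkerd.io/inject annotation.
mesh_namespace() {
  ns="$1"
  kubectl annotate namespace "$ns" linkerd.io/inject=enabled --overwrite
}

# restart_workload: pods only receive the sidecar when they are (re)created,
# so roll the Deployment after annotating its namespace.
restart_workload() {
  ns="$1"
  deployment="$2"
  kubectl -n "$ns" rollout restart "deployment/$deployment"
}

# Usage (assumed names, adjust to your cluster):
#   mesh_namespace files
#   restart_workload files owncloud
```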

Logging Operator

Logging Operator is a project created by Banzai Cloud. It uses Fluentd and Fluent Bit to discover logs across the cluster and route them to a destination.

We begin by deploying the logging operator onto our Kubernetes cluster. Fortunately, there's a Helm chart which makes the deployment easier.

# Create a logging namespace
apiVersion: v1
kind: Namespace
metadata:
  name: logging

With the namespace in place, add the Helm repository and install the chart:

# Add the helm repo
helm repo add banzaicloud-stable https://kubernetes-charts.banzaicloud.com
# Install the helm chart
helm upgrade --install --wait --create-namespace --namespace logging logging-operator banzaicloud-stable/logging-operator \
  --set createCustomResource=false

Deploying Loki

Now that Logging Operator is in place to handle log discovery across our various sources, it's time to set up Loki and Grafana.
The Grafana community maintains a Helm chart, loki-stack, which we can deploy using Terraform:

resource "helm_release" "rel_logging_loki" {
  repository = "https://grafana.github.io/helm-charts"
  chart = "loki-stack"
  name = "loki"
  namespace = "logging"

  set {
    name = "promtail.enabled"
    value = "true"
  }
  set {
    name = "loki.enabled"
    value = "true"
  }
}
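
If you aren't using Terraform, roughly the same release can be created with the Helm CLI. This is a sketch under a few assumptions: `grafana` is just a local alias I chose for the chart repository, `install_loki_stack` is a hypothetical function name, and `promtail.enabled` and `loki.enabled` are the chart values the Terraform resource sets.

```shell
#!/bin/sh
# install_loki_stack: hypothetical helper mirroring the Terraform release
# with the Helm CLI. "grafana" is only a local alias for the chart repo.
install_loki_stack() {
  helm repo add grafana https://grafana.github.io/helm-charts &&
  helm repo update &&
  helm upgrade --install loki grafana/loki-stack \
    --namespace logging \
    --set promtail.enabled=true \
    --set loki.enabled=true
}

# Usage: install_loki_stack
```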

Linkerd will need to be updated so that Grafana talks to Loki.
To perform the setup, we'll use kustomize, which is built into kubectl, to patch the new Grafana configuration into Linkerd's Grafana instance.

grafana.yml

kind: ConfigMap
apiVersion: v1
metadata:
  name: linkerd-grafana-config
data:
  datasources.yaml: |-
    apiVersion: 1
    datasources:
    - name: prometheus
      type: prometheus
      access: proxy
      orgId: 1
      url: http://linkerd-prometheus.linkerd.svc.cluster.local:9090
      isDefault: false
      jsonData:
        timeInterval: "5s"
      version: 1
      editable: true
    - name: Loki
      type: loki
      access: proxy
      editable: false
      isDefault: true
      url: http://loki.logging:3100
      jsonData:
        maxLines: 300

Set up a kustomization.yml:

resources:
- linkerd.yml
patchesStrategicMerge:
- grafana.yml

Now we can dump our current Linkerd config to linkerd.yml and run kubectl kustomize to patch the Grafana configuration from the previous step into Linkerd:

linkerd upgrade > linkerd.yml
kubectl kustomize | kubectl apply -f -
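
To confirm the patch landed, you can print the datasources file straight out of the ConfigMap and then restart Grafana so it re-reads its provisioning files. This is a sketch: both function names are hypothetical, and `linkerd-grafana` as the Deployment name is an assumption about how Linkerd names its bundled Grafana.

```shell
#!/bin/sh
# show_grafana_datasources: hypothetical helper that prints the patched
# datasources.yaml from the ConfigMap so you can confirm Loki is listed.
show_grafana_datasources() {
  # The dot in the key name must be escaped inside a jsonpath expression
  kubectl -n linkerd get configmap linkerd-grafana-config \
    -o 'jsonpath={.data.datasources\.yaml}'
}

# restart_grafana: restarts Grafana so it picks up the new datasource.
# The Deployment name is an assumption; check `kubectl -n linkerd get deploy`.
restart_grafana() {
  kubectl -n linkerd rollout restart deployment/linkerd-grafana
}

# Usage: show_grafana_datasources && restart_grafana
```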

Setting up Logging Operator to Stream to Loki

The last step we'll have to complete is setting up the logging operator to stream to Loki.
The logging operator defines the following custom resources, which are used to watch containers and transport logs to a target destination:

[Figure: Logging Operator Flow]

  • Logging - Defines the logging infrastructure (the Fluentd and Fluent Bit instances) which collects and forwards logs
  • Output - Specifies a destination for log outputs. These resources can also be established cluster-wide as a ClusterOutput.
  • Flow - Selects which logs to process, specifies the patterns used to parse log entries, and routes the results to Output resources. These resources can also be established cluster-wide as a ClusterFlow.

Specify a ClusterOutput to send logs to Loki:

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: loki-output
  namespace: logging
spec:
  loki:
    url: http://loki:3100
    configure_kubernetes_labels: true
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true
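
Before wiring up any flows, it's worth checking that Loki itself is reachable. The sketch below assumes you've forwarded the Loki service to your machine first; the service name `loki`, the `LOKI_URL` variable, and the function names are my assumptions, while `/ready` and `/loki/api/v1/labels` are endpoints of Loki's HTTP API.

```shell
#!/bin/sh
# Assumes Loki has been made reachable locally first, for example with:
#   kubectl -n logging port-forward svc/loki 3100:3100
# The service name "loki" is an assumption based on the Helm release name.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# loki_ready: hits Loki's readiness endpoint.
loki_ready() {
  curl -s "$LOKI_URL/ready"
}

# loki_labels: lists the label names Loki has indexed. Kubernetes labels
# should show up here once logs start flowing.
loki_labels() {
  curl -s "$LOKI_URL/loki/api/v1/labels"
}

# Usage: loki_ready; loki_labels
```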

For each Pod/Deployment you want to monitor, you're going to need to set up a Logging and a Flow:

# Set up the Logging object
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: files-logger
  namespace: files
spec:
  fluentd: {}
  fluentbit: {}
  controlNamespace: logging
---
# Set up the Flow object
apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: files-flow
  namespace: files
spec:
  globalOutputRefs:
  - loki-output
  filters:
    - tag_normaliser: {}
    - parser:
        remove_key_name_field: true
        reserve_data: true
        parse:
          type: multi_format
          patterns:
            - format: regexp
              expression: '/^(?<time>[^\]]*) \[(?<level>[^ ]*)\] (?<source>[^\":]*): (?<message>.*)$/'
              time_key: logtime
              time_format: '%Y-%m-%dT%H:%M:%S.%LZ'
            - format: regexp
              expression: '/^time="(?<time>[^\]]*)" level=(?<level>[^ ]*) msg="(?<message>[^\"]*)"/'
              time_key: time
              time_format: '%Y-%m-%dT%H:%M:%SZ'
            - format: regexp
              expression: '/^level=(?<level>[^ ]*) ts=(?<time>[^\]]*) caller=(?<source>.*) msg="(?<message>[^\"]*)"/'
              time_key: time
              time_format: '%Y-%m-%dT%H:%M:%S.%LZ'
            - format: regexp
              expression: '^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$'
              time_key: time
              time_format: '%d/%b/%Y:%H:%M:%S %z'
  match:
  - select:
      labels:
        app: owncloud

ℹ The Flow object in this example uses Fluentd regular expressions to parse log streams. These expressions can be evaluated and debugged using Fluentular.
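
For a quick local sanity check, a pattern can also be tried against a sample line with `sed`. This is only an approximation: Fluentd evaluates Ruby regular expressions, so the sketch below uses a hand-translated ERE version of the second pattern in the Flow, with numbered groups in place of named ones. The function name and the sample log line are made up for illustration.

```shell
#!/bin/sh
# parse_logfmt_line: hypothetical helper that applies an ERE translation of
# the Flow's second pattern (time="..." level=... msg="...") to one line.
# Named groups become numbered groups because sed -E has no named captures.
parse_logfmt_line() {
  printf '%s\n' "$1" |
    sed -E -n 's/^time="([^]]*)" level=([^ ]*) msg="([^"]*)".*$/time=\1 level=\2 message=\3/p'
}

# A made-up sample line; a line that doesn't match prints nothing.
parse_logfmt_line 'time="2021-01-03T12:00:00Z" level=info msg="server started"'
# prints: time=2021-01-03T12:00:00Z level=info message=server started
```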

The globalOutputRefs section links the Flow resource to one or more ClusterOutput resources by name. The ClusterOutput routes the logs to the indicated destination.

The match section indicates what Kubernetes resources the Flow resource will look for in order to obtain logs. In this example, the Flow resource will look for any pods with the label app: owncloud.

If everything is set up right, the logs will be able to be searched in Grafana:

[Figure: Grafana Loki]

What's Left?

So far I've covered how to set up Linkerd and stream logs to Grafana. An improvement to this setup would be deploying Prometheus Alertmanager to create and manage alerts, so that you can receive notifications through third-party services like Slack and PagerDuty.

Outtakes

This article was the result of roughly 3 months of experimenting in my home-lab. Instead of Grafana and Loki, I had originally attempted to perform log streaming using the Elastic Stack, EFK (Elasticsearch, FluentD, and Kibana).
I'm using Keycloak as an identity provider and had attempted an initial configuration using Kibana and Open-ID Connect (OIDC) based authentication; however, the OIDC plugin is only available on the Platinum Tier of Elastic. Disabling the xpack.security.enabled setting broke the Elastic applications.
Attempting to utilize a keycloak-kibana plugin also caused Kibana to fail to start.

With Loki and Grafana, I'm able to set up a Kubernetes Ingress with OAuth2 authentication as a way to secure access to my services.
