Monitoring and Hardening the GitOps Delivery Pipeline with Flux


Florian Heubeck

Posted on January 30, 2024


The ultimate goal of every GitOps setup is complete automation. To operate a system hands-off, its monitoring and alerting have to be reliable and comprehensive. In this blog post, you will learn how to monitor a FluxCD-operated GitOps setup on Kubernetes.

This article is the first of two accompanying my talk on this topic at the Mastering GitOps conference.


About automation

With GitOps we try to eliminate every manual task from infrastructure handling and application operation. (Virtual) infrastructure is described declaratively and rolled out by Terraform or similar tools. Application manifests are templated and bundled as Helm charts or Kustomizations. And when using GitOps operators like Flux, all of these are sourced from VCS repositories (mostly Git) and pulled directly or indirectly (e.g. via Helm repositories or container registries) into Kubernetes.

There's nothing left for us to do but declare and observe, is there?
Well, when everything around configuration and application delivery is automated, so that no one has to get their hands dirty, there's no reason to stop automating once Kubernetes resources have (potentially) been created.

In our early GitOps days back in 2019, using Flux 1, we spent a lot of time screening logs to find out why "nothing happened after the change". Prometheus was in place - then as now - but we were not happy with the default metrics it provided. So we built a Kubernetes-operator/Prometheus-exporter hybrid that attached to the log streams of selected workloads and derived metrics from patterns applied to the log text. That was quite fun to build, but absolutely not our core business.
Fortunately, the Flux stack nowadays has monitoring and alerting capabilities built in, exceeding the imagination of our former selves.

Back to topic: when we're on call, there should be no reason to do anything other than carry our mobile phones around. And during daily business, when providing new application versions or configuration changes, it must not be necessary to watch their rollout. Any issue has to be reported, or even mitigated, by the system proactively. Silence means "everything's fine", not "no idea what's happening".

Since with GitOps there's no classic delivery pipeline that can fail, but a continuous reconciliation of current and desired state, the main objectives of monitoring and hardening the setup are:

  • Validate desired state changes upfront
  • Detect and report any persistent deviation of current from desired state
  • Reduce the possible blast radius of problems

The first point refers to automated checking and linting of changes targeting the desired state. That's vital, but concrete implementations depend very much on the setup itself. On pull/merge requests, we use a combination of generic YAML linting, validation of Kubernetes manifests using checkov and Helm chart-testing, and, where possible, real installations on kind.
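
As a rough, hedged sketch of what such a pull-request check can look like with GitHub Actions - the paths and the charts directory are assumptions and need to be adapted to your repository layout:

# Hypothetical pull-request validation workflow (adapt paths to your repository)
name: validate-desired-state
on: pull_request

jobs:
  lint-and-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # generic YAML linting of all manifests
      - name: yamllint
        run: pip install yamllint && yamllint .

      # static validation of Kubernetes manifests
      - name: checkov
        run: pip install checkov && checkov -d . --framework kubernetes

      # basic Helm chart linting (chart-testing/ct or a kind installation can go further)
      - name: helm lint
        run: helm lint charts/*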

In the following, we'll focus on monitoring and alerting after configuration changes have been applied.

Flux resources and their status

Let's start with a brief overview of Flux's components, its custom resource definitions (CRDs) and their relations:

[Diagram: Flux components and their custom resources]

As is common for Kubernetes operators, they are configured via their respective custom resources (CRs). Flux - aka the "GitOps Toolkit" - is composed of multiple operators that even exchange configuration via CRs. Not all relationships are visualized in the diagram, as that's not required for its purpose.

One of the many things that I really like about Flux is its consistent and verbose reflection of errors in the Kubernetes resource status field. All operators write the results of their actions back to the respective CR.

A properly fetched Helm chart may look like:

[Screenshot: status of a successfully fetched HelmChart]

Whereas a failed Helm release shows details on its status:

[Screenshot: status of a failed HelmRelease]
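
Roughly sketched, such a failed status contains a Ready condition set to False together with the Helm error - the timestamp and message below are made up for illustration:

---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
status:
  conditions:
  - lastTransitionTime: "2024-01-30T09:15:00Z"
    message: 'Helm install failed: timed out waiting for the condition'
    reason: InstallFailed
    status: "False"
    type: Ready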

In addition to the built-in readiness of Flux's CRDs, additional health checks can be included in the main CR of a Flux setup, the Kustomization:



apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: operators
spec:
  # (...)
  healthChecks:
    - kind: Deployment
      name: istio-operator
      namespace: istio-system
    - kind: Deployment
      name: prometheus-operator-kube-p-operator
      namespace: monitoring



On reconciliation, a Kustomization becomes ready itself only if all resources referenced in healthChecks became ready; errors are provided in its status:

[Screenshot: Kustomization status with a failed health check]

The health checks are evaluated on every reconciliation of the Kustomization, so the Kustomization becomes un-ready if its fosterlings fall ill. That's also reflected in its status, for instance:



---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
status:
  conditions:
  - lastTransitionTime: "2023-02-27T12:42:10Z"
    message: 'Applied revision: main@sha1:63ee003b889120646adcae3f5cfadcf23adecd13'
    observedGeneration: 11
    reason: ReconciliationSucceeded
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-02-27T12:42:10Z"
    message: Health check passed in 63.22753ms
    observedGeneration: 11
    reason: Succeeded
    status: "True"
    type: Healthy



Updating the diagram from above: there's a status on every CR, possibly propagated between them, which helps us analyze issues:

[Diagram: Flux custom resources and how their statuses propagate]

Monitoring and reporting resource statuses

Given this knowledge about resource statuses, we can automate the feedback on problems.

Our usual monitoring setup consists of the Kubernetes Prometheus stack for metrics collection and alerting rules, and Grafana for visualization.
Both of them are fed by Flux with resource status information and reconciliation events.

The metric showing the readiness of all Flux CRs is called gotk_reconcile_condition; a value of 1 marks the active condition, and its status label states True for ready and False for not ready:

[Screenshot: gotk_reconcile_condition metric in Prometheus]
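
For illustration, a single series of this metric with its labels could look roughly like this (the resource names are examples):

gotk_reconcile_condition{kind="Kustomization", name="apps", namespace="flux-system", type="Ready", status="True"} 1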

This information can easily be included in your own Grafana dashboards, as shown in the dashboards provided by Flux:

[Screenshot: Grafana dashboard provided by Flux]

That's observability, but we're even more interested in alerts in case of reconciliation issues. Using the Prometheus Operator to manage our Prometheus components, a rule may look like this (in its simplest form):



---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-resources
  namespace: monitoring
spec:
  groups:
    - name: readiness
      rules:
        - alert: ResourceNotReady
          expr: 'gotk_reconcile_condition{type="Ready", status!="True"} == 1'
          annotations:
            description: '{{ $labels.kind }} {{ $labels.name }} in namespace {{ $labels.namespace }} is not ready'
          for: 2m
          labels:
            severity: critical



During the reconciliation process this alert will become "pending", so the for duration has to be chosen slightly longer than your resources' reconciliation usually lasts. For quicker reactions, multiple rules with different selection criteria may be necessary.
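
For example, a rule restricted to application HelmReleases could alert earlier, while the generic rule above stays more patient - the namespace, duration and severity here are assumptions:

        # hypothetical additional rule in the same rule group
        - alert: AppHelmReleaseNotReady
          expr: 'gotk_reconcile_condition{kind="HelmRelease", namespace="app", type="Ready", status!="True"} == 1'
          for: 1m
          labels:
            severity: critical
          annotations:
            description: '{{ $labels.kind }} {{ $labels.name }} in namespace {{ $labels.namespace }} is not ready'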

The labels of the gotk_reconcile_condition metric provide all information required for determining the severity and maybe also the notification channel of a certain alert:

[Screenshot: labels of the gotk_reconcile_condition metric]

Usually, we use different escalation paths depending on the severity of an alert. Everything of interest that doesn't indicate an active incident is routed to a team-internal chat, sometimes separated by technical or business relevance. Alerts that indicate customer impact and potentially affect multiple systems are routed to Opsgenie (an on-call management system), causing service-degradation announcements and notifying the person on call.

Prometheus Alertmanager offers a lot of integrations, for example for Opsgenie, and what's not supported first-class can be connected via 3rd-party components, for instance Prometheus MS Teams. Alerts from the PrometheusRule example above, dispatched to MS Teams, can look like this:

[Screenshot: ResourceNotReady alert rendered in MS Teams]
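
Put together, such severity-based routing could look roughly like this in the Alertmanager configuration - receiver names, endpoints and the API key are placeholders:

route:
  receiver: team-chat
  routes:
    - matchers:
        - severity = "critical"
      receiver: opsgenie
receivers:
  - name: team-chat
    webhook_configs:
      # e.g. the Prometheus MS Teams connector mentioned above
      - url: http://prometheus-msteams:2000/alertmanager
  - name: opsgenie
    opsgenie_configs:
      - api_key: <opsgenie-api-key>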

Integrated observability

Although this already provides most of what we need to be able to react to problems, there are even more great features that give additional insights, integrated into the tools we're using anyway.

There's another Flux component alongside the others that emits detailed information about what's happening with the Flux-managed resources: the notification controller:

[Diagram: the Flux notification controller and its integrations]

It is configured using Provider and Alert CRs and supports a wide range of integrations, of which I'd like to show you my favorites.

Every change in our system originates from Git commits. The effect of a commit can be reported back to the Git management system as a commit status. The following example targets GitHub, but all major providers are supported.
Since Flux uses an SSH deploy key with GitHub, an additional access token is needed that is allowed to write commit statuses. This token is contained in the referenced secret; please see the Flux docs for more details:



---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: github
  namespace: flux-system
spec:
  type: github
  address: https://github.com/MediaMarktSaturn/software-supply-chain-security-gitops
  secretRef:
    name: github
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: commit-status
  namespace: flux-system
spec:
  providerRef:
    name: github
  eventSeverity: info
  eventSources:
    - kind: Kustomization
      name: cluster-config
      namespace: flux-system
    - kind: Kustomization
      name: operators
      namespace: flux-system
    - kind: Kustomization
      name: infrastructure
      namespace: flux-system
    - kind: Kustomization
      name: apps
      namespace: flux-system



With this configuration, a commit status is attached for every Kustomization listed in the eventSources. The Alert CR is used for all kinds of notifications; it covers the full feature set of the notification controller. For commit status notifications, only Kustomization events can be used, as only those carry the commit reference.

This example configuration results in commit statuses in GitHub that look as follows:

[Screenshot: Flux commit statuses on a GitHub commit]

The description contains a summary of the Kustomization status and may already point to error causes.

Another very nice feature makes use of Grafana annotations. With a different provider, the notification controller creates annotations via the Grafana API that can be included in any dashboard, indicating what changed at a certain point in time:



---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: grafana
  address: "http://prometheus-operator-grafana.monitoring/api/annotations"
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: grafana
  namespace: monitoring
spec:
  providerRef:
    name: grafana
  eventSeverity: info
  eventSources:
    - kind: GitRepository
      name: '*'
      namespace: flux-system
    - kind: Kustomization
      name: '*'
      namespace: flux-system
    - kind: HelmRelease
      name: '*'
      namespace: app
    - kind: HelmChart
      name: '*'
      namespace: app



As you can see, all event sources are supported; depending on the concrete source, different metadata is shown - commit hashes for GitRepository or Kustomization, the chart version for HelmRelease.

Including annotations (that are basically just time markers) in any dashboard is quite easy:

[Screenshot: annotation configuration in a Grafana dashboard]

The result: impressive vertical red lines in all panels with a time axis:

[Screenshot: Grafana panels with Flux annotation markers]

This way, all visualizations are connected directly to the change events of the Kubernetes resources, and causalities will never be overlooked again.


These were just some examples of how to enhance the observability and monitoring of a GitOps setup using Flux. I strongly recommend browsing the Flux documentation and release notes from time to time, as there's a lot of movement in this stack.

All of this eases operation of our GitOps setup, but for our business applications we need to go further, so read on.

Not hard to handle

How can we design our GitOps setup in a way that makes it as robust as possible? There are many aspects to consider.
First, there's the human part: we have to make careless mistakes as hard as possible.
As stated at the very beginning, proposed changes to the desired state should be automatically validated for basic correctness. To enforce this, no changes to the single source of truth - meaning the Git branch Flux pulls from - should be allowed without a pull/merge request. Automated checks on this pull/merge request, in addition to the four-eyes principle, should already catch the worst reckless errors.
What's less obvious, but also prone to break something, are merge strategies other than fast-forward. Auto-merging the files involved produces results that could not be checked beforehand. While that may be rare in usual source code merges, isn't it always the same places that change in Kubernetes manifests - leading to a higher probability of semantically wrong merge results?

Another strong opinion of mine, without going into detail right now, concerns the right branching model for multi-stage GitOps repositories.
Yes, Kustomize allows for really sophisticated reuse of manifests. Even though my developer heart screams DRY (don't repeat yourself), I absolutely prefer one-separate-branch-per-stage repositories.
I know all the arguments against it, and I'm also aware of how to handle single-branch setups - but the risk of breaking production with changes only intended for a non-prod environment simply doesn't justify the potential benefits of a single branch. Besides that, pull/merge requests targeting production can very easily be treated differently in terms of mandatory reviews or additional validations, compared to a single-branch setup.
There are many ways of making one-branch-per-stage convenient, like templating differences as variable substitutions (see the sketch below), packaging commonalities into Helm charts or extracting reusable Kustomizations into common repositories.
In fact, we're able to cherry-pick even complex changes from stage to stage, which feels easier and safer to me than patching Kustomizations.
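
For the variable-substitution approach, a hedged sketch of how stage-specific values can be injected via a Kustomization's postBuild section - the names and values are made up:

---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  # (...)
  postBuild:
    # inline values that differ per stage/branch
    substitute:
      stage: prod
    # or values sourced from a ConfigMap maintained per stage
    substituteFrom:
      - kind: ConfigMap
        name: stage-settings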

Consistency and atomicity

The second, technical part is about avoiding inconsistencies and side effects.
Flux's top-level resource is the Kustomization. It provides many configuration options to reduce the blast radius of errors; a combined sketch follows below the list:

  • Kustomizations should depend on each other using dependsOn. This ensures that nothing happens to downstream Kustomizations if there are errors with CRD creation or mandatory infrastructure changes. Combined with the readiness of a Kustomization, which allows for indirect checks (did the installed operator come to life?), changes fail early and don't propagate to healthy components. Also consider fresh bootstraps - without defined dependencies, newly set-up systems may not be able to complete their installation.
  • Critical resources can be protected from deletion using the prune setting. In general, recreating resources is no issue - in fact, we rely on it: some application doesn't recover from an error? Delete it and let Flux recreate it - "turn it off and on again". But this is not true for everything. For instance, externally managed services like load balancers, IP addresses or SSL certificates would probably change when reconfigured implicitly. So we collect all resources of that kind in a Kustomization made durable with prune: false.
  • Another rare but breaking case is exactly the opposite: resources that should be recreated but are immutable, like Jobs. Kustomizations with force: true have us covered, but this should be used with care so as not to obscure unexpected issues.
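
A combined sketch of these options - the names and the grouping of resources are purely illustrative:

---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  # (...)
  # apps are only reconciled once operators and infrastructure are ready
  dependsOn:
    - name: operators
    - name: infrastructure
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: durable-infrastructure
  namespace: flux-system
spec:
  # (...)
  # never delete externally managed resources like load balancers or certificates
  prune: false
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: jobs
  namespace: flux-system
spec:
  # (...)
  # recreate immutable resources like Jobs instead of failing on them
  force: true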

Avoiding error propagation is one thing; Helm also provides us with another very useful feature: atomic updates. Using the --atomic flag with helm upgrade will either properly install the complete package, meaning all rendered manifests, or nothing at all - rolling back all changes to the last revision of that Helm release.
But Flux goes even further: detailed instructions can be configured for handling failures. Installation or upgrade of a Helm release can be retried a given number of times, and remediation actions can be defined, like uninstall on failed installation or rollback on failed upgrade - contrary to the default behavior of leaving the Helm release in a failed state.
Depending on the content of a Helm chart, it can be necessary to automatically roll back failed upgrades on production - but leave them failed on non-prod for analyzing the issue.
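
A hedged sketch of such remediation settings on a HelmRelease - the retry counts and the rollback strategy are just examples:

---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: my-app
  namespace: app
spec:
  # (...)
  install:
    remediation:
      # retry a failed installation a few times before giving up
      retries: 3
  upgrade:
    remediation:
      retries: 2
      # roll back to the last successful revision on a failed upgrade
      strategy: rollback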

What we actually benefit from here is Helm's most beloved but also most hated feature: templating. On any error while rendering the manifests, the Helm upgrade fails and nothing is applied at all. Plain manifests could end up being applied partially, leaving an inconsistent state - even if that risk is much lower without templating in the first place 😉.

But Helm isn't just templating and versioning of manifests, it's a package manager, and during the installation of a package, aka chart, there are many checkpoints we can hook into. Using these, we can prepare for an application update (like migrating database schemas), implement dedicated tests, and also use the hooks to validate and, if necessary, force a Helm upgrade to fail and roll back before the entire system is damaged further.
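
For illustration, a minimal pre-upgrade hook Job as it could appear in a chart - the image and the check command are placeholders:

---
apiVersion: batch/v1
kind: Job
metadata:
  name: pre-upgrade-check
  annotations:
    # run before the upgrade; a failing Job fails the Helm upgrade
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: check
          image: my-registry/preflight-check:latest  # placeholder image
          command: ["/bin/sh", "-c", "./run-preflight-checks.sh"]  # placeholder check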

What I want to say is: don't just throw your manifests into Git and let Flux reconcile until it succeeds or gets stuck, but carefully design your GitOps repository with your systems' special needs in mind. I admit it's as much art as science, but as always with software: it's never finished or perfect, so it should be continuously revised.

Headless deployments

Finally, we made it. All our infrastructure components and configurations are reliable and well observable, and we're actively notified about issues that need to be taken care of.

In the next article we will elaborate on how to ensure reliable, headless deployments of (business) applications using Flagger - happy reading.


get to know us 👉 https://mms.tech 👈
