garry
Posted on March 12, 2024
how to give product teams the autonomy to quickly provision infrastructure for their services, without the platform team losing control
context 📚
For many tech startups and scale-ups, the technology team usually evolves into some sort of central Platform team(s) (aka "DevOps", "Infrastructure", or "SRE", maybe some "Dev Ex") and a bunch of Product teams, who are constantly working to build new features and maintain the core product for customers.
Product teams want to release their work quickly, often, and reliably, with the autonomy to get a database or other resource provisioned without having to wait on a Platform team to do it for them.
The Platform team, however, wants to be comfortable and confident that all of the infrastructure the company is running is stable, securely configured, and right-sized to reduce overspending on their cloud costs. They'll also want to keep infrastructure as code, with tools like Terraform, which the Product teams might not care to use or learn.
So how can the Platform team enable the Product teams to work efficiently and not be blocked, while not losing visibility or control over the foundations upon which they run?
first attempt at self-service infra 🧪
I wrote in my previous post about how the Platform team I worked on adopted GitOps and Helm to codify the deployment process, with the additional benefits of making deployments auditable and making Kubernetes clusters recoverable in disaster scenarios.
Once that migration was completed, we wanted to enable the many Product teams to have the independence to set up new micro-services without our involvement or the need to raise any tickets.
Our first attempt was to introduce other engineering teams to Terraform - the Platform team was already using it extensively with Terragrunt, and using Atlantis to automate `plan` and `apply` operations in a Git flow to ensure infrastructure was consistent. We'd written modules, with documentation, and an engineer would simply need to raise a PR to use the module and provide the right values, and Atlantis (once the PR was approved by Platform) would go ahead and set it up for them.
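As a rough sketch (the module path and inputs are illustrative, not the real repo), consuming one of those Platform-owned modules from a Terragrunt unit might have looked something like this:

```hcl
# terragrunt.hcl in a product team's folder of the infrastructure repo (illustrative)
include {
  # inherit the shared remote state and provider configuration
  path = find_in_parent_folders()
}

terraform {
  # Platform-owned, versioned Terraform module
  source = "git::git@github.com:example-org/terraform-modules.git//service-postgres?ref=v1.4.0"
}

inputs = {
  service_name      = "account-service"
  rds_instance_name = "account"
}
```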
To us, this felt light touch. Product engineers wouldn't have to learn Terraform (Platform would own and maintain the Terraform modules), they'd just need to learn how to use those modules with Terragrunt and apply them with Atlantis. We wrote up docs, recorded a show and tell, ... profit?
Except not quite.
another tool to learn 🛠️
While a few engineers did start to use this workflow, and they appreciated getting more hands-on instead of raising tickets to have it done for them at some undetermined future date, most engineers didn't.
Whether it's a cultural thing, and teams don't want to care about the nuts and bolts under their service at runtime, or simply because they're on tight deadlines and don't have time to stop and work out Atlantis steps, ultimately, it doesn't really matter. As a Platform team, we're an enablement team, and the Product teams are our customers. What we had built was not serving their needs.
Given the team had already adopted GitOps and were familiar with deployments powered by Helm Releases and Flux, we wanted to move the provisioning of the infrastructure to be part of the same process of creating the service and its continuous deployment.
infrastructure as code as GitOps 🚀
We stumbled upon a project for managing Terraform through CRDs that we could deploy with Helm. That project is now called Tofu-Controller - another Weaveworks project, so it integrated well with our existing Flux setup.
This meant that Product engineers could provision a database for their service, or any other per-service infrastructure they needed, from the Helm Release they already used to configure their service at runtime.
```yaml
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: account-service
  namespace: dev
spec:
  chart:
    spec:
      chart: kotlin-deployment
      version: 2.0.33
  values:
    replicas: 2
    infrastructure:
      postgres:
        enabled: true
        rdsInstance: account
```
The above (simplified) example shows `.Values.infrastructure.postgres.enabled` set to `true`. When the Helm chart is installed or upgraded, a `Terraform` resource will be templated:
```yaml
{{- if eq .Values.infrastructure.postgres.enabled true }}
---
apiVersion: infra.contrib.fluxcd.io/v1alpha2
kind: Terraform
metadata:
  name: "postgres-{{ .Release.Name }}"
spec:
  interval: 2h
  approvePlan: auto
  tfstate:
    forceUnlock: auto
  destroyResourcesOnDeletion: false
  sourceRef:
    kind: GitRepository
    name: service-postgres
  backendConfig:
    customConfiguration: |
      backend "s3" {
        ...
      }
  vars:
    - name: service_name
      value: "{{ .Release.Name }}"
    - name: rds_instance_name
      value: "{{ .Values.infrastructure.postgres.rdsInstance }}"
{{- end }}
```
Again, this code has been simplified for clarity, but there are a few things worth noting:
- the `Terraform` resource can be configured to automatically plan and apply, and even `forceUnlock` some state locks
- the 2 hour `interval` means the operator will continually try to re-apply the Terraform regularly
- the `backendConfig` means the Terraform operator can share the remote state bucket with Atlantis; this is powerful since you can then reference across modules to get remote state outputs
- importantly for stateful resources like databases, you can set `destroyResourcesOnDeletion` to false to avoid destroying data when you uninstall the Helm chart
- we can pass in the `vars` as usual to the `service-postgres` Terraform module; here we're passing in a name for the service (which maps to the database name and database user) and the name of the RDS instance on which to create it
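For completeness: making this opt-in relies on the deployment chart shipping sensible defaults, so services that don't ask for Postgres are unaffected. A minimal sketch of what the chart's default values.yaml might contain (illustrative, not the actual chart):

```yaml
# kotlin-deployment chart: default values (illustrative)
replicas: 1

infrastructure:
  postgres:
    enabled: false   # services opt in from their HelmRelease values
    rdsInstance: ""  # which shared RDS instance to create the database on
```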
When the above Helm chart was installed, it would create a custom resource of the `Terraform` kind, and the operator would plan and apply the `service-postgres` module with the `vars` set as inputs. In this case, it would create a database and user called `account-service` on the `account` RDS instance, and manage roles and grants, passwords, security group access, etc.
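The `service-postgres` module itself isn't shown in this post, but a heavily simplified sketch of what it might contain, using the community PostgreSQL and random providers (the resource layout and names are assumptions), could look like this:

```hcl
# service-postgres module (illustrative sketch)
variable "service_name" {
  type = string
}

variable "rds_instance_name" {
  type = string
}

# Generate credentials for the service's database user
resource "random_password" "service" {
  length  = 32
  special = false
}

# Role and database named after the service, created on the shared RDS instance
resource "postgresql_role" "service" {
  name     = var.service_name
  login    = true
  password = random_password.service.result
}

resource "postgresql_database" "service" {
  name  = var.service_name
  owner = postgresql_role.service.name
}
```

In the real module, this is presumably also where grants, security group access, and a Kubernetes secret holding the generated credentials would be managed.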
eventually consistent infrastructure ⏱️
Of course, the Terraform resources might take a short while to set up, so the service will need to handle scenarios where its Terraform dependencies might not exist yet. In practice though, we were able to provision databases, service accounts, S3 buckets, Kafka topics, and more within a few seconds and the service's pods would simply restart until they existed.
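Letting the pods crash and restart is the simplest approach; as an alternative sketch, the deployment chart could also ship an initContainer that waits for the database to become reachable before the app starts (the secret name and key here are illustrative):

```yaml
# Illustrative initContainer: block the app from starting until its database responds
initContainers:
  - name: wait-for-postgres
    image: postgres:16-alpine
    command:
      - sh
      - -c
      - until pg_isready -h "$DB_HOST" -p 5432; do echo "waiting for database..."; sleep 5; done
    env:
      - name: DB_HOST
        valueFrom:
          secretKeyRef:
            name: account-service-postgres  # secret created by the Terraform module (assumed)
            key: host
```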
The Terraform operator will also continually apply the resources it's managing, which helped us avoid drift between what we expect to exist and what actually does; it also corrects any situation where someone manually changes the infrastructure outside of the code-review process.
split brain terraform 🧠
It's important to point out that this workflow is only for per-service infrastructure; each Helm Release would only provision resources for its own service, and those would be managed by the operator.
The Platform team would continue to use Atlantis and the Terragrunt repo to manage the main cloud estate (VPCs, security groups, database instances, EKS, etc). The Platform team would also maintain the deployment Helm chart and the Terraform modules referenced by it.
The per-service Terraform modules could reference the remote state of those managed by Atlantis, since they shared the same remote state S3 bucket. In the example above, by passing the name of the RDS instance into the Terraform resource in Kubernetes, the operator can pull whatever outputs it needs from the remote state that Atlantis wrote when it set up that instance.
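As a rough sketch of that cross-module referencing (the bucket name, state key, and output names are assumptions), the per-service module can read the Atlantis-managed instance's state and configure its PostgreSQL provider from it:

```hcl
# Look up the remote state of the RDS instance that Atlantis provisioned (illustrative)
data "terraform_remote_state" "rds_instance" {
  backend = "s3"
  config = {
    bucket = "example-terraform-state"
    key    = "rds/${var.rds_instance_name}/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Point the PostgreSQL provider at that instance's endpoint and admin credentials
provider "postgresql" {
  host     = data.terraform_remote_state.rds_instance.outputs.endpoint
  username = data.terraform_remote_state.rds_instance.outputs.master_username
  password = data.terraform_remote_state.rds_instance.outputs.master_password
  sslmode  = "require"
}
```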
next steps 🥾
In many ways, this made the HelmRelease in the GitOps repo the source of truth for the deployment: it already described how the service should run, and now it also declared some of the service's infrastructure dependencies.
With time, our goal was to abstract the YAML from the repo into a service catalog like Backstage, where it would be easy to say "I want to create a service called X, it needs Postgres and Kafka", and its entire setup, including some boilerplate code and its infrastructure, could go off and be created automatically.