Scaling Kubernetes Clusters with Armory
Michael Bogan
Posted on October 21, 2021
Overview
Imagine if Amazon were unavailable for just one day each year. At that rate, they would have 99.7% availability, which seems fairly reasonable on the surface. In 2020, however, Amazon's revenue was nearly $400 billion. Living with 99.7% availability instead of 100% would cost Amazon more than $1 billion. Downtime—even just a little—will cost your business.
We minimize downtime through redundancy: building reliable systems out of unreliable components requires spare components to replace the faulty ones. Even so, replacing a faulty component takes time.
Now, let's take it to the realm of software. You have an enterprise-grade system with tens of thousands (or more) of servers and applications spread across the globe. You need to deploy your applications and services to those servers, manage your fleet of infrastructure, upgrade, patch, and support development. In case of an incident, you need to be able to roll back changes safely or fail over to alternate systems. And you need to secure all of these systems and manage access.
Kubernetes has risen to prominence over the last several years because it provides a solid, yet extensible, foundation for building such large-scale systems. For continuous delivery (CD) of large-scale systems, including those built on Kubernetes, the platform of choice is Spinnaker. However, with Spinnaker alone, managing Kubernetes deployments at this level of scale is untenable. What's needed is a robust tool for handling change management, pipeline configuration, and security. Ultimately, what's needed is Armory's enterprise-level distribution of Spinnaker.
In this article, we'll consider the challenges of managing Kubernetes at scale and the hurdles of handling operational environments and upgrading Kubernetes. Then, we'll look at the role of Spinnaker as a CD platform for Kubernetes. Lastly, we'll look at the simplicity and safety that come with using Armory Spinnaker for the task of managing such large-scale Kubernetes systems.
Managing Kubernetes at scale
Kubernetes is great, but it operates at the cluster level, and a Kubernetes cluster has limits. Kubernetes 1.21 supports up to 5,000 nodes per cluster and 110 pods per node, up to a total of 150,000 pods and 300,000 containers. That's a lot, but large enterprises can exceed even these limits by orders of magnitude.
These cluster limits are unlikely to change anytime soon. For larger systems, the solution is to run multiple Kubernetes clusters. But, how sustainable is multi-cluster management without the support of good tools? In addition, we have the challenges of managing access control for users and services, coordinating multiple environments, and dealing with potentially system-breaking Kubernetes upgrades. At the level of scale we're considering—and for enterprises where downtime is unacceptable—the task is herculean. Let's look at each of these challenges in detail.
Multi-cluster management
First, let's look at multi-cluster land. If you have a small number of clusters (say, fewer than 10), you may try to manage them directly, with humans following a runbook to perform tasks like provisioning a cluster, deploying essential software, upgrading, and patching.
This kind of manual operation breaks down pretty quickly. Kubernetes promotes the idea of treating your servers as cattle, not pets; at scale, you need to extend that idea to your clusters as well. What does this mean? It means that clusters are disposable. A cluster that runs some workloads (including critical workloads) can be scaled out (by running more clusters with the same workloads), deleted, or drained, with its workloads transferred to another cluster.
That level of control at scale requires a lot of dedicated cluster management software. One of the most useful patterns is the management cluster. The management cluster is a Kubernetes cluster whose job is to manage other Kubernetes clusters. A good starting point for this approach is the Kubernetes Cluster API, which is a Kubernetes sub-project that deals precisely with this problem.
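To make that concrete, here is a minimal sketch of what a management cluster can see, using the Python kubernetes client (my choice for illustration, not something prescribed here) to list the workload clusters registered through the Cluster API. The cluster.x-k8s.io API version varies across Cluster API releases, so treat it as an assumption.

```python
# Sketch: list workload clusters registered with a Cluster API management cluster.
# Assumes a kubeconfig pointing at the management cluster and the
# cluster.x-k8s.io/v1beta1 API version (may differ by Cluster API release).
from kubernetes import client, config

config.load_kube_config()  # context of the management cluster
custom = client.CustomObjectsApi()

clusters = custom.list_cluster_custom_object(
    group="cluster.x-k8s.io", version="v1beta1", plural="clusters"
)

for item in clusters.get("items", []):
    name = item["metadata"]["name"]
    phase = item.get("status", {}).get("phase", "Unknown")
    print(f"{name}: {phase}")
```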
Users, service accounts, and permissions
Next, consider the challenge of managing and controlling access across a large number of resources at scale. Kubernetes supports users (humans), service accounts (machines), and a role-based access control (RBAC) model. Credentials and secrets need to be managed safely and rotated. Users and service accounts need to be provisioned and removed, and applications need to be onboarded and decommissioned. This type of management is not trivial even for a single cluster. When you consider tens, hundreds, or even thousands of ephemeral clusters that come and go, the problem can seem overwhelming.
There are several issues at stake here, which we'll touch upon briefly.
User experience
The user experience involves requesting the correct level of access on each cluster. Without proper support, it is not easy for users to know what access level they need for each cluster. Often, users will attempt to access some resource through a UI and encounter an obscure “access denied” error.
Operator experience
The operator experience involves managing a complicated graph of resources and users. Often, these are organized in hierarchical structures of resource groups and user groups. Operators need to be able to manage the correct permissions for each user/group as well as user membership within groups. These permissions need adjusting whenever new users join the organization, leave the organization, or change roles. All of this requires integration between developers, operators, and IT.
Blast radius
The blast radius of a breach is often related to the granularity of permissions management. If every developer has full admin rights, then one compromised laptop can lead to a total crisis. On the other hand, custom tailoring the permissions of each user can lead to a lot of busywork. Mistakes will leave people locked out of the resources they need to perform their jobs.
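One way to keep the blast radius small without endless busywork is to script narrowly scoped roles instead of handing out cluster admin. A minimal sketch, again assuming the Python kubernetes client; the namespace and role names are hypothetical:

```python
# Sketch: grant a team only the permissions it needs, in its own namespace,
# instead of cluster-wide admin. Namespace and role names are hypothetical.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

role = client.V1Role(
    metadata=client.V1ObjectMeta(name="deployment-editor", namespace="team-payments"),
    rules=[
        client.V1PolicyRule(
            api_groups=["apps"],
            resources=["deployments"],
            verbs=["get", "list", "watch", "update", "patch"],
        )
    ],
)

rbac.create_namespaced_role(namespace="team-payments", body=role)
```

Bind such a role to a group rather than to individual users, and the same definition can be stamped out consistently across every cluster a team touches.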
Security
Security issues are closely related. If resources are not protected correctly, sensitive information can leak and attackers can take control of critical systems.
Once you add to the mix the simple reality that applications and automated systems also require their own permissions, the problems compound.
The bottom line is that this scale of management is impossible to achieve by clicking on buttons in some web console. The entire process must be dynamic, with auto-discovery and consistent guidelines across the enterprise.
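As a flavor of what "dynamic" can mean, here is a minimal sketch that loops over every context in a kubeconfig and asks each cluster whether the current identity can perform a given action, using SelfSubjectAccessReview (the API behind kubectl auth can-i). The Python kubernetes client, the context list, and the checked resource are assumptions for illustration:

```python
# Sketch: verify access across every cluster in a kubeconfig programmatically,
# rather than clicking through web consoles.
from kubernetes import client, config

contexts, _ = config.list_kube_config_contexts()

for ctx in contexts:
    api_client = config.new_client_from_config(context=ctx["name"])
    authz = client.AuthorizationV1Api(api_client)

    review = client.V1SelfSubjectAccessReview(
        spec=client.V1SelfSubjectAccessReviewSpec(
            resource_attributes=client.V1ResourceAttributes(
                group="apps", resource="deployments", verb="create", namespace="staging"
            )
        )
    )
    result = authz.create_self_subject_access_review(review)
    print(f"{ctx['name']}: can create deployments in staging = {result.status.allowed}")
```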
Operational environments
Now let's look into one aspect of a global, large-scale system in more detail. Each workload doesn't live in a single environment. During the lifecycle of development, deployment, and maintenance, a workload will be used in different environments. For example, a common model is to have multiple deployment environments such as development, staging, and production. When dealing with multiple Kubernetes clusters, managing each of these environments becomes much more difficult.
The production environment is, of course, the most critical one. But often, the staging environment must adhere to very high standards of availability, mimicking the production environment because it serves as the gatekeeper to production deployment.
Earlier, we mentioned redundancy and failover. The system architecture and topology may dictate if your Kubernetes clusters are deployed as multi-zone, multi-region, or even multi-cloud to protect yourself from different failure modes. Redundancy is expensive; managing failover isn’t trivial either.
If your clusters run on a mix of cloud providers and private data centers, this adds yet another dimension of complexity.
Finally, your developers will appreciate a local development experience where they can test their changes before deploying them even to a shared development environment.
Upgrading Kubernetes
Okay. Your Kubernetes clusters are up and running. Everyone is happy. However, even now, your job isn't done. Kubernetes now releases a new minor version three times a year (down from four), so it behooves you to upgrade frequently. This is especially true if you use a managed Kubernetes offering like GKE, EKS, or AKS: your cloud provider will eventually upgrade your clusters whether you like it or not.
Kubernetes has a very orderly deprecation and removal process. Every resource has a group, version, and kind (GVK). First, a particular API version is deprecated: it is still supported, but everyone knows it will be removed at a later time. Once you upgrade to a Kubernetes version that has removed it, any attempt to create or update an object at that API version will fail, and the manifests that still reference it are now broken.
To prevent this from happening, you must be vigilant, constantly updating all your Kubernetes resources to supported versions. A good practice is to update even deprecated resources early. You don't want to delay updates until the last minute, just before you upgrade to the Kubernetes version that no longer supports those resources.
How do you go about this? First, you would read the Kubernetes release notes to learn what resources will be deprecated or removed. Then, scan all your clusters for impacted resources. After identifying those resources, work with the stakeholders to update their resources to supported versions.
At scale, you will need tools to assess compatibility. I wrote two tools that I may open-source soon: One tool accepts a list of deprecated/removed resource versions and scans a list of clusters for resources that use these versions. The other tool takes a list of clusters and a target cluster and does a dry run import of all the original resources to the target cluster.
For example, if you want to upgrade from Kubernetes 1.18 to Kubernetes 1.19, you'll create an empty 1.19 cluster and run the tool with a list of all your 1.18 clusters. The tool will attempt to create (in dry-run mode) all the resources from the 1.18 clusters in the 1.19 cluster. Any incompatibility will cause an error, which will be reported.
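Those tools aren't published, but the dry-run idea is easy to approximate. Here is a simplified sketch, assuming the Python kubernetes client and hypothetical context names, that copies Deployments from a source cluster into a target cluster with server-side dry-run so incompatibilities surface as API errors without changing anything; a real tool would cover every resource kind.

```python
# Sketch: server-side dry-run "import" of Deployments from a source cluster
# into a target cluster to surface incompatibilities. Context names are examples.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

source = client.AppsV1Api(config.new_client_from_config(context="prod-1-18"))
target = client.AppsV1Api(config.new_client_from_config(context="upgrade-test-1-19"))

for dep in source.list_deployment_for_all_namespaces().items:
    # Strip server-populated fields that would be rejected on create.
    dep.metadata.resource_version = None
    dep.metadata.uid = None
    dep.metadata.creation_timestamp = None
    dep.status = None
    try:
        target.create_namespaced_deployment(
            namespace=dep.metadata.namespace, body=dep, dry_run="All"
        )
        print(f"OK   {dep.metadata.namespace}/{dep.metadata.name}")
    except ApiException as e:
        print(f"FAIL {dep.metadata.namespace}/{dep.metadata.name}: {e.reason}")
```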
Once you're ready for the upgrade and all your workloads use supported versions, it's time to perform the actual upgrade itself. There are two parts to the upgrade process: upgrading the control plane and upgrading the data plane (worker nodes). The data plane version means the version of the Kubernetes components installed on each node (kubelet, container runtime, kube-proxy). Different nodes may run different versions, but a node can be at most two minor versions older than the control plane and never newer than it.
That means that when you upgrade your cluster, first you must upgrade the control plane to a version that is at most two minor versions ahead of your oldest node version. For example, if your nodes are on Kubernetes 1.18, you can upgrade your control plane to 1.19 or 1.20. Then, you can follow up and upgrade your nodes to the same version.
The best practice is to keep everything consistent—your control plane and all your nodes should all be on the same minor version of Kubernetes. The only exception is during the upgrade process itself; there will temporarily be some components on the old version and some on the new version.
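A simple automated check helps enforce that consistency. The sketch below, once more assuming the Python kubernetes client and a hypothetical context name, compares each node's kubelet version against the control plane version:

```python
# Sketch: flag nodes whose kubelet minor version differs from the control plane.
from kubernetes import client, config

api_client = config.new_client_from_config(context="prod-cluster")
server_version = client.VersionApi(api_client).get_code().git_version  # e.g. "v1.19.7"

for node in client.CoreV1Api(api_client).list_node().items:
    kubelet = node.status.node_info.kubelet_version
    if kubelet.split(".")[:2] != server_version.split(".")[:2]:
        print(f"{node.metadata.name}: kubelet {kubelet} differs from control plane {server_version}")
```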
All of this requires considerable coordination, especially if you run Kubernetes clusters on a combination of cloud providers and on-prem. Perhaps not all cloud providers support the same versions of Kubernetes. You will need to find a common ground or deviate from the best practice and run different versions of Kubernetes on different providers.
Has it become clear yet that managing Kubernetes at scale is a non-trivial task? Doing it well requires a lot of supporting software, much of which is not available off the shelf. You will need to build your own tools, evolve them, and support advanced deployment scenarios.
Beyond the task of managing your Kubernetes clusters lies the colossal task of deploying and scaling your applications, and this is where Spinnaker comes in.
Spinnaker as a CD Platform for Kubernetes
While there are several powerful and popular continuous delivery solutions for Kubernetes, Spinnaker has unique strengths for enterprises that need to manage a large number of Kubernetes clusters.
First of all, Spinnaker has been battle-tested at scale: it was originally developed by Netflix and later extended by Google. Second, Spinnaker supports multiple providers, not just Kubernetes. Not being Kubernetes-specific may seem like a disadvantage, but no company with thousands of Kubernetes clusters runs only those clusters. Companies that operate at this scale have been around since long before Kubernetes appeared in 2014, and they typically have a large portfolio of systems deployed in non-Kubernetes environments. Spinnaker supports their Kubernetes environments as well as their non-Kubernetes environments.
In addition, Spinnaker is technically very powerful, supporting multiple models of deployments, which is necessary for managing a large variety of services and applications.
Finally, Spinnaker—while being an open-source project—also enjoys strong commercial support from companies like Armory.
Simplicity and Safety with Armory Spinnaker
While Spinnaker gets our foot in the door in regard to large-scale deployments, for enterprises this is just the beginning. In enterprises running systems of this scale, it's not just one DevOps guru running all things Spinnaker. There are entire DevOps teams (plural) all touching Spinnaker pipelines at the same time. Spinnaker alone does not afford the policy management and pipeline change management that these enterprises need. To meet this need, Armory has developed its enterprise-level distribution of Spinnaker. Armory extends Spinnaker in very significant ways to make it a better Kubernetes citizen, especially at scale.
First, Armory builds on top of vanilla Spinnaker using Spinnaker's own plugin framework. This is crucial for sustainable integration. About a decade ago, I worked for a company that built a social browser based on Chromium. The project was very innovative and successful, but it involved deep customization of Chromium. Whenever Chromium released a new version, we had to reapply all of our customizations on top of it. Whenever a security fix was disclosed, we were vulnerable until we finished the integration, which took days (in some cases, weeks). In this regard, Armory takes the high road, contributing a plugin framework to open-source Spinnaker and using those standard extension mechanisms itself. Put simply, Armory builds on top of open-source Spinnaker rather than forking it.
Second, Armory is also dedicated to Kubernetes, focused on making Spinnaker a better Kubernetes citizen. This starts with the Armory Agent for Kubernetes. The agent allows distributed deployments to thousands of clusters and decentralized account management. In addition, Armory provides a policy engine that allows the setting of organizational policies, increasing safety and compliance.
As if that weren't enough, Armory throws in other goodies like Terraform integration and secrets management. But arguably the most important capability for managing Kubernetes at scale is Armory's "Pipelines as Code" feature: GitOps for Spinnaker pipelines. Keeping your Spinnaker pipelines as code in source control, subject to your standard review and change management processes, makes a huge difference.
Conclusion
Kubernetes solves many problems for modern container-based distributed applications. When you migrate enterprise-scale systems to Kubernetes, you might end up managing a large number of Kubernetes clusters. To do this well, you'll need a solid solution for deploying your workloads to all these clusters as well as for managing the clusters themselves. If you manage a large number of Kubernetes clusters, or plan to migrate large existing systems to Kubernetes, Spinnaker and Armory can serve you well as your go-to continuous delivery solution.