Orchestrating Infrastructure Upgrades with Conductor

Authored by @marciojrtorres

Orkes Netflix Conductor is an OSS-based platform that helps in rapid application development. At Orkes, we manage numerous Orkes Netflix Conductor clusters for customers who run their production workloads on this platform. It is our responsibility to keep their infrastructure up to date and healthy.

Recently, we faced the challenge of upgrading multiple EKS clusters to a newer version. Upgrading each cluster manually would have been time-consuming, error-prone, and inefficient.

To tackle this issue, we turned to our own product - Orkes Netflix Conductor, a powerful orchestration engine that enabled us to automate the entire upgrade process seamlessly across our numerous clusters. In this article, we will delve into the details of our implementation, the benefits we saw, and how this solution can help similar use cases.

Challenge: Upgrading Multiple EKS Clusters

As a company with a diverse range of clients and regions, we operate numerous clusters across all major cloud providers. Many of these are AWS-based EKS (Elastic Kubernetes Service) clusters. These clusters required an upgrade covering both the control plane and the node groups, carried out in line with AWS's upgrade guidelines. Performing these upgrades manually would have been time-consuming, probably taking hundreds of hours and leaving room for errors that could impact customer workloads. During this upgrade, we also had to ensure the compatibility of dependencies, which added another layer of complexity to the process.

Another key requirement was keeping everyone informed at each step of the way: our DevOps team, our support team, and our customers, so that they know what is happening and can follow along.

If not for Conductor, what would have been the alternatives?

While tools like Terraform and other Infrastructure as Code (IaC) solutions can assist in managing or upgrading Kubernetes clusters, we realized that we needed a more comprehensive solution that could offer visualization and communication all along the process.

We needed a way to:

Conduct health checks on our application components at each step of the upgrade process, and stop the workflow as soon as a check fails so that an issue cannot escalate (a fail-fast design).
Keep stakeholders informed via Slack or email, especially in failure cases, so that the DevOps team can act swiftly and mitigate any issues that arise.
Verify the correct functioning of the clusters.
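To make these requirements concrete, here is a minimal sketch in plain Python of a fail-fast upgrade loop. It is not Conductor code and not our actual implementation: `UpgradeRun`, `run_upgrade`, and the callback signatures are hypothetical names chosen for illustration. Each step is followed by a health check, every transition is reported through a notification callback, and the run stops at the first failed check.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class UpgradeRun:
    """Records what happened during one (hypothetical) cluster upgrade."""
    completed: List[str] = field(default_factory=list)
    notifications: List[str] = field(default_factory=list)
    succeeded: bool = False


def run_upgrade(
    steps: Sequence[Tuple[str, Callable[[], None]]],
    health_check: Callable[[], bool],
    notify: Callable[[str], None],
) -> UpgradeRun:
    """Run steps in order; stop at the first failed health check."""
    run = UpgradeRun()

    def tell(message: str) -> None:
        # Keep a local record and forward to Slack, email, or similar.
        run.notifications.append(message)
        notify(message)

    for name, action in steps:
        tell(f"starting {name}")
        action()
        if not health_check():
            # Fail fast: later steps are never attempted.
            tell(f"health check failed after {name}; stopping upgrade")
            return run
        run.completed.append(name)

    run.succeeded = True
    tell("upgrade finished")
    return run
```

In a real setup the orchestration engine plays the role of `run_upgrade`, and `notify` would post to a chat channel or send an email instead of calling a local function.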

Implementation

Let's delve into the key aspects of our implementation:

Workflow design

To handle this use case, we developed a workflow representing the steps involved in upgrading our EKS clusters. This workflow provided us with a clear overview of the process, helping us identify potential bottlenecks and fine-tune our strategy visually. See a simplified version of the diagram below:

Note that this diagram shows an execution of the workflow rather than its definition; the definition looks the same, minus the green checks that mark successful execution. The diagram is also slightly simplified for this blog (otherwise, it would have been very long). As you can see, several of the steps are health checks, and several others notify the teams.

Implementation Steps

The next step was implementing the workflow steps we needed in our code. Many of the steps had already been implemented for other use cases, so we could simply reuse them. For example, we have health_verification as a generic task that can be used in any workflow that may cause a service disruption.
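As an illustration of what a reusable task like health_verification could look like, here is a minimal sketch of its core logic as a plain Python function. The input shape (a `components` mapping of name to healthy flag) and the output fields are assumptions made for this example, not the actual task contract; only the `COMPLETED` and `FAILED` status names come from Conductor's task model.

```python
from typing import Any, Dict


def health_verification(task_input: Dict[str, Any]) -> Dict[str, Any]:
    """Evaluate component health reported in the task input.

    Assumed input: task_input["components"] maps a component name
    to True (healthy) or False (unhealthy).
    """
    components = task_input.get("components", {})
    unhealthy = sorted(name for name, ok in components.items() if not ok)

    # An empty component list counts as a failure: nothing was verified.
    healthy = bool(components) and not unhealthy

    return {
        "status": "COMPLETED" if healthy else "FAILED",
        "output": {"healthy": healthy, "unhealthy": unhealthy},
    }
```

Because the function only depends on its input dictionary, the same logic can be placed after any disruptive step in any workflow, which is what makes the task generic.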

Step-by-Step Execution

We divided the upgrade process into discrete steps, allowing Conductor to orchestrate the execution. Each step involved upgrading different components of the EKS cluster, such as the Control Plane or the Node Group. Conductor also ensured that these steps were executed in the correct order, guaranteeing a smooth and efficient upgrade. Though these sequential tasks represent the happy path, we had a failure workflow that would be triggered in case of any issue along the process.
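The ordering and the failure path described above can be sketched as a workflow definition built in Python. The field names (`name`, `version`, `schemaVersion`, `tasks`, `taskReferenceName`, `type`, `inputParameters`, `failureWorkflow`) follow Conductor's workflow definition JSON; the workflow, task, and input names such as `eks_cluster_upgrade` and `eks_upgrade_failure_handler` are hypothetical, and our real workflow contains more steps than this.

```python
import json
from typing import Any, Dict, List


def simple_task(name: str, ref: str, inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Build one SIMPLE (worker-executed) task entry."""
    return {
        "name": name,
        "taskReferenceName": ref,  # must be unique within the workflow
        "type": "SIMPLE",
        "inputParameters": inputs,
    }


def build_upgrade_workflow(components: List[str]) -> Dict[str, Any]:
    """Upgrade each component in order, with a health check after each."""
    tasks: List[Dict[str, Any]] = []
    for component in components:
        tasks.append(simple_task(
            f"upgrade_{component}",
            f"upgrade_{component}_ref",
            {
                "cluster": "${workflow.input.cluster}",
                "targetVersion": "${workflow.input.targetVersion}",
            },
        ))
        tasks.append(simple_task(
            "health_verification",
            f"health_after_{component}_ref",
            {"cluster": "${workflow.input.cluster}"},
        ))
    return {
        "name": "eks_cluster_upgrade",  # hypothetical name
        "version": 1,
        "schemaVersion": 2,
        "inputParameters": ["cluster", "targetVersion"],
        "tasks": tasks,  # tasks in this list run sequentially
        # Workflow started when this one fails (hypothetical name).
        "failureWorkflow": "eks_upgrade_failure_handler",
    }


definition_json = json.dumps(
    build_upgrade_workflow(["control_plane", "node_group"]), indent=2
)
```

Generating the definition from a list of components keeps the health check after every step by construction, so a new upgrade step cannot be added without its verification.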

Enhanced Monitoring and Error Handling

One of the significant advantages of using Orkes Netflix Conductor was the visibility it gave us into the upgrade process. We integrated Conductor with our communication channels, such as Slack and email, to inform relevant stakeholders about the progress. Moreover, Conductor's error handling capabilities enabled us to automatically retry transient errors, minimizing potential disruptions and helping each upgrade complete successfully.
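In Conductor, retries are configured on the task definition rather than coded into each worker. The sketch below builds such a definition in Python; `retryCount`, `retryLogic`, and `retryDelaySeconds` are task definition fields, while the helper name `retrying_task_def`, the task name, and the specific values are assumptions chosen for illustration, not the settings we use in production.

```python
import json
from typing import Any, Dict


def retrying_task_def(
    name: str, retry_count: int = 3, delay_seconds: int = 30
) -> Dict[str, Any]:
    """Build a task definition that retries failed executions."""
    if retry_count < 0 or delay_seconds < 0:
        raise ValueError("retry_count and delay_seconds must be non-negative")
    return {
        "name": name,
        # How many times a failed task is retried before giving up.
        "retryCount": retry_count,
        # Wait longer between successive attempts.
        "retryLogic": "EXPONENTIAL_BACKOFF",
        # Base delay between attempts, in seconds.
        "retryDelaySeconds": delay_seconds,
    }


task_def_json = json.dumps(retrying_task_def("upgrade_node_group"), indent=2)
```

Once the retries are exhausted the task fails for good, which is the point at which a failure workflow and its notifications take over.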

Here is our team cheering along to the updates:

Summary

By utilizing Netflix Conductor to automate our EKS cluster upgrades, we realized several key benefits:

Time and Effort Savings: Automating the upgrade process with Conductor saved us hundreds of hours that would have otherwise been spent on manual upgrades. This allowed our DevOps team to focus on more strategic tasks, improving overall productivity.

Error-Free Execution: With Conductor orchestrating the upgrade steps, the chances of human error were significantly reduced. The consistent and reliable execution ensured that each cluster was upgraded correctly and efficiently, avoiding the issues that manual intervention can introduce.

Retry Mechanism for Transient Errors: Netflix Conductor's automatic retry mechanism for transient errors was instrumental in maintaining the upgrade's integrity. By automatically retrying failed steps, we could just sit back and get notified as each cluster was upgraded smoothly.

If you are looking to solve similar issues with Conductor, get in touch with us now.

We offer the cloud version, Orkes Conductor, on all prominent cloud platforms: AWS, Azure, and GCP.

If you want to try out Conductor for free, check out Orkes Playground - a free tool from Orkes.
