Chaos Engineering 101: principles, process, and examples
Amanda Fawcett
Posted on November 2, 2020
As the web has grown increasingly complex alongside technologies like cloud computation, distributed systems, and microservices, system failures are harder to predict. To prevent outages, companies large and small have turned to chaos engineering as a solution.
Chaos engineering lets you predict and identify potential failures by breaking things on purpose. This way, you can find and fix failures before they become outages. Chaos engineering is a growing trend for DevOps and IT teams. Even companies like Netflix and Amazon use these principles in product development.
If you are new to chaos engineering, you’re in the right place. Today, we will introduce its principles in depth and show you how to get started with Kubernetes.
We will learn:
- What is Chaos Engineering?
- Chaos Engineering Tools
- Principles and Process of Chaos Engineering
- Chaos Engineering Example: Kubernetes Application
- What to learn next
What is Chaos Engineering?
Chaos engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. With chaos engineering, we intentionally try to break our system under certain stresses to uncover potential outages, locate weaknesses, and improve resiliency.
Chaos engineering is different from software testing or fault injection. Chaos engineering is used for all sorts of requirements and unpredictable situations, including traffic spikes, race conditions, and more.
With chaos engineering, we are trying to learn how an entire system reacts when an individual component is failing.
For example, chaos engineering can help answer functionality questions like these:
- What happens when a service is not accessible, one way or another?
- What is the result of outages when an application receives too much traffic or when it is not available?
- Will we experience cascading errors when a single point of failure crashes an app?
- What happens when our application goes down?
- What happens when there is something wrong with networking?
History: Chaos Engineering was first developed at Netflix in 2008 when their subscription streaming service was transitioned to the public cloud. Netflix’s engineers noted that they needed new ways of testing this system for resiliency.
Chaos Monkey was created in 2010 for that purpose. Since then, chaos engineering has grown, and companies like Google, Facebook, Amazon, and Microsoft have implemented similar testing models.
Benefits of Chaos Engineering
Chaos engineering offers many benefits that other forms of software testing or failure testing cannot. Failure tests can only examine a single condition in a binary breakdown. This doesn’t allow us to test a system under unprecedented or unexpected stresses.
Chaos engineering, on the other hand, can account for complex, diverse, and real-world issues or outages. With chaos engineering, we can fix issues and gain new insights about an application for future improvements.
Chaos experiments help to reduce failures and outages while improving our understanding of our system design. Chaos engineering improves a service’s availability and durability, so customers are less disrupted by outages. Chaos engineering can also help prevent revenue losses and lower maintenance costs at the business level.
Chaos Engineering Tools
Before we start defining and running chaos experiments, we need to pick a tool. Chaos engineering is not yet a segment of the market that is well established and developed. Nevertheless, there are several tools we can pick from.
One of the most notable tools for chaos engineering is Simian Army, developed by Netflix. Simian Army is best for services in the cloud and AWS. It can generate failures and detect abnormalities. Chaos Monkey, also from Netflix, is a resiliency tool that randomly terminates instances to test how a system copes with their failure.
PowerfulSeal is a powerful tool for testing Kubernetes clusters, and Litmus can be used for stateful workloads on Kubernetes. Pumba is used with Docker for chaos testing and network emulation. Gremlin offers a Chaos Engineering platform that now supports testing on Kubernetes clusters.
Chaos Dingo is commonly used for Microsoft Azure, and Chaos HTTP Proxy can be used to introduce failures into HTTP requests.
Principles and Process of Chaos Engineering
As more teams have conducted experiments over the years, they’ve learned how to most effectively apply chaos engineering approaches to their systems. These best practices have become the core principles of chaos engineering. Let’s discuss the core principles of chaos engineering that every team should implement in their experiments.
Build a hypothesis around steady-state
You want to build a hypothesis around a steady-state behavior. Then, you perform potentially damaging actions on the network, applications, nodes, or any other component of the system.
You create these disruptive situations to confirm that your steady-state hypothesis holds. An experiment starts by validating that the system is in a specific state, performs its actions, and finishes with the same validation to confirm that the state did not change.
Simulate real-world events
You want to do chaos engineering based on real-world events. In other words, only replicate events that are likely to happen in your system. This includes an application crash, a network disruption, or a node failure.
Run experiments in production
You want to run chaos experiments in production, since that is the “real” system. If you perform chaos experiments only in staging or integration, you cannot get a real picture of how the system in production behaves.
Automate experiments and run them continuously
You want to automate your experiments to run continuously or as part of continuous delivery pipelines. This could mean every hour, every few hours, every day, every week, or every time some event happens in your system. You also want to run experiments every time you deploy a new release.
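As a rough illustration of the triggers described above, the sketch below decides whether an experiment is due. The function `should_run` and its parameters are hypothetical names chosen for this example; a real setup would more likely rely on a scheduler or a CI/CD pipeline step.

```python
from datetime import datetime, timedelta


def should_run(now, last_run, interval, new_release=False):
    """Run on every new release, or once the interval since the last run has elapsed."""
    if new_release or last_run is None:
        return True
    return now - last_run >= interval


last = datetime(2020, 11, 2, 9, 0)
hourly = timedelta(hours=1)

print(should_run(datetime(2020, 11, 2, 9, 30), last, hourly))                    # False
print(should_run(datetime(2020, 11, 2, 10, 0), last, hourly))                    # True
print(should_run(datetime(2020, 11, 2, 9, 30), last, hourly, new_release=True))  # True
```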
Minimize blast radius
You should reduce the blast radius of your experiments. When you start with chaos experiments, you want to start small and build up as you gain confidence in a system. Eventually, you should do experiments across the whole system.
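One way to read "start small" is to cap how many instances a single experiment may touch. The sketch below is illustrative only: `pick_targets` and `max_fraction` are hypothetical names, and the pod names are made up.

```python
import math
import random


def pick_targets(instances, max_fraction, seed=None):
    """Return a random subset no larger than max_fraction of all instances.

    At least one instance is always selected so the experiment does something.
    """
    if not instances:
        return []
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    rng = random.Random(seed)
    limit = max(1, math.floor(len(instances) * max_fraction))
    return rng.sample(list(instances), limit)


pods = [f"pod-{i}" for i in range(20)]
first_run = pick_targets(pods, max_fraction=0.05, seed=1)  # small blast radius
later_run = pick_targets(pods, max_fraction=0.5, seed=1)   # wider, once confident
print(len(first_run), len(later_run))
```

Raising `max_fraction` gradually mirrors the advice to widen experiments only as confidence grows.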
Summary of Principles
- Build a hypothesis around a steady-state
- Simulate real-world events
- Run experiments in production
- Automate experiments and run them continuously
- Minimize blast radius
Chaos Engineering Process
The general process for chaos engineering looks as follows:
- Define a steady-state hypothesis: You need to start with an idea of what can go awry. Start with a failure to inject and predict an outcome for when it is running live.
- Confirm the steady-state and simulate some real-world events: Perform tests using real-world scenarios to see how your system behaves under particular stress conditions or circumstances.
- Confirm the steady-state again: We need to confirm what changes occurred, so checking it again gives us insights into system behavior.
- Collect metrics and observe dashboards: You need to measure your system’s durability and availability. It is best practice to use key performance metrics that correlate with customer success or usage. We want to measure the failure against our hypothesis by looking at factors like impact on latency or requests per second.
- Make changes and fix issues: After running an experiment, you should have a good idea of what is working and what needs to be altered. Now we can identify what will lead to an outage, and we know exactly what breaks the system. So, go fix it, and try again with a new experiment.
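The first three steps of the process can be sketched as a small function. This is a simplified model under stated assumptions, not how any particular tool is implemented: the toy "system" is a Python set, and the status strings are borrowed from the Chaos Toolkit log output shown in the example later in this article.

```python
def run_experiment(probe, tolerance, action):
    """Run one experiment and return a status string.

    probe: callable returning the observed value of the steady state
    tolerance: value the probe must return for the steady state to hold
    action: callable that injects the failure
    """
    # 1. Confirm the steady state before touching anything.
    if probe() != tolerance:
        return "failed"      # the system was not healthy to begin with
    # 2. Simulate the real-world event.
    action()
    # 3. Confirm the steady state again.
    if probe() != tolerance:
        return "deviated"    # a weakness may have been discovered
    return "completed"


# Toy system: a set of running instances (purely illustrative).
running = {"app-1"}


def count_instances():
    return len(running)


def terminate_one():
    running.pop()


status = run_experiment(count_instances, tolerance=1, action=terminate_one)
print(status)  # deviated
```

Nothing recreates the terminated instance in this toy system, so the second check fails and the experiment reports a deviation.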
Chaos Engineering Example
Now let’s apply all that theory to a simple real-world example to better understand chaos engineering. We will be using Kubernetes. To begin, we create a Kubernetes cluster. Then, we will deploy our simple application and destroy it. Finally, we will show you how to define steady states, which is crucial for chaos engineering.
Note: If you are new to Kubernetes, we recommend the course A Practical Guide to Kubernetes before continuing with chaos engineering. Or, you can follow along just to get an idea of how basic chaos engineering looks.
Create a Kubernetes Cluster
First, we need a Kubernetes cluster to destroy. You can choose Minikube, Docker Desktop, AKS, EKS, or GKE. Below, we use Docker Desktop to create a cluster. If you would like to learn how to create a cluster using the other tools, please refer to the course The DevOps Toolkit: Kubernetes Chaos Engineering.
# Source: https://gist.github.com/f753c0093a0893a1459da663949df618
####################
# Create A Cluster #
####################
# Open Docker Preferences, select the Kubernetes tab, and select the "Enable Kubernetes" checkbox
# Open Docker Preferences, select the Resources > Advanced tab, set CPUs to 4, and Memory to 6.0 GiB, and press the "Apply & Restart" button
#######################
# Destroy the cluster #
#######################
# Open Docker Troubleshoot, and select the "Reset Kubernetes cluster" button
# Select *Quit Docker Desktop*
From Viktor Farcic's demo in Educative's Chaos Engineering course
Clone and explore the repository
We need to deploy a demo application, which we’ve prepared below. We’re going to clone the repository vfarcic/go-demo-8 created by Viktor Farcic.
git clone https://github.com/vfarcic/go-demo-8.git
Next, we enter into the directory where we cloned the repository.
cd go-demo-8
git pull
Now, create a namespace called go-demo-8.
kubectl create namespace go-demo-8
Now, let’s take a quick look at the application we’re going to deploy, located in the terminate-pods directory, in a file called pod.yaml.
---
apiVersion: v1
kind: Pod
metadata:
  name: go-demo-8
  labels:
    app: go-demo-8
spec:
  containers:
  - name: go-demo-8
    image: vfarcic/go-demo-8:0.0.1
    env:
    - name: DB
      value: go-demo-8-db
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /
        port: 8080
    readinessProbe:
      httpGet:
        path: /
        port: 8080
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 50m
        memory: 20Mi
This app is defined as a single Pod with one container called go-demo-8. The definition also includes a livenessProbe, a readinessProbe, and resource limits and requests.
Applying the definition to the cluster
Now, we apply that definition to our cluster inside the go-demo-8 Namespace. This will get our application up and running as a Pod.
kubectl --namespace go-demo-8 apply --filename k8s/terminate-pods/pod.yaml
Now it’s time to apply some damage and destroy our application!
Install the Chaos Toolkit Kubernetes Plugin
To run chaos experiments against our application, we will use Chaos Toolkit. The toolkit does not support Kubernetes out-of-the-box; anything beyond its basic features requires a plugin. Let’s install the Kubernetes plugin using pip.
pip install -U chaostoolkit-kubernetes
Note: Explore the Chaos Toolkit plugin using the discover command to see all its features, options, and arguments.
Terminating Application Instances
Let’s start destroying stuff. Look at the first definition that we will use, located in the chaos directory, in the file terminate-pod.yaml.
cat chaos/terminate-pod.yaml
This gives us the following output:
version: 1.0.0
title: What happens if we terminate a Pod?
description: If a Pod is terminated, a new one should be created in its places.
tags:
- k8s
- pod
method:
- type: action
  name: terminate-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8
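Chaos Toolkit also accepts experiments written as JSON, so a definition like this one can be generated from code. The sketch below builds the same structure as a Python dictionary and prints it as JSON. The helper name `terminate_pod_experiment` is hypothetical, and saving the output to a file for `chaos run` is left to the reader.

```python
import json


def terminate_pod_experiment(namespace, label_selector):
    """Build the terminate-pod experiment as a plain dictionary."""
    return {
        "version": "1.0.0",
        "title": "What happens if we terminate a Pod?",
        "description": "If a Pod is terminated, a new one should be created in its place.",
        "tags": ["k8s", "pod"],
        "method": [
            {
                "type": "action",
                "name": "terminate-pod",
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.pod.actions",
                    "func": "terminate_pods",
                    "arguments": {
                        "label_selector": label_selector,
                        "rand": True,
                        "ns": namespace,
                    },
                },
            }
        ],
    }


experiment = terminate_pod_experiment("go-demo-8", "app=go-demo-8")
print(json.dumps(experiment, indent=2))
```

Parameterizing the namespace and label selector makes it easy to point the same experiment at a different application.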
Now that we have seen the definition, let’s run terminate-pod.yaml.
chaos run chaos/terminate-pod.yaml
The output is as follows:
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] No steady state hypothesis defined. That's ok, just exploring.
[... INFO] Action: terminate-pod
[... INFO] No steady state hypothesis defined. That's ok, just exploring.
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: completed
After the initial validation, it ran the experiment called What happens if we terminate a Pod? and found that there is no steady state hypothesis defined. Judging by the output, there is one action: terminate-pod.
Next, it went back to the steady state hypothesis and again determined that there is none. Then, it attempted a rollback and found that none were declared. All we have done so far is execute an action to terminate a Pod. We can see the result in the last line: Experiment ended with status: completed.
Now, let’s output the exit code of the previous command.
echo $?
If we get 0, the command succeeded. On Linux, the exit code tells the system whether a command failed or succeeded.
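Checking the exit code is also how a script or pipeline would decide whether an experiment passed. The sketch below shows the idea with stand-in commands (a Python one-liner that exits with a chosen code), so it runs without a cluster. The helper name `exit_code` is hypothetical; in practice the command list would be something like `["chaos", "run", "chaos/terminate-pod.yaml"]`.

```python
import subprocess
import sys


def exit_code(command):
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(command).returncode


# Stand-in commands: a Python one-liner that exits with a chosen code.
ok = exit_code([sys.executable, "-c", "raise SystemExit(0)"])
bad = exit_code([sys.executable, "-c", "raise SystemExit(1)"])
print(ok, bad)
```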
Now, let’s take a look at the Pods in our Namespace.
kubectl --namespace go-demo-8 get pods
The output states that no resources were found in go-demo-8 namespace.
We deployed the single Pod and ran an experiment that destroyed it. We did not do any validations. We executed a single action to terminate a Pod, which was successful.
Defining steady states
Above, all we did was destroy a Pod. The goal of chaos engineering, however, is to find weak points in our clusters. So, we normally start by defining a steady state that we test before and after an experiment.
If the state is the same before and after, we can conclude that our cluster is fault-tolerant for that case. In the case of Chaos Toolkit, we accomplish this by defining a steady-state-hypothesis section.
We’re going to look at a definition that specifies the state that will be validated before and after an action.
cat chaos/terminate-pod-ssh.yaml
The output includes a new section, shown here as the lines added relative to terminate-pod.yaml (each prefixed with >):
> steady-state-hypothesis:
>   title: Pod exists
>   probes:
>   - name: pod-exists
>     type: probe
>     tolerance: 1
>     provider:
>       type: python
>       func: count_pods
>       module: chaosk8s.pod.probes
>       arguments:
>         label_selector: app=go-demo-8
>         ns: go-demo-8
The new section is steady-state-hypothesis. Now we can run a proper chaos experiment to test our steady state.
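To make the role of tolerance concrete, here is a simplified sketch of how a probe's return value might be compared against it. This is not Chaos Toolkit's actual implementation: `within_tolerance` handles only scalars and lists, and `count_pods_stub` is a hypothetical stand-in for a pod-counting probe that reads from a dictionary instead of a cluster.

```python
def within_tolerance(tolerance, value):
    """Simplified tolerance check: scalars must match exactly,
    lists are treated as the set of acceptable values."""
    if isinstance(tolerance, list):
        return value in tolerance
    return value == tolerance


def count_pods_stub(pods, label):
    """Stand-in for a pod-counting probe: count pods carrying the label."""
    return sum(1 for labels in pods.values() if label in labels)


pods = {"go-demo-8": {"app=go-demo-8"}}
before = within_tolerance(1, count_pods_stub(pods, "app=go-demo-8"))

pods.clear()  # the Pod has been terminated
after = within_tolerance(1, count_pods_stub(pods, "app=go-demo-8"))

print(before, after)  # True False
```

With a tolerance of 1, the hypothesis holds only while exactly one matching Pod exists.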
Running chaos experiment and inspecting the output
Let’s run a chaos experiment to see a proper result.
chaos run chaos/terminate-pod-ssh.yaml
We get the following:
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... CRITICAL] Steady state probe 'pod-exists' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: failed
There is a critical issue here: Steady state probe 'pod-exists' is not in the given tolerance. The probe failed before any action was executed because the Pod was already gone; the previous experiment destroyed it. So, the experiment failed, confirming that the initial state doesn’t match what we want.
So, let’s apply the terminate-pods/pod.yaml definition to recreate the Pod. Then, we can see what happens when we re-run the experiment with the steady-state-hypothesis.
kubectl --namespace go-demo-8 apply --filename k8s/terminate-pods/pod.yaml
Re-run the experiment
With our Pod back, we can re-run the experiment.
chaos run chaos/terminate-pod-ssh.yaml
The output is as follows:
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-pod
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: completed
Now, we see that the probe pod-exists confirmed a correct state and the action terminate-pod was executed. We can also see that the steady state was re-evaluated. The Pod existed before the action, and the Pod existed after the action. But how can the Pod exist if we destroyed it?
Adding a pause
The experiment didn’t fail because our probes and actions were executed immediately after one another. Kubernetes did not have enough time to remove the Pod entirely. So, we need to add a pause to make the experiment more useful. Let’s look at a definition that does this.
cat chaos/terminate-pod-pause.yaml
The relevant addition, shown as the lines added relative to the previous definition (each prefixed with >), is:
>   pauses:
>     after: 10
We see here that we added a pauses section after the action that terminates the Pod. Now, when we execute the action to terminate the Pod, the system will wait 10 seconds before validating our state.
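The effect of the pause can be reproduced with a toy model. The sketch below is an assumption-laden simulation, not real cluster behavior: `FakeCluster` uses a manual clock, and the 5-second removal delay is an arbitrary value picked for illustration.

```python
class FakeCluster:
    """Toy cluster with a manual clock; a terminated Pod lingers for a while."""

    def __init__(self, removal_delay):
        self.now = 0.0
        self.removal_delay = removal_delay
        self.terminated_at = None  # None means the Pod is running

    def terminate_pod(self):
        self.terminated_at = self.now

    def count_pods(self):
        if self.terminated_at is None:
            return 1
        # The Pod is still listed until the removal delay has elapsed.
        return 1 if self.now - self.terminated_at < self.removal_delay else 0

    def sleep(self, seconds):
        self.now += seconds


def pods_after_action(pause_after):
    """Terminate the Pod, wait pause_after seconds, then probe."""
    cluster = FakeCluster(removal_delay=5)
    cluster.terminate_pod()
    cluster.sleep(pause_after)  # plays the role of the pause after the action
    return cluster.count_pods()


print(pods_after_action(pause_after=0))   # 1: the probe still sees the Pod
print(pods_after_action(pause_after=10))  # 0: the Pod is gone by now
```

Probing immediately reports a Pod that is on its way out, which is why the earlier run passed even though the Pod had been destroyed.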
Run the experiment with a pause
Let’s see what we get if we execute this experiment with our pause.
chaos run chaos/terminate-pod-pause.yaml
It gives us the following output:
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-pod
[... INFO] Pausing after activity for 10s...
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... CRITICAL] Steady state probe 'pod-exists' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered
This time, the probe failed and said that steady state probe 'pod-exists' is not in the given tolerance so failing this experiment. We gave Kubernetes enough time to remove the Pod, and then we validated whether the Pod is still there.
The system came back to us saying that the Pod is not present. We can output the exit code of the last command to see that it did indeed fail.
What to learn next
Awesome! We’ve effectively destroyed our application using a steady-state and learned the basics of chaos engineering. Next, we would want to fix the errors that we created to make it fault-tolerant.
From there, we can run many more destructive experiments against our application, such as:
- Probing phases and conditions
- Experimenting with availability
- Draining nodes
- Executing random chaos
- and more
To learn how to implement more chaos experiments, Educative’s course The DevOps Toolkit: Kubernetes Chaos Engineering is the best next step. You’ll be introduced to the different types of experiments you can run in chaos engineering. Towards the end of the course, you will learn how to run experiments in a Kubernetes cluster. By the end, you’ll be a confident chaos engineer.
Happy learning!