Scheduling Chaos: An introduction to the Litmus Chaos Scheduler
Sanjay Nathani
Posted on July 31, 2020
Introduction
Hey all! I am Sanjay Nathani, one of the contributors to the LitmusChaos project and a Software Engineer at MayaData. By now, I assume you are already familiar with the concept of cloud-native chaos engineering and how the LitmusChaos project enables you to achieve it.
As members of the larger chaos engineering community, one of the observations we made while examining the use-cases of different adopters was that chaos needs to be made available as a background service. While random injections via manual execution of experiments in pre-prod/production (read: gamedays) and CI-driven execution on dev environments are still the norm in many cases, many organizations are adopting a continuous-chaos strategy as part of a shift-left paradigm, in which staging clusters (or equivalent environments that mimic prod characteristics and traffic) are subjected to service and infrastructure faults repeatedly, in a periodic or random fashion. The goal, in most of these cases, is to observe the resilience of the microservices at various times and operational states. The load on the microservices in a cluster varies over the course of its existence: there are peak traffic periods, which may last for a few hours in a day or a few days in a month, and it is necessary to compare how the KPIs (key performance indicators) fare at different periods upon failures.
Based on this, we decided to create the chaos-scheduler to inject chaos repeatedly, while providing a flexible schema with which developers and SREs can automate chaos runs, define the minimum interval between two instances of chaos, or specify the total number of chaos instances across a time range.
What is Chaos Scheduler?
The Chaos Scheduler is a Kubernetes controller (built using the Operator-SDK framework) that reconciles a custom resource called ChaosSchedule, which is essentially a higher-level abstraction that embeds the (now-familiar) ChaosEngine template along with a schedule specification. While still an alpha component today, the Chaos Scheduler is already seeing adoption and is poised to become an optional component in the Litmus deployment bundle (helm chart).
In this blog, let's take a closer look at the scheduling options provided by the chaos scheduler and how you can give it a spin in your cluster.
Dissecting the ChaosSchedule Custom Resource
The ChaosSchedule is the core schema that defines the chaos workflow for a given Application Under Test (AUT) or Node Under Test (NUT). It defines the following:
Execution Schedule for the experiments
Template Spec of ChaosEngine detailing the chaos action
As mentioned earlier, one of the goals with the Chaos Scheduler was to provide a flexible and rich set of configuration options. After all, the standard Kubernetes CronJob already exists if the only requirement is to repeat the chaos action. As of today, there are 3 ways in which we can schedule the chaos to be injected:
Now: This will trigger the chaos as soon as the ChaosSchedule CR is created and is similar to the on-demand execution model available today.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    now: true
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'
              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'
              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
Once: This will schedule the chaos at a specific time denoted by executionTime.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    once:
      executionTime: "2020-05-12T05:47:00Z" #should be modified according to current UTC Time
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'
              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'
              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
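Since executionTime has to be a UTC timestamp close to "now", it can be handy to generate it instead of typing it by hand. The following is a minimal sketch, assuming GNU date (the -d option is not available in BSD/macOS date); the helper name utc_timestamp is purely illustrative and not part of Litmus.

```shell
#!/usr/bin/env bash
# Print a UTC timestamp in the same format as the executionTime value above.
# Assumption: GNU coreutils date is installed (needed for the -d option).
utc_timestamp() {
  # $1: optional relative offset understood by GNU date, e.g. "+2 minutes"
  date -u -d "${1:-now}" +"%Y-%m-%dT%H:%M:%SZ"
}

# Schedule the one-off chaos run two minutes from now.
execution_time="$(utc_timestamp '+2 minutes')"
echo "executionTime: \"${execution_time}\""
```

The printed line can be pasted into the once block of the ChaosSchedule manifest.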
Repeat: This type of schedule will ensure the repeated execution of chaos over a time range. We define the startTime & endTime with a minChaosInterval specified to ensure a mandatory cool-off period to observe adherence to MTTR (Mean-Time-To-Recover). This option also allows whitelisting/blacklisting days of a week for chaos.
Here is a sample of how to inject the chaos in this way.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      startTime: "2020-05-12T05:47:00Z" #should be modified according to current UTC Time
      endTime: "2020-05-12T05:52:00Z" #should be modified according to current UTC Time
      minChaosInterval: "2m" #format should be like "10m" or "2h" accordingly for minutes and hours
      instanceCount: "2"
      includedDays: "mon,tue,wed"
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'
              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'
              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
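Before applying a repeat schedule, it can help to sanity-check that the requested number of instances fits into the time range at all. The sketch below is not the scheduler's own algorithm; it only checks a necessary condition that follows from the description above: if instances must be at least minChaosInterval apart, then (instanceCount - 1) intervals have to fit between startTime and endTime. It assumes GNU date, and the interval is written out in seconds by hand.

```shell
#!/usr/bin/env bash
# Rough feasibility check for a repeat schedule (illustrative only, not the
# scheduler's actual logic). Values are taken from the sample manifest above.
start="2020-05-12T05:47:00Z"
end="2020-05-12T05:52:00Z"
interval_seconds=120   # minChaosInterval: "2m", converted by hand
instance_count=2

# Assumption: GNU date, which can parse ISO 8601 timestamps via -d.
start_epoch=$(date -u -d "$start" +%s)
end_epoch=$(date -u -d "$end" +%s)

window=$(( end_epoch - start_epoch ))
needed=$(( (instance_count - 1) * interval_seconds ))

if [ "$needed" -le "$window" ]; then
  fits=yes
else
  fits=no
fi
echo "window=${window}s needed=${needed}s fits=${fits}"
```

For the sample values, the window is 300 seconds and two instances need at least 120 seconds between them, so the schedule is feasible under this simple check.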
Needless to say, the ChaosSchedule is referenced as the owner of the secondary resources (chaosengine) with Kubernetes DeletePropagation policies ensuring their removal too upon deletion of the ChaosSchedule CR.
Getting Started
In this section, let us view the steps involved in setting up a demo environment to try out the Chaos Scheduler.
Install Litmus Chaos Operator, RBAC and CRDs
kubectl apply -f https://litmuschaos.github.io/pages/litmus-operator-latest.yaml
namespace/litmus created
serviceaccount/litmus created
clusterrole.rbac.authorization.k8s.io/litmus created
clusterrolebinding.rbac.authorization.k8s.io/litmus created
deployment.apps/chaos-operator-ce created
customresourcedefinition.apiextensions.k8s.io/chaosengines.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosexperiments.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosresults.litmuschaos.io created
Install Chaos Scheduler and its CRDs
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-scheduler/master/deploy/crds/chaosschedule_crd.yaml
customresourcedefinition.apiextensions.k8s.io/chaosschedules.litmuschaos.io created
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-scheduler/master/deploy/chaos-scheduler.yaml
deployment.apps/chaos-scheduler created
Create the pod delete Chaos Experiment in default namespace
NOTE: In this example, I intend to inject chaos on a single replica Nginx deployment running in the default namespace. Modify according to your environment.
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-charts/1.4.0/charts/generic/pod-delete/experiment.yaml
chaosexperiment.litmuschaos.io/pod-delete created
Set up the RBAC to execute the pod-delete chaos
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-charts/1.4.0/charts/generic/pod-delete/rbac.yaml
serviceaccount/pod-delete-sa created
role.rbac.authorization.k8s.io/pod-delete-sa created
rolebinding.rbac.authorization.k8s.io/pod-delete-sa created
Before proceeding, let's verify that everything is up and running.
The Chaos Scheduler and Chaos Operator pods should both be in the Running state
kubectl get po -n litmus
chaos-operator-ce-5cd5894879-k7wgz 1/1 Running 0 10m
chaos-scheduler-84fcccb5bd-mjpnj 1/1 Running 0 10m
Ensure the Service Accounts for scheduler and operator are created
kubectl get sa -n litmus
default 1 10m
litmus 1 10m
scheduler 1 10m
Ensure the service account for the intended experiment is created successfully
kubectl get sa
default 1 10m
pod-delete-sa 1 10m
Now we can safely proceed.
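The manual checks above can also be scripted. Here is a minimal sketch that waits for both deployments created earlier (chaos-operator-ce and chaos-scheduler) to finish rolling out. The function name wait_for_litmus and the 120-second timeout are arbitrary choices for illustration, not something Litmus provides.

```shell
#!/usr/bin/env bash
# Wait until the operator and scheduler deployments report a successful rollout.
# wait_for_litmus is a hypothetical helper; the timeout is an arbitrary choice.
wait_for_litmus() {
  # $1: namespace the deployments live in (defaults to "litmus")
  ns="${1:-litmus}"
  for deploy in chaos-operator-ce chaos-scheduler; do
    kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=120s || return 1
  done
}

# Usage against a live cluster:
#   wait_for_litmus litmus && echo "Litmus is ready"
```

The function returns a non-zero status as soon as one of the rollouts fails or times out, which makes it easy to use in a setup script.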
Create a ChaosSchedule yaml with the application and experiment information along with the scheduling logic
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
  namespace: litmus
spec:
  schedule:
    repeat:
      startTime: "2020-05-12T05:47:00Z" #should be modified according to current UTC Time
      endTime: "2020-05-12T05:52:00Z" #should be modified according to current UTC Time
      minChaosInterval: "2m" #format should be like "10m" or "2h" accordingly for minutes and hours
      instanceCount: "2"
      includedDays: "mon,tue,wed"
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'
              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'
              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
Create a ChaosSchedule custom resource
kubectl apply -f chaos-schedule.yaml
Watch the injection of chaos at any point of time
watch kubectl get pod
Describe the ChaosSchedule for the details of chaos injection.
kubectl describe chaosschedule schedule-nginx
Name: schedule-nginx
Namespace: default
Labels: <none>
Annotations:
API Version: litmuschaos.io/v1alpha1
Kind: ChaosSchedule
Metadata:
Creation Timestamp: 2020-05-14T08:44:32Z
Generation: 3
Resource Version: 899464
Self Link: /apis/litmuschaos.io/v1alpha1/namespaces/default/chaosschedules/schedule-nginx
UID: 347fb7e6-2c9d-428e-9ce1-42bdcfdab37d
Spec:
Chaos Service Account:
Engine Template Spec:
Appinfo:
Appkind: deployment
Applabel: app=nginx
Appns: default
Chaos Service Account: litmus
Components:
Runner:
Experiments:
Name: pod-delete
Spec:
Components:
Rank: 0
Job Clean Up Policy: retain
Schedule:
Repeat:
End Time: 2020-05-12T05:52:00Z
Included Days: Mon,Tue,Wed
Instance Count: 2
Min Chaos Interval: 2m
Start Time: 2020-05-12T05:47:00Z
Schedule State: active
Status:
Active:
API Version: litmuschaos.io/v1alpha1
Kind: ChaosEngine
Name: schedule-nginx
Namespace: default
Resource Version: 899463
UID: 14f49857-8879-4129-a5b9-a3a592149725
Last Schedule Time: 2020-05-14T08:44:32Z
Schedule:
Start Time: 2020-05-14T08:44:32Z
Status: running
Total Instances: 1
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 39s chaos-scheduler Created engine schedule-nginx
Halting ChaosSchedule
At any point in time, we can halt a ChaosSchedule, which simply stops any further execution of chaos.
Halting a schedule is useful when we do not want to disturb the production cluster or an application for a while because some important activity (such as a migration) is under way. We can halt the schedule without the effort of deleting and recreating it.
To halt the schedule, change spec.scheduleState to halt:
spec:
  scheduleState: halt
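Instead of editing the manifest, the same field can be changed in place with a merge patch. The sketch below only builds the patch body; schedule_state_patch is a hypothetical helper name. Treating active as the value that resumes a schedule is an assumption based on the "Schedule State: active" line in the describe output above.

```shell
#!/usr/bin/env bash
# Build a JSON merge patch that sets spec.scheduleState.
# "halt" and "active" are the two state values that appear in this post;
# using "active" to resume a halted schedule is an assumption.
schedule_state_patch() {
  case "$1" in
    halt|active) printf '{"spec":{"scheduleState":"%s"}}\n' "$1" ;;
    *) echo "unsupported state: $1" >&2; return 1 ;;
  esac
}

patch="$(schedule_state_patch halt)"
echo "$patch"

# Against a live cluster (replace <namespace> with where the schedule lives):
#   kubectl patch chaosschedule schedule-nginx -n <namespace> --type merge -p "$patch"
```

Restricting the helper to known values guards against typos, which would otherwise only surface once the scheduler reconciles the resource.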
Conclusion
With the Chaos Scheduler, the user is no longer burdened with re-applying chaosengine manifests or remembering to run chaos at different times, and instead only has to compare execution results! As you read this, the Chaos Scheduler is being improved to support randomized execution within a time range. So, more power coming your way!! Do try out the steps and let us know what you think of the scheduler and which use-cases it should support!
Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join Our Community #litmus channel in Kubernetes Slack
Contribute to LitmusChaos and share your feedback on Github
If you like LitmusChaos, become one of the many stargazers here
Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q