Automated Chaos Engineering: ChaosMeta V0.6 Officially Released
ChaosMeta
Posted on November 2, 2023
ChaosMeta V0.6.0 is now officially released! This version contains many new features and enhancements: the orchestration interface now supports node types such as traffic injection and measurement, and the entire drill process can be orchestrated visually, addressing the last of the chaos engineering principles, "automate experiments to run continuously."
Introduction
ChaosMeta is a cloud-native chaos engineering platform designed for automated drills. It provides platform functions such as visual orchestration and scheduling, data isolation, and multi-cloud management, as well as rich fault-injection capabilities covering the entire life cycle of a drill. It embodies the methodology, technical capabilities, and product capabilities that Ant Group has accumulated over many years of large-scale, company-wide red-blue offensive and defensive drills.
New Features
In the new version, fault capabilities such as DNS anomalies and log injection have been added, and the visual orchestration interface now supports node types such as traffic injection and measurement, providing the building blocks for automated chaos engineering.
▌Lossless injection
Log injection is a simple fault capability: in essence it appends text content to a file. More important, though, is the idea of lossless injection that extends from it.
As the name suggests, lossless injection conducts drills without actually affecting the business, in order to discover shortcomings in the application's emergency procedures such as monitoring alarms, mitigation, and self-healing. It is a risk-minimized way to run drills and is well suited to production environments.
There are generally two implementation solutions for lossless injection:
- If an application's monitoring metrics rely on log content, then by injecting the corresponding content into the application's log file, the completeness of the target application's relevant emergency processes can be verified without harm;
- Directly tamper with the monitoring data of target metrics (such as CPU usage) to verify whether the subsequent emergency process is complete.

Below are two walkthrough scenarios for log injection:
(1) Common errors
We usually monitor the count of keywords such as "Error" and "Exception" in log files to determine whether an application is in an abnormal state; a sudden spike very likely means the application has failed. This type of failure can therefore be simulated with the log injection (file append) capability.
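As a purely illustrative sketch (the log path, message format, and rate below are assumptions, not ChaosMeta's actual fault definition), appending error-keyword lines is enough to make such a keyword-count monitor spike:

```python
import time

# Purely illustrative: the log path and message format are assumptions
# for this sketch, not ChaosMeta's fault definition.
LOG_PATH = "/var/log/demo-app/app.log"
ERROR_LINE = "2023-11-02 10:00:00 ERROR OrderService - simulated exception for drill\n"

def inject_error_logs(count: int = 100, interval_s: float = 0.1) -> None:
    """Append error-keyword lines so the keyword-count monitor sees a
    sudden spike, without touching the business logic at all."""
    with open(LOG_PATH, "a") as f:
        for _ in range(count):
            f.write(ERROR_LINE)
            f.flush()
            time.sleep(interval_s)

if __name__ == "__main__":
    inject_error_logs()
```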
(2) Interface request latency
To reduce the performance impact of monitoring collection and reporting, some applications use an asynchronous collection solution: the RPC framework writes the latency and return code of each interface request to a log, and a collection agent then asynchronously gathers the data from the log file and reports it.
For example, consider latency monitoring of a message-push interface, where the latency of each request is collected from the log file and reported to the monitoring platform.
In this case, you can likewise use the log injection (file append) fault capability to simulate a scenario where requests take too long, without actually injecting a network fault into the application.
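Along the same lines, a hedged sketch with a made-up access-log format (a real drill would mirror whatever format the collection agent actually parses):

```python
import random
import time

# Hypothetical RPC access-log format: "<timestamp> <interface> rt=<ms> code=<rc>".
# A real drill would use the format the collection agent parses.
LOG_PATH = "/var/log/demo-app/rpc-digest.log"

def inject_slow_requests(interface: str = "com.demo.MsgPushService.push",
                         count: int = 50, base_latency_ms: int = 5000) -> None:
    """Append records claiming each call of the interface took ~base_latency_ms,
    so the latency monitor reacts without any real network fault."""
    with open(LOG_PATH, "a") as f:
        for _ in range(count):
            rt = base_latency_ms + random.randint(0, 500)  # jitter so it looks organic
            f.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {interface} rt={rt}ms code=0\n")
            f.flush()
            time.sleep(0.2)
```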
Lossless injection is very efficient when you only need to quickly verify standardized emergency-response capabilities, such as monitoring alarms, fault localization, and contingency plans, across a large number of applications.
▌Running experiments automatically
The industry-recognized principles of chaos engineering are:
- Build a hypothesis around steady-state behavior
- Vary real-world events
- Run experiments in production
- Minimize blast radius
- Automate experiments to run continuously
The industry has established ways to implement the first four of these, but the last, "automate experiments to run continuously," has never had a good implementation.
Of course, many well-known chaos engineering projects have tried to solve this problem, and most of them offer scheduled execution as a product capability. However, it is doubtful whether scheduled execution alone can be used at scale in a production environment.
The reason is twofold. On the one hand, fault injection is a high-risk action: without sufficient admission and other pre-checks, there is not enough confidence to trigger it automatically. On the other hand, a drill is not just "fault injection"; we often also need to do a lot of other "manual analysis" work, such as checking before injection whether the target application's status, the current environment, and the traffic level meet the preset conditions, and after injection measuring how long discovery, localization, and recovery take in order to analyze emergency efficiency.
ChaosMeta breaks this "manual analysis" work down into atomic execution tasks of different types, organized into node types such as "fault injection", "measurement", "traffic injection", and "wait", which flexible orchestration then combines into automated drill scenarios with various business semantics, as sketched below.
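Conceptually only (this is not ChaosMeta's actual CRD or API; the node names and runner are illustrative), the idea can be pictured as a typed sequence of nodes that a runner executes in order, stopping the drill as soon as any check fails:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Node:
    """One atomic step of a drill: fault injection, measurement,
    traffic injection, wait, and so on."""
    kind: str                   # e.g. "measure", "fault", "traffic", "wait"
    name: str
    run: Callable[[], bool]     # returns True on success

def run_drill(nodes: List[Node]) -> bool:
    """Execute nodes in order; abort and report failure as soon as one
    fails, so the person in charge can be alerted to intervene."""
    for node in nodes:
        ok = node.run()
        print(f"[{node.kind}] {node.name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False
    return True

# Example shape: pre-check -> inject fault -> measure -> wait for recovery
drill = [
    Node("measure", "replicas >= 2",           lambda: True),
    Node("fault",   "kill one pod",            lambda: True),
    Node("measure", "service still available", lambda: True),
    Node("wait",    "cooldown before recheck", lambda: True),
]
run_drill(drill)
```

Here are a few simple examples of drill scenarios built from such nodes: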
Keeping high service availability fresh
Online applications generally have high-availability requirements, such as multiple replicas plus automatic load balancing. Regular drills are a way to ensure that these high-availability capabilities stay fresh.
Since this is a production environment, we cannot simply trigger the drill automatically, because we cannot guarantee that multiple replicas of the application are available at every moment. For example, if only one replica happens to be available before the drill while a large amount of user traffic is flowing in, automatically launching the configured drill could have immeasurable consequences.
In this example, several checks can increase confidence in the automated drill: confirm that the application has multiple replicas, that user traffic to the service is within an acceptable range, and that the application returns to the multi-replica state after the drill.
Since the purpose of the drill is to verify the service's high availability, a corresponding service-availability measurement is also necessary.
As long as all of these worrying factors are configured into the orchestration, a successful run means the drill met expectations; if the run fails, the responsible person is notified through an alarm to intervene. This greatly reduces the manpower that drills require.
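As a rough illustration of those checks, here is a sketch using the Kubernetes Python client with a hypothetical Deployment name, namespace, and health endpoint (the traffic-level check against the monitoring platform is omitted); in ChaosMeta these would be measurement nodes in the orchestration rather than a script:

```python
from kubernetes import client, config   # pip install kubernetes
import urllib.request

# Illustrative assumptions: a Deployment "demo-app" in namespace "prod"
# with an HTTP health endpoint reachable from where this runs.
APP, NS = "demo-app", "prod"
HEALTH_URL = "http://demo-app.prod.svc/healthz"

def ready_replicas() -> int:
    config.load_kube_config()
    dep = client.AppsV1Api().read_namespaced_deployment(APP, NS)
    return dep.status.ready_replicas or 0

def service_available() -> bool:
    try:
        return urllib.request.urlopen(HEALTH_URL, timeout=2).status == 200
    except OSError:
        return False

# Pre-check: only let the drill proceed automatically if it is safe to do so.
assert ready_replicas() >= 2, "abort drill: fewer than 2 ready replicas"
# ... the fault-injection node (e.g. kill one replica) would run here ...
assert service_available(), "alarm: service unavailable during the drill"
# Post-check: the application should return to the multi-replica state.
assert ready_replicas() >= 2, "alarm: replica count did not recover"
```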
Red and blue attack and defense automation
In red-blue offensive and defensive drills, the blue team is generally responsible for designing the drill scenarios and, at the end, for objectively evaluating the red team's emergency efficiency (both personnel and platforms) in order to guide the direction of the red team's defense capability building.
A common evaluation criterion is whether the red team's discovery, localization, and recovery times for a fault meet the requirements of 1, 5, and 10 minutes respectively; otherwise points are deducted. Since elapsed time is involved, an accurate starting time point (when the fault takes effect) and target time points (when the fault is discovered, localized, and recovered) are necessary.
The starting time point is the moment when the fault, as the red team defines it, has actually formed, which is not necessarily the moment of fault injection in the traditional sense. For example, if the target service promises a latency below 3000ms, then only when the network delay pushes service latency above 3000ms is it considered a fault that the red team must respond to; likewise, the business is considered recovered only once latency is back below 3000ms. Calculating directly from the time of the fault-injection operation would therefore introduce a large error, and the injection might not even produce the fault the red team has in mind, so measuring whether the fault has actually taken effect is also a necessary step.
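For instance, a minimal sketch of how the "fault takes effect" and "business recovered" time points could be determined by polling latency against the 3000ms promise; the latency probe here is a caller-supplied placeholder (in ChaosMeta this role would be played by measurement nodes):

```python
import time
from typing import Callable, Dict

LATENCY_SLO_MS = 3000   # the latency bound the target service promises to stay under

def measure_drill(probe_latency_ms: Callable[[], float],
                  poll_interval_s: float = 1.0) -> Dict[str, float]:
    """Record when the fault actually takes effect (latency first exceeds
    the SLO) and when the business recovers (latency drops back under it).
    The probe is supplied by the caller, e.g. a monitoring-platform query."""
    effective_at = recovered_at = None
    while recovered_at is None:
        over_slo = probe_latency_ms() > LATENCY_SLO_MS
        now = time.time()
        if over_slo and effective_at is None:
            effective_at = now
        elif not over_slo and effective_at is not None:
            recovered_at = now
        time.sleep(poll_interval_s)
    return {"effective_at": effective_at,
            "recovered_at": recovered_at,
            "recovery_seconds": recovered_at - effective_at}
```

The resulting recovery duration can then be compared against the 1-5-10 targets, instead of timing from the moment the injection command was issued.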
To evaluate the red team's emergency-response efficiency for each fault, the blue team needs to collect data from various emergency platforms (monitoring, localization, self-healing, and so on). Collecting and analyzing this data manually is arduous, yet traditional chaos engineering platforms only provide fault simulation, so these "manual operations" must be repeated for every run of the same drill scenario. ChaosMeta aims to configure these "manual operations" into the platform to improve drill efficiency.
Network fault attack and defense drills
Here is a simple red-blue attack and defense drill example for applications in a scenario where network latency is too high:
- Since network traffic monitoring is involved, service traffic is a prerequisite: if network delay is injected while there is no service traffic, the excessive-latency alarm will never fire, so a mock-traffic node needs to be configured (see the sketch after this list);
- Access detection is also required, to measure whether the current traffic level meets expectations; otherwise the subsequent results will most likely not be accepted by the red team;
- What remains is to measure the point at which the fault takes effect and the point at which the business recovers, which provides the data for analyzing emergency efficiency.
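A sketch of the mock-traffic and access-detection steps, with a hypothetical endpoint and traffic threshold (in ChaosMeta these would be traffic-injection and measurement nodes rather than a script):

```python
import concurrent.futures
import time
import urllib.request
from typing import Callable

TARGET_URL = "http://demo-app.prod.svc/api/ping"   # hypothetical endpoint
EXPECTED_QPS = 20                                  # traffic level this drill assumes

def mock_traffic(duration_s: int = 60, workers: int = 8) -> None:
    """Generate background requests so the latency alarm has real traffic
    to observe once the network delay is injected."""
    def worker() -> None:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            try:
                urllib.request.urlopen(TARGET_URL, timeout=3).read()
            except OSError:
                pass
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)

def access_detection(current_qps: Callable[[], float]) -> bool:
    """Access detection: confirm the observed traffic level meets expectations
    before injecting the delay fault; current_qps would typically query the
    monitoring platform."""
    return current_qps() >= EXPECTED_QPS
```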
Future Directions
Next, we will continue to improve capabilities in all of the following areas:
- Support multi-cloud and non-cloud management, that is, manage cross-cluster pods/nodes as well as non-Kubernetes machines and bare containers;
- Improve the data-analysis side of the measurement capability: currently it can only measure a single moment, such as when a fault takes effect, is localized, or is recovered, and cannot combine the moments of multiple nodes for analysis (for example, recovery time - effective time < target duration);
- Support more atomic capabilities for the various node types, such as business-level fault capabilities for mainstream open-source projects like MySQL, OceanBase, and Redis;
- Support fault and measurement capabilities related to the stability risks of large-model training and inference architectures, such as GPU high-load injection.
Join Us
As an open project, we embrace the open-source R&D model and are committed to building the ChaosMeta community into an open and creative one. Going forward, all R&D, discussions, and other related work will take place transparently in the community.
We welcome any form of participation, including but not limited to questions, code contributions, technical discussions, and feature suggestions. We look forward to community ideas and feedback that drive the project forward.
If you are interested in our project or design concept, please star our project to support it.
GitHub:https://github.com/traas-stack/chaosmeta
Documentation:https://chaosmeta.gitbook.io/chaosmeta-en
Email:chaosmeta.io@gmail.com
Twitter:AntChaosMeta
DingTalk Group:21765030887