The Chaos Mesh team announced the general availability (GA) of Chaos Mesh 1.0 after it was accepted as a CNCF sandbox project in July 2020. Chaos Mesh is a tool to perform chaos engineering experiments on Kubernetes applications.
Chaos Mesh uses standard CRDs for object definitions and also comes with a dashboard for managing and monitoring chaos engineering experiments. The dashboard can be used to “define the scope of the chaos experiment, specify the type of chaos injection, define scheduling rules, and observe the results of the chaos experiment”. Chaos Mesh also comes with a Grafana plugin to view real-time metrics from the chaos engineering experiments. The tool covers fault injection into “pods, the network, system I/O, and the kernel”.
InfoQ reached out to Keao Yang, maintainer and full-time developer for Chaos Mesh, to learn more.
In Chaos Mesh, a chaos experiment is specified as a YAML manifest. The failure types that can be injected include pod failures, network partition failures, virtual memory stressors, modified system time as seen by system calls, and I/O delays. Chaos Mesh can also simulate network latencies across multiple data centers. Under the hood, Chaos Mesh runs as two primary components: a central controller manager, and a DaemonSet that runs a pod on each node and functions as the agent. To limit the applications affected under testing and prevent chaos experiments from impacting critical applications, namespace-level permissions and lists of protected namespaces can be configured.
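As a rough illustration of such a manifest, the sketch below shows what a pod-kill experiment might look like against Chaos Mesh's `v1alpha1` CRD API. The target namespace, label selector, and schedule are illustrative assumptions, not taken from the article; exact field names should be checked against the Chaos Mesh documentation for the installed version.

```yaml
# Hypothetical PodChaos experiment: periodically kill one pod
# matching the (assumed) label app=web-server.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-testing
spec:
  action: pod-kill        # kill the selected pod(s)
  mode: one               # affect a single randomly chosen pod
  selector:
    namespaces:
      - my-app            # assumed target namespace
    labelSelectors:
      "app": "web-server" # assumed target label
  scheduler:
    cron: "@every 2m"     # repeat the experiment every two minutes
```

Applying the manifest with `kubectl apply -f` starts the experiment; deleting the resource stops it.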
Chaos Mesh originally grew out of a testing framework for TiDB – a distributed database. Yang explains that “we believed that our experience from the former practice (with TiDB testing) could help to make not only TiDB stabler, and that’s why we created Chaos Mesh. Chaos Mesh was born generic and is designed to work on any cloud platform and to test any software on the cloud.”
Chaos Mesh does not depend on any specific cloud features. It uses “only the Kubernetes API and the basic function of the Linux kernel”, says Yang, and adds:
As reported by users, Chaos Mesh can work naturally on bare metal clusters and most cloud platforms. However, for some cloud platforms (e.g. OpenShift), special privilege settings are needed. We are working on a document to record these configs.
In response to a question about how Chaos Mesh injects faults internally, Yang explains:
The implementation depends on the kind of “faults”. Some of them are quite simple. For example, Chaos Mesh uses Kubernetes API to kill pods and implement PodChaos. For some other kinds of chaos, Chaos Mesh will send grpc requests to the daemon on related nodes, and the daemon will enter the corresponding network/pid/mnt/… namespace and cgroup and run some commands (such as iptables) to inject faults.
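The network case Yang describes is driven by the same declarative interface: the user submits a `NetworkChaos` resource, and the daemon on the relevant node enters the target pod's network namespace to apply the corresponding rules. A hedged sketch of such a manifest, with assumed selector labels and delay values, might look like this:

```yaml
# Hypothetical NetworkChaos experiment: add latency to pods
# matching the (assumed) label app=web-server for five minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
  namespace: chaos-testing
spec:
  action: delay           # inject network latency
  mode: all               # affect all matching pods
  selector:
    labelSelectors:
      "app": "web-server" # assumed target label
  delay:
    latency: "100ms"      # assumed base delay
    jitter: "10ms"        # assumed jitter
  duration: "5m"
```

Under the mechanism Yang outlines, the controller manager forwards this request to the daemon on each affected node, which applies the delay from inside the pod's network namespace.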
Further, Yang adds that “injecting faults at runtime and limiting the scope of chaos could be a challenge. For example, there isn’t a thing like the time namespace before Linux 5.6, and every process shared the same clock. For this chaos, the implementation is not so straightforward and is really hard to conclude in a short answer”. A separate post describes the implementation of TimeChaos, which simulates clock skew.
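From the user's side, a clock-skew experiment is again just a manifest. The sketch below is a guess at what a `TimeChaos` resource might look like; the `timeOffset` value and selector are illustrative assumptions and should be verified against the Chaos Mesh documentation.

```yaml
# Hypothetical TimeChaos experiment: shift the clock seen by pods
# matching the (assumed) label app=web-server back by ten minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-example
  namespace: chaos-testing
spec:
  mode: one               # affect a single randomly chosen pod
  selector:
    labelSelectors:
      "app": "web-server" # assumed target label
  timeOffset: "-10m"      # assumed skew applied to time-related syscalls
  duration: "30s"
```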
On the near future road map for Chaos Mesh, Yang says:
We are trying to expand the ability of Chaos Mesh as a “platform”, which means it is expected to be able to orchestrate chaos experiments, define complex chaos scenarios, and generate reports for chaos. Another important feature on the road is to support access control in the dashboard.
Other chaos engineering frameworks on Kubernetes are Litmus, Gremlin, and KubeInvaders. As of this writing, Chaos Mesh needs Kubernetes v1.12 or higher and can be installed using either a supplied shell script or Helm charts. The source code for Chaos Mesh is available on GitHub.