The concept of chaos engineering emerged as a way to detect problem areas in your system before outages occur. Netflix created its own variation of the concept, called Chaos Monkey, in 2010 to help ensure that outages of instances running on Amazon Web Services (AWS) would not affect its streaming services.
But for organizations that lack the bandwidth to implement a Netflix-like Chaos Monkey process to test their systems, Gremlin Free, launched in February, lets an organization or individual get started with chaos engineering and see how it works without having to use a credit card.
During this demonstration for The New Stack, Lorne Kligerman, director of product at Gremlin, described in detail the concept of chaos engineering and how it works, using Gremlin Free.
“Downtime is expensive and damages customer trust,” Kligerman said. “Gremlin gives you the ability to safely and securely find weaknesses in your system before they become problems.”
Among other things, chaos engineering allows organizations to introduce failure into their systems in a planned and controlled way, Kligerman said. “You can find those issues yourself before they cause outages in the real world and before they really hurt your end-user experience, which is what we are looking to prevent,” he said.
Breaking systems and observing how monitoring software and other systems react is something Kolton Andrus, co-founder and CEO of Gremlin, is familiar with, as he was a chaos engineer at Netflix. Kligerman also said he was able to observe how chaos engineering worked while working at Google as a product manager and solutions engineer. “The only way to really create a really resilient system is to try to break it on your own terms,” Kligerman said.
The idea with Gremlin Free is to “get it into the community so everyone can give it a go and to expand on the concept of Chaos Monkey,” Kligerman said. Chaos Monkey “is more of a random thing when you’re randomly shutting down instances in servers,” he said. “And we are really about practicing chaos engineering in a thoughtful and controlled manner.”
An individual or organization gets started with Gremlin Free by first installing the Gremlin client on a server, which can be a physical host, a virtual machine or a container, including one that is part of a Kubernetes deployment. Gremlin Free offers two types of attack: CPU attacks and state attacks. The state attack is a shutdown attack, similar to what Chaos Monkey does, Kligerman said.
During the demo, Kligerman showed what you see on a Datadog dashboard when CPU usage spikes. It is also possible to see events indicating when the attacks started and stopped as part of the chaos engineering process.
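The check described here, whether a spike shows up between an attack's start and stop events, can be sketched in a few lines of Python. This is only an illustration: the `Sample` type, the `spike_visible` function and the 80 percent threshold are assumptions made for the example, and it does not use the Datadog or Gremlin APIs.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Sample:
    """One monitoring data point (hypothetical shape, not a Datadog type)."""
    timestamp: float    # seconds since the epoch
    cpu_percent: float  # CPU utilization, 0-100


def spike_visible(samples: Iterable[Sample],
                  attack_start: float,
                  attack_stop: float,
                  threshold: float = 80.0) -> bool:
    """Return True if any sample taken during the attack window
    reaches the threshold, i.e. the monitoring data shows the spike."""
    return any(
        attack_start <= s.timestamp <= attack_stop
        and s.cpu_percent >= threshold
        for s in samples
    )
```

If this returns False for an attack that definitely ran, the gap is in the monitoring rather than in the system under test, which is the kind of weakness the exercise is meant to expose.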
When you launch a CPU attack, for example, it allows you to watch your instance and make sure your monitoring tools are working properly, by checking whether you can see a spike and whether your tool sends you an alert when that spike happens, Kligerman said. With a CPU attack, you select how many seconds you want the attack to last and how many CPU cores you want to affect. It is possible to run the attack now, schedule it for later or even have it run at a random time within a set window.
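To make the two parameters concrete, here is a minimal Python sketch of a CPU load generator that takes a duration in seconds and a number of cores. It is not Gremlin's implementation or client; the `cpu_attack` function and `BURN_SNIPPET` constant are hypothetical names, and the approach (one busy-looping child process per core) is simply one assumed way to produce a comparable spike on a test machine.

```python
import os
import subprocess
import sys
import time

# Code run by each worker process: busy-loop until its own deadline passes.
BURN_SNIPPET = (
    "import sys, time\n"
    "deadline = time.monotonic() + float(sys.argv[1])\n"
    "while time.monotonic() < deadline:\n"
    "    pass\n"
)


def cpu_attack(cores: int, seconds: float) -> float:
    """Keep up to `cores` CPU cores busy for `seconds`.

    Starts one child Python process per core and waits for all of them.
    Returns the elapsed wall-clock time in seconds.
    """
    available = os.cpu_count() or 1
    cores = max(1, min(cores, available))  # clamp to what the host has

    start = time.monotonic()
    workers = [
        subprocess.Popen([sys.executable, "-c", BURN_SNIPPET, str(seconds)])
        for _ in range(cores)
    ]
    for worker in workers:
        worker.wait()
    return time.monotonic() - start
```

Calling `cpu_attack(cores=2, seconds=30)` on a non-production host should produce a visible rise in CPU usage for roughly 30 seconds, which is enough to check whether a dashboard shows it and whether an alert fires.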
When running a simple CPU or shutdown attack, “you can see what happens with your auto-scaled instances when they disappear,” Kligerman said. “This is the way to do that, with no credit card required: install it on your own time, run an attack and see what happens,” he said. “We think you will see some impact there, you will make your systems better and your customers will be happier because of it.”
The idea is to allow the industry to learn firsthand how chaos engineering works through hands-on practice. “We think the industry has really understood what it is and why you should do it,” Kligerman said. “Now, it’s all about how I get started, how do I see the impact of chaos engineering on my system, whether you are a small start-up or a very large company with hundreds or thousands of teams.”