June 11th 2020
Over 20 years of experience in IT with a passion for application architecture and security.
Your infrastructure will fail. It’s not an if but a when. As the rise of microservices and serverless makes apps more distributed, the number of potential fault points grows exponentially. We may attempt to engineer our systems against certain expected failures only to make things worse, such as well-intentioned retry logic further overloading an already stressed server and causing failures to cascade across the enterprise.
In the old days you had a person or one team who understood your system so well that they could engineer against most failures or immediately diagnose and fix the unexpected production failures that did slip through. That was possible with a monolithic app. With a microservices architecture, those days are coming to an end. We now need an approach to system resilience that assumes the system is too complex for humans to fully understand and that things will break in ways we cannot predict. Chaos engineering is a methodology that takes that approach.
What Is Chaos Engineering?
Chaos engineering is a methodology that discovers your system’s faults by intentionally injecting problems into production systems in a controlled manner. The injected faults are wide ranging: added latency, simulated disk failures, node outages, even the simulated outage of an entire region.
Benefits of Chaos Engineering
Pioneered by Netflix, chaos engineering is still a fairly new methodology. Many firms have shown interest in it, but few outside the tech industry are executing it, and of those that are, few have reached a high level of maturity. Yet the firms I have spoken with who have implemented it, or whose clients have implemented it, all spoke positively of chaos engineering and see strong value in it. Here are some benefits they found.
- Less impactful outages. Utilizing chaos engineering as part of a broader site reliability engineering effort results in fewer and less impactful system outages. You will discover lurking fragility before that fragility causes a widespread outage.
- Agility. A key benefit of distributed or microservices architecture is agility. But you can’t move fast if the exponential growth in services results in an exponential growth in failures. Chaos engineering brings these failures under control so that you can reap maximum agility benefits from distributed apps.
- Fewer manual testing hours. Automated and randomized chaos tests result in more test coverage and fewer labor hours spent manually testing for fragility.
- Improved ROI. Your resilience investments reap better ROI because you focus on hardening systems with proven fragility while not over-investing in redundancy for systems proven resilient by the automated chaos testing.
- Better code. Some companies report higher quality code because developers and architects change their mindset to viewing outages as something expected rather than a rare exception or afterthought. This is especially true if you run chaos tests continuously in non-prod and developers come in the next morning to see the code they checked into QA just broke a bunch of automated chaos tests.
- More effective recovery. A common chaos test is to run disaster recovery against experimentally induced outages. Doing this discovers problems in recovery scripts and procedures before a real emergency. It also keeps recovery knowledge fresh among the employees who execute it, knowledge that would otherwise be lost in a too-stable system where recovery is rarely exercised.
- Improved monitoring. Chaos testing helps improve your overall monitoring and distributed traceability due to the requirements to execute a chaos test, as you will see below.
Characteristics of a Chaos Test
Although some people think chaos engineering is a wild west of just going into production and seeing what happens, that is absolutely not the case. There is a lot of infrastructure maturity you must have in place as a prerequisite. Every chaos test has the following key characteristics, and your infrastructure must have the maturity to meet these characteristics before you can execute chaos testing.
- Confidence. Never test anything in production that you think might fail. A chaos test is first vetted in non-prod. If it fails there, you fix the issue. Only when you are 100% confident it will not fail do you run it in production.
- Risk is contained. You typically isolate a subset of your traffic into two groups: a test group that has the fault injected and a control group whose monitors are compared against the test group’s monitors. Maybe 0.5% of your traffic is in the test group and 0.5% in the control monitoring group with the other 99% being business as usual. This means that a chaos test isn’t just bringing some node completely offline as some mistakenly believe. Rather, it’s bringing it offline only for 0.5% of its clients, or whatever % you choose for your test group.
- Hypothesis. Every test should have a hypothesis based on business performance indicators. For example, an online retailer could hypothesize that if their product suggestion service fails to respond within 200ms, then the system will simply not suggest products while all other functionality continues unaffected.
- Failure criteria. Define the failure criteria upfront for when to abort the test. For example, abort if the average response time of the test group exceeds that of the control group by more than 10%, or if the test group logs 1% more errors than the control group.
- Monitoring. You must have mature monitoring infrastructure to observe the health of the system during the test to be able to immediately abort. Netflix in particular strongly emphasizes the need for distributed tracing. This relates to the above containment point. You probably don’t want the test group to be 0.5% of your traffic. Rather, you may want it to be 0.5% of your customers, which means assigning 0.5% of your customers to the test group (and another 0.5% to the control group) and tracing each customer’s calls throughout the system so all calls spawned by that customer remain in the appropriate group.
- Automation. The test can be executed automatically through CI/CD scripts. Aborting the test can also be done immediately through automation. At its highest maturity you constantly run vetted chaos tests in production without even notifying on-call staff. These tests automatically abort and raise an alert when they detect the failure criteria. Automating these infrastructure changes typically also requires containers or public cloud. Automating chaos tests on non-cloud VMs or bare metal can be challenging if not impossible.
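To make the containment and abort characteristics above concrete, here is a minimal sketch in Python. It assumes hypothetical names throughout (`assign_group`, `should_abort`, the metric keys): customers are deterministically hashed into test, control, or normal groups, and the failure criteria from the list (response time more than 10% over control, or error rate more than 1 percentage point higher) trigger an abort. A real harness would pull metrics from your monitoring system rather than take them as dictionaries.

```python
import hashlib

# Illustrative group sizes from the text; a real test would tune these.
TEST_PCT = 0.5      # 0.5% of customers receive the injected fault
CONTROL_PCT = 0.5   # 0.5% are monitored as the comparison baseline

def assign_group(customer_id: str) -> str:
    """Deterministically bucket a customer so that every call they make
    lands in the same group for the duration of the test."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100.0  # uniform in 0.00-99.99
    if bucket < TEST_PCT:
        return "test"
    if bucket < TEST_PCT + CONTROL_PCT:
        return "control"
    return "normal"  # the other ~99%: business as usual

def should_abort(test_metrics: dict, control_metrics: dict) -> bool:
    """Failure criteria from the text: abort if the test group is more
    than 10% slower, or logs 1 percentage point more errors, than control."""
    latency_regression = (
        test_metrics["avg_response_ms"]
        > control_metrics["avg_response_ms"] * 1.10
    )
    error_regression = (
        test_metrics["error_rate"] - control_metrics["error_rate"] > 0.01
    )
    return latency_regression or error_regression
```

Note the design choice of hashing the customer ID rather than sampling raw traffic: it gives the per-customer stickiness the monitoring point above calls for, so every call a customer spawns can be traced within the same group.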
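The retailer hypothesis in the list above — suggestions that fail to answer within 200ms simply don’t appear — describes graceful degradation on the service-consumer side. A minimal sketch of that behavior, with the function name and timeout as illustrative assumptions, might look like:

```python
import concurrent.futures

SUGGESTION_TIMEOUT_S = 0.2  # the 200ms budget from the hypothesis

def fetch_suggestions(fetch_fn, timeout_s=SUGGESTION_TIMEOUT_S):
    """Call a (hypothetical) product-suggestion service, degrading to
    'no suggestions' instead of failing the page when it is slow or down."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch_fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return []  # too slow: render the page without suggestions
    except Exception:
        return []  # service error: degrade locally, don't cascade
    finally:
        pool.shutdown(wait=False)  # don't block the page on the slow call
</```

A chaos test then injects latency into the suggestion service for the test group and verifies that checkout, search, and every other business indicator stay level with the control group.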
Reaching that level of maturity is difficult and is likely why many companies have not fully embraced it yet. You should be able to see from those steps that it requires a maturity in public cloud or containers, CI/CD, and end-to-end monitoring that many companies aspire to have for many reasons besides chaos engineering but have not yet attained.
Nonetheless, like any IT investment, sometimes the highest level of maturity does not bring enough value to your unique situation to justify the cost. Many companies find value from going partway there, similar to how many companies are decomposing their monoliths without going full blown microservices. One company I’ve talked to, for example, did significantly more chaos testing in a prod-mirror acceptance environment than true production. They found this produced sufficient value that they saw no need to invest in further maturing production chaos testing.
What About Critical Systems?
A common argument against chaos engineering is that we shouldn’t run it against critical systems because we can’t risk an outage caused by a failed chaos test. While this seems intuitive, it is very wrong and stems from a misunderstanding of what a chaos test is. Recall that you do not run chaos tests that you expect to fail, you run them on a small subset of your traffic, and you have a way to abort the test immediately if things go haywire.
If you’re too worried to run a chaos test on a critical system, even with a test group as small as 0.1% of traffic, then you must expect it to fail for some reason or else you wouldn’t have that fear. Thus you have not reached the point where a chaos test is appropriate, but you have also validated the need for chaos testing: you don’t have confidence in the system’s stability! Determine what that fear is, then run tests in non-prod to either identify the faults you need to fix or convince yourself the system really is stable enough to test in production, your fear having been unwarranted.
Finally, weigh the cost of not doing chaos testing against the cost of a failed chaos test. For example, suppose a retailer ran a test on their billing system with 0.5% of their traffic in the test group. The test passes in non-prod, but then in prod it fails. They then lose 20% of those test group purchases (or 0.1% of all purchases) over a 10 minute period before aborting the test.
Yes, those handful of lost sales are a bad outcome. But consider the alternative: had they not found and fixed this issue, then come Black Friday their system might have failed for real, thus losing 20% of those sales all day long with no way to abort. That makes the bad impact of a failed production chaos test sound pretty good!
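The back-of-envelope comparison above can be spelled out. The figures below are the illustrative numbers from the example, not real data:

```python
# Illustrative figures from the retailer example, not real data.
test_group_share = 0.005   # 0.5% of traffic in the test group
failure_loss_rate = 0.20   # 20% of test-group purchases lost
test_duration_min = 10     # minutes until the test is aborted

# Cost of the failed chaos test: 20% of 0.5% of all purchases,
# for only 10 minutes.
chaos_test_loss = test_group_share * failure_loss_rate
print(f"Failed chaos test: {chaos_test_loss:.1%} of purchases lost "
      f"for {test_duration_min} minutes")

# Cost of the same latent fault surfacing on Black Friday instead:
# 20% of ALL purchases, all day, with no abort switch.
black_friday_loss = failure_loss_rate
print(f"Untested fault on Black Friday: {black_friday_loss:.0%} "
      f"of purchases lost all day")
```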
Chaos engineering is an interesting development in IT. I expect we will see it grow in the coming years. Developing basic chaos engineering skills and maturity now will get your company ahead of your competition and lay the foundation for the rapid growth of a future microservices architecture.