On-call teams at startups have three big problems: they’re small, they cover a wide breadth of infrastructure, and the last two points usually imply that they lack the bandwidth to maintain and write documentation for a suite of DevOps tools. At SigOpt, our on-call team tackles these challenges with a biannual “disaster recovery exercise”, or simulated outage.
In this blog post I will show you what a disaster recovery exercise is, how it can diagnose weak points in your infrastructure, and how it can be a learning experience for your on-call team. I hope that by the end you’ll consider running a disaster recovery exercise for your on-call team!
What is a Disaster Recovery Exercise?
A disaster recovery exercise is a fire drill for your on-call team. The exercise is most useful when it is as realistic as possible. A well-designed exercise will have engineers searching through your production codebase for the tools they need to operate on a production-like environment.
Our disaster recovery exercises follow four basic principles:
- All on-call engineers are gathered in one room
- Sterilized environment (like prod, but not prod)
- Clear objective
- Timeboxed recovery
At SigOpt, we run on AWS, so our first exercise was to spin up an API from scratch in our backup region. Our sterilized environment was us-east-1, with no access to AMIs, instances, or databases in our production region. Our objective was to hit dr-api.sigopt.com and serve an API request. Our timebox was 4 hours, a figure we took from an engineering OKR.
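A clear objective and a timebox are easy to turn into a small script that someone leaves running during the exercise. The sketch below is only an illustration, not the tooling we used: the function names, the polling interval, and the `https://` scheme on the URL are all assumptions.

```python
import http.client
import time
import urllib.request
from typing import Callable


def api_is_up(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx), and socket timeouts are all OSErrors.
        return False


def wait_for_objective(
    check: Callable[[], bool],
    timebox_s: float,
    poll_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it passes or the timebox runs out."""
    deadline = clock() + timebox_s
    while clock() < deadline:
        if check():
            return True
        sleep(poll_s)
    return False


# Hypothetical usage for a 4-hour timebox (the URL scheme is an assumption):
# met = wait_for_objective(
#     lambda: api_is_up("https://dr-api.sigopt.com"), timebox_s=4 * 60 * 60
# )
```

Passing the clock and sleep functions in as parameters keeps the loop testable without actually waiting four hours.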
Tip: Create new AWS keys for each exercise to avoid accidentally deleting production resources (and temporarily deactivate current keys to ensure the new ones are used!)
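One way to back up this tip is a guard that exercise scripts call before touching anything. This is a minimal sketch under stated assumptions: it only inspects the standard `AWS_ACCESS_KEY_ID` environment variable, so it will not catch credentials loaded from a config file or an instance role, and the function and exception names are made up for illustration.

```python
import os
from typing import Mapping


class WrongCredentialsError(RuntimeError):
    """Raised when the active AWS key is not the one created for the exercise."""


def assert_exercise_credentials(
    exercise_key_id: str, environ: Mapping[str, str] = os.environ
) -> None:
    """Refuse to continue unless the exercise-only access key is active."""
    active = environ.get("AWS_ACCESS_KEY_ID")
    if active != exercise_key_id:
        # Deliberately avoid echoing the active key ID into logs.
        raise WrongCredentialsError(
            "AWS_ACCESS_KEY_ID does not match the key created for this exercise"
        )


# Hypothetical usage at the top of a recovery script:
# assert_exercise_credentials("AKIA...exercise-key-id...")
```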
Disaster Recovery Exercise as an Infrastructure Diagnostic
We ran our original disaster recovery exercise to diagnose holes in our ability to recover our infrastructure. True to our goal, the exercise produced a few months of projects to work on.
We found many larger projects, but funnily enough, the quickest fixes were usually for the least obvious bugs. For example:
- Our install script, which runs multiple times a day, depended on a pre-existing “sigopt” directory, which was not present on fresh Ubuntu AMIs
- Newer engineers did not know about the existence of our machine management scripts
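The first bug above is the kind that an idempotent setup step avoids: create the directory if it is missing instead of assuming it exists. Here is a rough sketch in Python. Our actual install script is not shown in this post, so the function name and the base path are hypothetical; only the "sigopt" directory name comes from the example above.

```python
from pathlib import Path


def ensure_install_dir(base: Path, name: str = "sigopt") -> Path:
    """Create `base/name` if needed and return it; safe to call repeatedly."""
    target = base / name
    # exist_ok=True makes this a no-op on machines that already have the
    # directory, and parents=True covers a fresh image missing `base` too.
    target.mkdir(parents=True, exist_ok=True)
    return target


# Hypothetical usage on a fresh machine:
# install_dir = ensure_install_dir(Path.home())
```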
To find problems large and small, we run a debrief meeting to conclude the exercise. In this meeting, we talk candidly about what worked and what didn’t, referring to notes taken during the exercise.
Here are some of the questions that we ask ourselves during our debriefs:
- Was the objective met? How could it have been met faster?
- Which tools were used? Which were not (but should have been)?
- Which tools were broken? Which were slow?
- What weren’t we able to recover?
At SigOpt, a Disaster Recovery Exercise is a Team Learning Experience
At SigOpt, we are constantly trying to learn. Though it started as a way to diagnose infrastructure, the disaster recovery exercise quickly proved to be a fantastic trial-by-fire learning opportunity for our small team, and engineers reported increased confidence in their on-call problem-solving abilities.
We use the following principles to set up the team dynamics for the exercise:
- Newer team members lead the recovery part of the exercise, so they maximize their opportunity to learn from hands-on experience
- Team leads plan the exercise and lead the recap, sitting back and taking notes during the recovery portion
- To set the tone, we communicate clearly to the team beforehand that this exercise is a learning opportunity for everyone; it is NOT a test of engineers’ personal ability to spin up resources
Additionally, because all team members are together in one room, working on one problem, the disaster recovery exercise is a unique team-building exercise. To extend the team-building atmosphere after the recap meeting, we’ve started to include dinner and drinks as an offsite!
A disaster recovery exercise is many things for us. It’s a fire drill that proves to newer team members they have what it takes to be a part of our on-call rotation. It’s a diagnostic to identify bad, broken, or hard-to-find tools. And it’s a team bonding exercise where everyone sits down together for a few hours to solve a challenge. Next time you’re planning a team event for your on-call team, I hope you’ll consider your own disaster recovery exercise! If you do, I’d love to hear about it.