How to run a gameday exercise in complex, multi-product environments

In my role as Senior Software Developer here at Ad Hoc, I recently led a gameday exercise for a number of systems we support for the Centers for Medicare & Medicaid Services under our Website Development Support (CMS WDS) contract. A gameday exercise is a series of mock disaster recovery drills run in a technical environment designed to test the skills and processes teams have put in place to deal with production incidents.

I first learned about this method when I worked with technologist Dylan Richard. Here’s a presentation he gave about gameday. I was excited to run my first exercise here at Ad Hoc with CMS, and I used much of what I learned from Dylan.

Preparation

I started working with my colleagues here at Ad Hoc and our customer, CMS, three months before execution. We worked to define the size and scope of the exercise, get their approval, and make sure it fit their needs. Most large organizations have requirements around disaster recovery simulations, so this exercise is often welcome, if not required. In the course of defining the exercise with CMS, we learned that our efforts fulfilled requirements under NIST’s “Contingency Planning Guide for Federal Information Systems” and “Guide to Test, Training, and Exercise Programs for IT Plans and Capabilities”.

Next, I prepared an isolated environment in which to conduct the exercise. This environment mirrored the composition of our production environment at a slightly smaller scale. I ensured that all application code matched the production setup, and configured monitoring tools (AWS CloudWatch, New Relic, etc.) and alerting tools (PagerDuty, private Slack integrations, and so on) to match the production configuration.
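
To give a flavor of what matching the production alerting looked like, here’s a minimal sketch of one way to wire it up: a CloudWatch alarm whose action is an SNS topic that forwards to PagerDuty. The alarm name, metric, thresholds, and topic ARN below are placeholders, not our actual configuration.

    # Minimal sketch: recreate a production-style alert in the gameday environment.
    # All names, ARNs, and thresholds are placeholders.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="gameday-app-5xx-errors",  # hypothetical alarm name
        Namespace="AWS/ApplicationELB",
        MetricName="HTTPCode_Target_5XX_Count",
        Dimensions=[{"Name": "LoadBalancer", "Value": "app/gameday-app/abc123"}],
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=5,
        Threshold=10,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        # The SNS topic forwards to PagerDuty, mirroring how production pages on-call.
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:gameday-pagerduty"],
    )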

I reviewed past production incidents for each application, and used those to inform the design of the scenarios I ran. We maintain a collection of incident response documents containing records of each production incident, which made this review straightforward.

I designed between two and four scenarios for each application team. Larger, more complex applications received more scenarios, as there were more moving parts to test and break. Each scenario consisted of three attributes: what event would occur, how that event would be triggered, and how it would be communicated to the application team. For example:

What: drop all inbound network connectivity for an app and its databases

How: AWS console

Alert: New Relic monitors, PagerDuty incident
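
This particular failure was injected by hand in the AWS console, but the same effect can be scripted. Here’s a rough boto3 sketch, assuming a hypothetical security group ID; it snapshots the existing ingress rules so connectivity can be restored when the drill ends, then revokes them.

    # Rough sketch of the "drop all inbound connectivity" scenario, scripted with
    # boto3 instead of the AWS console. The security group ID is a placeholder.
    import json
    import boto3

    ec2 = boto3.client("ec2")
    GROUP_ID = "sg-0123456789abcdef0"  # hypothetical app/database security group

    # Snapshot the current ingress rules so connectivity can be restored afterward.
    group = ec2.describe_security_groups(GroupIds=[GROUP_ID])["SecurityGroups"][0]
    ingress = group["IpPermissions"]
    with open(f"{GROUP_ID}-ingress-backup.json", "w") as backup:
        json.dump(ingress, backup, default=str)

    # Revoke every inbound rule, simulating a total loss of inbound connectivity.
    if ingress:
        ec2.revoke_security_group_ingress(GroupId=GROUP_ID, IpPermissions=ingress)

    # After the drill, restore connectivity with:
    #   ec2.authorize_security_group_ingress(GroupId=GROUP_ID, IpPermissions=ingress)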

In general, I did not specify too much detail in the scenarios; one or two sentences were sufficient to convey the intent of each one. In designing the scenarios, I tried to cover four common modes of system failure (a sketch of a scenario as a small data structure follows the list):

  1. Human failure: a person executes a command or job that breaks something

  2. Infrastructure failure: a system, such as a database, fails, causing dependent systems to fail

  3. Application code failure: errors in the application code cause failures

  4. Partner failure: third parties, such as system support personnel or tools, do not perform as expected
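
To make the shape of a scenario concrete, here’s a purely illustrative sketch of those attributes as a small data structure. In practice, as noted above, each scenario lived as a sentence or two of prose.

    # Illustrative only: a structured view of the scenario attributes described above.
    from dataclasses import dataclass
    from enum import Enum


    class FailureMode(Enum):
        HUMAN = "human"                    # a person breaks something
        INFRASTRUCTURE = "infrastructure"  # a dependency such as a database fails
        APPLICATION = "application"        # a bug in application code causes the failure
        PARTNER = "partner"                # a third party or external tool misbehaves


    @dataclass
    class Scenario:
        what: str          # the event that will occur
        how: str           # how the event will be triggered
        alert: str         # how the team will learn about it
        mode: FailureMode


    example = Scenario(
        what="Drop all inbound network connectivity for an app and its databases",
        how="AWS console",
        alert="New Relic monitors, PagerDuty incident",
        mode=FailureMode.INFRASTRUCTURE,
    )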

Execution

Three days before the exercise, I reminded the entire team that we had a game day approaching:

@channel We’ve got a client-wide “game day” this Wednesday. This means that the entire team will practice responding to staged incidents for each application (list).

What does this mean?

Starting at 11a Eastern time on Wednesday, I will start breaking the applications in the imp1b environment. Not all applications will break at the same time. You’ll start seeing notifications, and should treat the exercise as if it’s a bona fide production incident (read: don’t ignore it).

What do I need to do to prepare?

Check that you know how to:

  1. access New Relic dashboards for your applications
  2. view your applications’ logs
  3. log into the AWS Console (if your app uses AWS)
  4. SSH into servers (if your app uses EC2)

Review Incident Management documentation

Ensure you’re available from 11a - 3:30p on Wednesday

Why should I care?

Applications break all the time; how we respond to failure is what differentiates us.

The night before the exercise, I finalized the list of scenarios that I would run and sent that to our client partners. I also spot-checked New Relic monitoring and application logging to ensure the systems were working correctly.
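
The spot check itself was manual; the sketch below just shows the gist with hypothetical health-check URLs, confirming each application answers before anything gets broken on purpose. The New Relic and log review were done by hand.

    # Sketch of a pre-gameday spot check against hypothetical health-check URLs.
    import sys
    import urllib.request

    HEALTH_CHECKS = {
        "app-one": "https://app-one.imp1b.example.gov/healthcheck",  # placeholder URLs
        "app-two": "https://app-two.imp1b.example.gov/healthcheck",
    }

    failures = []
    for name, url in HEALTH_CHECKS.items():
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status != 200:
                    failures.append(f"{name}: HTTP {resp.status}")
        except Exception as exc:
            failures.append(f"{name}: {exc}")

    if failures:
        print("Not ready for gameday:", *failures, sep="\n  ")
        sys.exit(1)
    print("All health checks passed.")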

The morning of the exercise I reminded the team of the exercise and provided some information on logistics:

Hello @channel, we’ll commence the game day activities shortly. First, a few notes about logistics:

  • Events will start no earlier than 10a Central time, and we’ll wrap by 2:30p Central.
  • Activities will take place largely in the staging environments. (Note: due to ongoing content editing in one application, incidents related to it will occur in test.)
  • In the case of a bona fide production incident concurrent with the game day activities, I will announce in Slack that the game day is suspended.
  • Do not use the #incident channel; use [the proper channel] instead.
  • Take a moment to review the Incident Response Process.
  • For the purpose of this exercise, I’ll be available by email.

At 10:28am, I initiated the game day activities by triggering the first scenario, a security breach. I maintained a running activity log that I updated each time I took a discrete action:

Timestamp | Activity                   | Evidence
1028      | Send security incident PD  | Link to PD incident
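
The “Send security incident PD” entry refers to opening a PagerDuty incident for the drill. One way to do that programmatically is PagerDuty’s Events API v2; the sketch below uses a placeholder routing key and isn’t necessarily how this particular incident was raised.

    # Sketch: trigger a drill incident via PagerDuty's Events API v2.
    # The routing key and payload values are placeholders.
    import json
    import urllib.request

    event = {
        "routing_key": "YOUR_SERVICE_INTEGRATION_KEY",  # hypothetical integration key
        "event_action": "trigger",
        "payload": {
            "summary": "[GAMEDAY] Possible security breach detected in imp1b",
            "source": "gameday-conductor",
            "severity": "critical",
        },
    }

    request = urllib.request.Request(
        "https://events.pagerduty.com/v2/enqueue",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        print(resp.status, resp.read().decode("utf-8"))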

For the next five hours, I split my time between triggering additional incidents, monitoring the response, and adjusting the next set of scenarios based on the progress teams were making. I didn’t adhere to a strict schedule, but instead prepared a variety of options that I could pull from as needed. I also posted updates to a private HipChat channel that the client used to monitor the activities and ask questions about the exercise.

I monitored each team’s response efforts; occasionally, I would add complexity by removing a team member for a period of time or by forcing the team to find an alternate channel to coordinate their response. These surprises were intended to simulate the unpredictable nature of incident response – sometimes responders get pulled away mid-incident, or systems they take for granted, like Slack or New Relic, aren’t available.

What happened

Over the course of the 5-hour exercise, I executed 7 scenarios; at one point the teams were responding to 4 simultaneous scenarios. I logged 31 discrete actions, and the 21-person Ad Hoc team posted 1,207 messages in Slack during the exercise. In the end, the teams responded successfully to every scenario: they triaged the incident, restored service, and produced an incident response document recording the steps they took. All teams were engaged for the entire 5-hour period, without breaks.

During the course of the response, I observed the teams working in roughly similar fashion. Someone, usually the on-call engineer in PagerDuty, would receive an alert. They would then pass it along to the rest of the team and open an incident document to start tracking the response. The team would then dive in, assigning responsibility – one person would assume the “incident commander” role and lead the response, another would take notes in the incident document, and the remainder of the team would split up the triage and technical response.

At the end of the exercise, the entire Ad Hoc team joined a retrospective to talk about how they felt, how they viewed the exercise, and their immediate feedback on what they learned. In general, the team was exhausted but glad they participated, and they found it to be a valuable exercise.

Notes

I collected all relevant artifacts of the exercise — Slack transcript, HipChat conversations, activity log, screenshots, and so on — and archived them in a place where all teammates, even those not on the project, can access and learn from them.

The following resources were instructive in the planning of the exercise:

  1. 10 Gameday Failure Testing Scenarios

  2. Game-Day Testing: Throw Things at Your System and See What Happens

  3. Velocity 2013: Dylan Richard, “Gamedays on the Obama Campaign”

  4. Google Site Reliability Engineering

Thank you to Paul Smith and Ryan Nagle for comments and feedback on this report. Thank you to Wryen Meek for managing the logistics of setting up the game day with our client.

Next steps

The Ad Hoc team compiled their seven incident reports and is filing tickets to track work to improve monitoring and alerting and to create documentation for future incident response. The exercise revealed bona fide issues in our readiness, especially gaps in knowledge of how to access systems and how to debug problems. We’re grateful to our client, CMS, for being such a great partner in this: reviewing and approving the plan, monitoring the execution, and working with us as we continuously improve.

Join us!

Want to participate in our next gameday exercise? Join the team.