Like many other service providers, our team has had disaster recovery plans (DRPs) for many years. However, over time we identified several problems with them as a group, such as poor assumptions, incomplete documentation, and inconsistent testing. Revising our DRPs to use a new, common template in 2022 helped somewhat, but not enough.
We wanted to level-set our assumptions, dependencies, and expectations to increase consistency; improve our understanding of how we'd actually implement our DRPs in the event of a disaster; and identify and remediate any gaps or blind spots, including any incorrect assumptions and missing documentation. We wanted everyone to operate with a common set of assumptions, write and test their DRPs similarly, and identify all service dependencies up and down the stack.
We ran tabletop exercises for three of our services' plans over two days: our VMware service, our primary web hosting service, and our on-campus data center. In each exercise, we started with a brief problem description (such as "An attacker has..." or "You were informed that...") and set the team loose to identify and remediate the problem within the constraints of the DRP.
We measured success both qualitatively and quantitatively, and we improved from the first to the second session. Participants provided useful feedback in the post-event survey. We identified several gaps and made plans to fill them. We also identified ways to make our DRPs better and to run future exercises better.