So you have a disaster recovery center, you have a playbook that provides specific instructions on how to bring it online in the event of a disaster, and you test the plan once a year – but are you really as prepared as you think you are? For many of us in the Northeast, Hurricane Sandy gave us the opportunity to find out.
I’d like to share an experience that made me rethink the way we approach disaster planning and testing. I had a customer who believed they were well prepared, but when disaster struck, it turned out they were not.
Here’s the background: My customer was running Exchange 2007 and needed to restore Exchange services in an alternate facility because of an extended power outage at the primary data center. All servers were located in that single site. Although the client had a secondary site, it had no live servers that could be used to restore services, so Exchange 2007’s standby continuous replication (SCR) was not in use.
The plan to recover Exchange was to build a single server with the Mailbox, Hub Transport, and Client Access (CAS) roles, and create a dial-tone database to quickly give users the ability to send and receive email. A dial-tone mailbox is an empty mailbox that users can connect to in order to send and receive new messages. Old data is not available until a recovery operation is performed (typically from backup). Next, existing mailbox data would be restored from backup to a recovery storage group. The restored data would then be merged with the dial-tone mailboxes, giving users all their mailbox data from before and after the failure.
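To make the end state of that plan concrete, here is a minimal sketch in Python of what the final merge is meant to achieve. It is only an illustrative model and not how Exchange performs the merge internally; the `Message` type, the `merge_mailbox` function, and the sample messages are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for a mailbox item (not an Exchange type)."""
    message_id: str
    received: str  # ISO 8601 timestamp, so plain string sorting is chronological
    subject: str


def merge_mailbox(restored, dial_tone):
    """Combine pre-outage items (from backup) with post-outage items
    (from the dial-tone mailbox), dropping duplicates by message_id."""
    merged = {}
    for msg in list(restored) + list(dial_tone):
        # Keep the first copy seen for each id, so restored items win on a clash.
        merged.setdefault(msg.message_id, msg)
    return sorted(merged.values(), key=lambda m: m.received)


# Items recovered from backup (everything before the outage).
before = [
    Message("a1", "2012-10-26T09:00", "Budget review"),
    Message("a2", "2012-10-28T17:30", "Storm prep"),
]
# Items that arrived in the empty dial-tone mailbox after the outage.
after = [Message("b1", "2012-10-31T08:15", "Office status")]

mailbox = merge_mailbox(before, after)
print([m.subject for m in mailbox])
```

Until the merge runs, a user connected to the dial-tone mailbox sees only the `after` list, which is the gap in the user experience that the rest of this story turns on.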
Even though this was not a sexy plan, it did provide a clear course the technical team would follow to restore services for Exchange in the event of a disaster. Everyone slept well at night knowing there was a plan.
A day after the storm, I got called to help execute the plan once a decision was made to restore services in the alternate location. We built the new server quickly, including setting up the dial-tone mailboxes. We connected with Outlook Web Access to test it and everything looked good. One more test and we were ready to change DNS pointers and open the door so users could connect remotely via Citrix, Outlook Anywhere, OWA, and ActiveSync.
What we hadn't considered in the plan was what users would experience when they opened Outlook and connected to the new mailbox database. In this case, they were presented with the message shown below.
The issue was that if users selected Use Temporary Mailbox, Outlook would clear its cache, and they would lose access to their pre-outage data. This would only be a temporary problem, since that data would be restored from backup. However, we realized that without the cached data, users would lose their only access to contact and calendar information in the meantime. Even worse, for those with smartphones, there was a risk that the contacts would be wiped as well. Again, this was something we had not thought of.
After talking it over, our solution was to open up OWA access only. This would let users send and receive email through OWA while still seeing their old messages, contact information, and so on in the cached copies held by Outlook and their smartphones.
In the end, this could have been much worse. Still, it showed us that we were not as prepared as we thought we were.
From a higher level, this showed me that many of us need to evaluate our DR plans from the client's perspective. I have several customers who go to great lengths to plan, implement, and test their disaster recovery plans on a regular basis. In almost every case, the plan focuses on the back-end infrastructure and restoring services. Little time or effort is spent on the client experience. Even when effort has been dedicated to it, certain assumptions are often made, and those assumptions may not hold true during a real emergency. That leaves holes in our recovery plans.
Many of us need to take a hard look at our emergency plans and ask whether they cover the complete end-to-end solution. We also need to review our assumptions and ask whether they hold under a realistic set of real-life circumstances. After all, though we may plan for failures, we never get to decide what ultimately fails or the circumstances behind it.