In the New York area we’re continuing to deal with very real and personal impacts of Sandy. The sheer magnitude of the loss is frankly overwhelming. Many lives were lost, thousands of people remain displaced as their homes are either gone or condemned, huge pieces of the local infrastructure are still not functioning, and many local shoreline treasures are potentially damaged beyond repair.
At Gotham, we don’t do anything particularly heroic in the face of disasters. We don’t rush into burning buildings or climb telephone poles in gale force winds. We just try to keep our customers’ IT running. I realize that this can sometimes be a trivial pursuit in the face of the much larger dramas being played out in a disaster like this, but it’s our job. We’ve had some interesting experiences with that job over the past few weeks, so I thought I’d blog about it. Here are some lessons learned from an IT Business Continuity standpoint. Many of these reflect Gotham’s individual experience in the storm, but they are also indicative of issues we have heard from clients:
- “Good Enough” is not good enough. Many of us (Gotham included) had plans to recover services in stages. We planned to recover critical services first for a smaller user population. Over time all services would be restored and solutions could scale for larger populations. Although this all sounded good in planning, it fell short of the business need during the event.
- Beware any product feature specifically promising Business Continuity or High Availability. Make sure you understand what sorts of faults the product is prepared to recover from and what types of issues will stop it in its tracks. Gotham, like many of our clients, uses a VoIP system to provide call forwarding and a myriad of other useful features for fault tolerance. However, when the physical phone lines and power are cut to your main units, there’s not a lot they can do for you. Many customers had storage replication but learned the hard way that there’s a big gap between “I have all the server images” and “We’re up and running.”
- You’re only as good as your providers. Gotham uses a messaging service to provide after-hours and disaster support for call distribution to our support lines. This service has a number of redundancies and has served us well through several previous events. During this storm however, they physically lost some primary systems and experienced an outage. We were left scrambling to provide support on our cell phones. Many customers also experienced issues as heretofore stable recovery sites and plans were not available due to the size and severity of this event.
- You don’t know what you don’t know and you’re never going to learn it. Every event brings a new set of challenges. As an organization, we’re very dependent on being able to physically get our employees to customer sites. Delivering able-bodied individuals to Manhattan locations in an environment with no gas, no mass transit, and precious few hotel rooms was challenging. One of our solutions was to dispatch engineers who live in the city and have them use bikes to get around. In other cases we simply had engineers commute off-hours and found places for them to sleep. Does this mean we should hire more engineers who live in the city and make sure they all have working bicycles? Probably not. I don’t think this is the kind of thing that you can actually plan for. I think you stay as agile as you can and roll with it. This time it was gas; who knows what the next event will bring.
Over at Gotham we’re helping our customers make improvements to their BC plans and making a few improvements to ours as well. Here’s hoping that all of the more dramatic and lasting problems brought by Sandy get better as well.