Sunday, March 23, 2014

Mea Culpa, Mea Culpa, Mea Maxima Culpa - Handling Outages

Mea culpa, mea culpa, mea maxima culpa
Through my fault, through my fault, through my most grievous fault

I'm a post-Vatican II Catholic boy, but this sort of declaration is creeping back into the English translation of the Mass in its confession of sinfulness.

I've been in industry long enough to see bugs in products I've been involved in have major impacts. In a previous life working on Frame Relay, ATM, and SONET/SDH, I saw major communications backbones go down as a result of software defects. In the storage world I've seen similar results when data becomes unavailable (or worse, lost). I've seen public webpages go down, stores unable to sell their products, and university networks fail.

Several years ago I made the move from development to a more customer-facing role, in a group dedicated to providing engineering support. (It occurs to me that in reality development is the ultimate in customer-facing roles...) In other words, when the official support personnel needed to talk to engineering, we were the people they'd talk to. We'd be the engineers on calls with customers. In one of the first calls I was on, I witnessed an understandably angry customer dealing with the fact that it was a peak time of the year for their business and their store was offline as a result of a problem with our product. I watched a fairly senior person on the call who demonstrated a very important trait - empathy. We were far from having a root cause for the incident, and there was no way to know at that early stage if it was a misconfiguration, a problem with a connected product, or a problem with our product. (I seem to recall it was a combination of all three.) But the most important thing that person did that early in the incident was ensure the customer was aware that we knew, that we grokked, just how serious the problem was for them and that we were doing everything we could to get to a root cause. And to back that up. To have a communication plan and to follow up on it.

Communication is tricky while an incident is still going on. Often when figuring out a problem you go down several false leads before you uncover the true root cause. I know that in my time in development I was always very loath to go public with theories. However, this is the sort of thing that does need to be communicated to a customer, obviously with the caveat that it is an ongoing investigation.

One thing that needs to be kept in mind: the customer is going to want a root cause, but in nearly all cases the customer's highest priority is going to be getting back up and running. If it comes to a decision between gathering and analyzing data over several hours and performing an administrative action that will end the outage in minutes, most customers are going to want to be back up. Obviously this is something to discuss with customers. The point, though, is that this is the customer's time and money. It is during development and testing that a vendor has the luxury of performing detailed analysis and experiments on a system suffering from an outage - and this points to how critical it is to "shift left" the discovery of problems. A customer discovering a problem is a very expensive escape. Now, in some situations a full root cause will be necessary in order to end the outage, but this is not always the case.

Typically, after an outage is over, a customer is going to want to know more about why the problem occurred, when it will be fixed, how they can avoid hitting it again, etc. These are topics I'll be covering at a later point.

Image credit: ewe / 123RF Stock Photo
