
Sunday, March 23, 2014

Mea Culpa, Mea Culpa, Mea Maxima Culpa - Handling Outages

Mea culpa, mea culpa, mea maxima culpa
Through my fault, through my fault, through my most grievous fault

I'm a post-Vatican II Catholic boy, but this sort of declaration has been creeping back into the English translation of the Mass in its confession of sinfulness.

I've been in industry long enough to see bugs in products I've been involved in have major impacts. In a previous life working on Frame Relay, ATM, and SONET/SDH, I saw major communications backbones go down as a result of software defects. In the storage world I've seen similar results when data becomes unavailable (or worse, lost). I've seen public webpages go down, stores unable to sell their products, and university networks fail.

Several years ago I made the move from development to a more customer-facing role, in a group dedicated to providing engineering support. (It occurs to me that in reality development is the ultimate customer-facing role...) In other words, when the official support personnel needed to talk to engineering, we were the people they'd talk to. We'd be the engineers on calls with customers. On one of the first calls I was on, I witnessed an understandably angry customer dealing with the fact that it was the peak time of year for their business and their store was offline as a result of a problem with our product. I watched a fairly senior person on the call demonstrate a very important trait - empathy. We were far from having a root cause for the incident, and there was no way to know at that early stage whether it was a misconfiguration, a problem with a connected product, or a problem with our product. (I seem to recall it was a combination of all three.) But the most important thing that person did, that early in the incident, was ensure the customer was aware that we knew - that we grokked - just how serious the problem was for them and that we were doing everything we could to get to a root cause. And then to back that up: to have a communication plan and to follow up on it.

Communication is tricky while an incident is still going on. Often, when figuring out a problem, you go down several false leads before you uncover the true root cause. I know that in my time in development I was always very loath to go public with theories. However, this is the sort of thing that does need to be communicated to a customer, obviously with the caveat that it is an ongoing investigation.

One thing needs to be kept in mind: the customer is going to want a root cause, but in nearly all cases the customer's highest priority is going to be getting back up and running. If it comes down to a choice between gathering and analyzing data over several hours and performing an administrative action that will end the outage in minutes, most customers are going to want to be back up. Obviously this is something to discuss with customers; the point, though, is that this is the customer's time and money. It is during development and testing that a vendor has the luxury of performing detailed analysis and experiments on a system suffering from an outage - which points to how critical it is to "shift left" the discovery of problems. A problem discovered by a customer is a very expensive escape. Now, in some situations a full root cause will be necessary in order to end the outage, but that is not always the case.

Typically after an outage is over a customer is going to want to know more about why the problem occurred, when it will be fixed, how they can avoid hitting it again, etc. These are topics I'll be covering at a later point.

Image credit: ewe / 123RF Stock Photo

Monday, January 27, 2014

What is an Outage?

One of the topics I'm going to be talking about is that of metrics. In order to have any hope of knowing a product's availability, you've got to measure how often it goes down and for how long. You've got to know what the product's install base is at any given time. These are topics I'm going to delve into as this blog matures.
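To make the measurement concrete, here is a minimal sketch of turning outage records into an availability figure (the "nines" you often see quoted). The function name and the sample data are purely illustrative, not from any real product:

```python
# Hypothetical sketch: computing availability from outage records.
# All names and numbers here are illustrative.

def availability(total_minutes, outage_minutes):
    """Fraction of the measurement window the product was up."""
    return (total_minutes - outage_minutes) / total_minutes

# One year of operation, with three outages totalling 52 minutes.
year_minutes = 365 * 24 * 60
outages = [30, 15, 7]  # minutes of downtime per incident

a = availability(year_minutes, sum(outages))
print(f"{a:.5%}")  # just under "four nines"; "five nines" would be 99.999%
```

Even this toy version shows why install-base data matters: the same 52 minutes of downtime means something very different across ten systems than across ten thousand.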

But let's start with what might seem simple - the question "what is an outage?" Sometimes the answer is very obvious. If a SAN switch just goes offline then that switch has clearly experienced an outage.

On the other hand, sometimes the answer is less obvious. For example, consider the following typical arrangement:


You've got your host connected to a storage array via two switches, one of which has failed. On the host is some multipathing software which routes any I/O request to the storage array through one of the two switches. If your product is responsible for providing multipathing on the host, then responsibility for the outage does not lie solely with the switch vendor - indeed, a customer may view it as lying solely with the provider of the multipathing software, if they planned their environment around the fact that these hypothetical switches have a lower availability.
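The arrangement above can be sketched in a few lines of code. This is a toy model, not any vendor's multipathing API; the class and method names are assumptions made for illustration:

```python
# A minimal sketch of host-side multipathing failover.
# All names here are hypothetical, not a real product's API.

class Path:
    """One route to the storage array (e.g. via one switch)."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

class Multipather:
    """Round-robins I/O across healthy paths; an outage only
    occurs when *every* path is down."""
    def __init__(self, paths):
        self.paths = paths
        self._next = 0

    def healthy_paths(self):
        return [p for p in self.paths if p.healthy]

    def pick_path(self):
        live = self.healthy_paths()
        if not live:
            raise RuntimeError("outage: no path to storage")
        path = live[self._next % len(live)]
        self._next += 1
        return path

switch_a, switch_b = Path("switch-a"), Path("switch-b")
mp = Multipather([switch_a, switch_b])

switch_a.healthy = False           # one switch fails...
assert mp.pick_path() is switch_b  # ...but I/O still flows via the other
```

The point of the sketch: with the multipathing working as intended, a single switch failure is invisible to the application - which is exactly why the customer's expectations land on the multipathing layer.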

Let's add some complexity to the mix. What if the multipathing software is working fine but the customer did not configure it correctly? For example, suppose it was set up to use only one path and would only fail over to the other path upon a manual request. This brings to mind one of my least favorite phrases - "customer error". But one must be extremely unwilling to make that the root cause for any outage. Was the multipathing documentation clear? Did it alert the user to the fact that one failure could cause an outage?

And consider taking it to a greater extreme. Suppose the multipathing software is configured perfectly, and after the switch fails all I/O is routed to the other switch. But then suppose a few hours later the other switch fails too. Is the multipathing software absolved? Not necessarily. Did the multipathing software make it clear to the user that the system was one failure away from unavailability? And making that clear is vital. Is the information buried in some log, or is it in an alarm that screams for an administrator's attention?
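That last distinction - buried in a log versus screaming for attention - can be sketched as a simple severity mapping. The severity names and thresholds below are assumptions for illustration only:

```python
# Hedged sketch: surfacing "one failure away from an outage" as an
# operator-facing alarm. Severity names are illustrative.

def redundancy_alarm(healthy_paths, total_paths):
    """Map remaining redundancy to an alarm severity and message."""
    if healthy_paths == 0:
        return ("CRITICAL", "outage: no path to storage")
    if healthy_paths == 1 and total_paths > 1:
        return ("MAJOR", "degraded: one failure away from an outage")
    return ("OK", "fully redundant")

# After the first switch failure the host still has I/O, but the
# administrator should be shouted at, not whispered to:
severity, message = redundancy_alarm(healthy_paths=1, total_paths=2)
print(severity, "-", message)
```

A product that only logs the degraded state leaves the customer to discover, hours later, that it had been running without a safety net all along.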

At the end of the day, an outage really is "whatever the customer says it is". And those who are truly working at maximizing availability will go beyond even that definition. You want to provide a product that a customer not only doesn't worry about, but that actively relieves worry - the customer secure in the knowledge that your product is there. When you are providing a product at the enterprise level, any failure of your product has consequences. At the very least, the person who signed off on purchasing your product may have his or her job jeopardized. Beyond that, money can be lost, power grids can go offline, organizations can be unable to operate, and people can actually die.