Monday, January 27, 2014

What is an Outage?

One of the topics I'm going to be talking about is that of metrics.In order to have any hope of knowing a product's availability you've got to measure how often it goes down and for how long. You've got to know what the product's install base is at any given time. These are topics I'm going to delve into as this blog matures.

But let's start with what might seem simple - the question "what is an outage?" Sometimes the answer is very obvious. If a SAN switch just goes offline then that switch has clearly experienced an outage.

On the other hand, sometimes the answer is less obvious. For example, consider the following typical arrangement:


You've got your host connected to a storage array via two switches, one of which has failed. On the host is some multipathing software which routes any I/O request to the storage array through one of the two switches. If your product was responsible for providing multipathing on the host then the outage's responsibility is not solely with the switch vendor - indeed a customer may view it to be solely with the provider of multipathing software if they had planned their environment accounting for the fact these hypothetical switches have a lower availability.

Let's add some complexity to the mix. What if the multipathing software is working fine but the customer did not configure it correctly? For example suppose that it was setup to use only one path and it would only fail over to the other path upon a manual request. This brings to mind one of my least favorite phrases - "customer error". But one must be extremely unwilling to make the root cause for any outage. Was the multipathing documentation clear? Did it alert the user to the fact that one failure could cause an outage? 

And consider taking it to a greater extreme. Suppose the multipathing software is configured perfectly and after the switch fails all I/O is routed to the other switch. But then suppose a few hours later the other switch fails too. Is the multipathing software absolved? Not necessarily. Did the multipathing software make it clear to the user that it is one failure away from unavailability? And making it clear is vital. Is that information embedded in some log or is it in some alarm that screams for an administrator's attention?

At the end of the day, an outage really is defined as "whatever the customer says it is". And those who are truly working at maximizing availability will go beyond even that definition. You want to provide a product that a customer will not only not worry about but will relieve worry, secure in the knowledge that your product is there. When you are providing a product at the enterprise level, any failures of your product have consequences. At the very least, the person who signed off on purchasing your product may have his or her job jeopardized. Beyond that, money can be lost, power grids can go offline, organizations can be unable to operate, and people can actually die. 

No comments:

Post a Comment