Wednesday, February 5, 2014

Availability Metrics

How reliable is your product?

Really reliable.

How often does it suffer an outage?

Never ever ever ever! It's super-duper reliable!

How do you know?


Back when I first began my career, in another millennium, Six Sigma was becoming a big buzzword. As a co-op at Pratt & Whitney I took my first class in it, and I've received training in it throughout my career.

The process has its detractors, some pointing out that it's just a method of formalizing common-sense techniques. In my mind its two endpoints - the final goal of predictable results and the starting point of needing data - are both critical when it comes to availability. Now, being down all the time is certainly one way to achieve predictable results, but I think we can safely assume that we'd rather have a whole bunch of nines when it comes to availability.

To get to that point you need to know where you are and where you've been. You don't know how available your product is unless you measure it. You don't know your areas of vulnerability unless you keep track of which areas experience failures. You may develop a "gut feel" for problem areas, but you should have data to back that up.

Various availability metrics require you to measure the performance of your product. Off the top of my head, here are some of the things one would want to measure in an enterprise storage product when measuring and improving availability:
  • Number of installations
  • Number of downtime events
  • Dates of downtime events
  • Duration of downtime events
  • Hours of run-time
  • Versions of software/hardware being used at downtime events
  • Trigger/stimulus that caused downtime events
  • Component responsible for downtime events
  • Fixes that went into various releases
When you're performing your own testing internally, getting these numbers is not particularly difficult. However, the true measure of availability comes when a product is being used by customers in their actual use cases. This is where it behooves you to have a good relationship with your customers so you can get this valuable data from them. Of course, in the enterprise it is quite likely that vendor and customer will have service agreements which can make much of this collection automatic.

The first thing these numbers give us is a snapshot of our product's quality. We can compute raw numbers for availability, mean time between failures, outage durations, and so on.
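As a minimal sketch of how the raw measurements from the list above turn into those numbers, here is one way to compute availability, mean time between failures (MTBF), and mean time to repair (MTTR). The record fields and function name are hypothetical, chosen to mirror the measurements listed earlier:

```python
from dataclasses import dataclass

@dataclass
class OutageEvent:
    date: str              # date of the downtime event
    duration_hours: float  # duration of the downtime event
    component: str         # component responsible
    version: str           # software/hardware version in use

def availability_metrics(run_hours: float, events: list[OutageEvent]) -> dict:
    downtime = sum(e.duration_hours for e in events)
    failures = len(events)
    return {
        # Fraction of total time the product was up - "the nines"
        "availability": run_hours / (run_hours + downtime),
        # Mean time between failures, in hours
        "mtbf_hours": run_hours / failures if failures else float("inf"),
        # Mean time to repair: average outage duration
        "mttr_hours": downtime / failures if failures else 0.0,
    }

# Example: one year of run-time (8760 hours) with two outage events
events = [
    OutageEvent("2013-06-01", 2.0, "controller", "4.1"),
    OutageEvent("2013-11-15", 0.5, "firmware", "4.2"),
]
metrics = availability_metrics(8760.0, events)
print(f"{metrics['availability']:.6f}")  # → 0.999715 (about "three nines")
```

Note that availability here is measured over the whole fleet's run-time; per-installation or per-component variants follow the same arithmetic with the data sliced differently.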

We can also leverage this data to help make decisions internally. If patterns emerge in typical triggers and/or responsible components, they can point the way to where improvements are needed. If a new release sees a spike in outages, that points to the need for decisions like pulling the release and/or issuing a patch. And knowing where various fixes have been made, coupled with knowing customer use cases, can indicate which customers will benefit most from performing a software upgrade.
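Spotting those patterns can be as simple as tallying downtime events by responsible component and by release. The event tuples below are invented for illustration:

```python
from collections import Counter

# Hypothetical downtime events as (version, responsible_component) pairs
events = [
    ("4.1", "controller"),
    ("4.2", "firmware"),
    ("4.2", "firmware"),
    ("4.2", "firmware"),
]

by_version = Counter(version for version, _ in events)
by_component = Counter(component for _, component in events)

# A release accounting for most of the outages is a candidate for
# being pulled or patched; a dominant component is where to focus fixes.
worst_release, count = by_version.most_common(1)[0]
print(worst_release, count)  # → 4.2 3
```

In practice one would weight by outage duration and normalize by how many installations run each release, but the shape of the analysis is the same.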

Though it is a delicate matter, sharing metrics with customers helps them make their own decisions in setting up their datacenters. Knowing which components are most vulnerable illustrates where redundancy is required. And of course such knowledge will play a large part in which products they choose to purchase and what they are willing to pay.
