Six Nines: February 2014

Tuesday, February 25, 2014

Too Much Availability

Late last year I picked up a few Christmas presents at Target, paying with my Visa Card. Recently I, like many others, received a replacement Visa Card from my bank. This is due to the massive theft of customer credit card numbers from Target.

While this blog naturally assumes that maximizing availability is a good thing, it carries with it the understood caveat that this availability is for authorized users only. Data unavailability is definitely desired when an unauthorized user attempts to download credit card information.

In the same way that one must track defects that can cause loss of availability, one must also be aware of what can allow security vulnerabilities.

In his blog, Brian Krebs has shone some light on what apparently happened at Target, with the caveat that much of this is unconfirmed (i.e., it might all be wrong; I will correct this as required). Essentially, an HVAC vendor used by Target was the victim of a hacker attack and had its credentials stolen without its knowledge. This gave the hackers access to Target's external vendor billing system. From there they were able to gain access to customer credit card numbers (like mine, apparently). It is still unclear how it was even possible to make the leap from the vendor billing system to individual consumers' credit card data.

At this time I'm not going to dive into every possible improvement to be learned from this. The higher-level point is that while a product must have high availability, there are strong negative consequences to being available to the wrong people. The Target breach is a high-profile example. However, there are plenty of smaller ones - stories of certain people's cloud-based pictures or emails briefly becoming publicly visible, hackers selling lists of social security numbers, etc. Just as faith in a product can be damaged by a lack of availability, so too can it be damaged if the wrong people have access to data.

Thursday, February 13, 2014

Hey Look We're Famous

In my first post I mentioned that sometimes six nines of availability isn't enough. So it was neat to see my prime product mentioned in the following tweet:

Five 9s? That is just SO yesterday. We're measuring SEVEN 9's in the real world with #VPLEX Metro. Now THAT'S #ContinuousAvailability.
— Paul Danahy (@spinningrust) February 11, 2014

That said, as one of the many people who obsess about maximizing availability, I work with the realization that one outage is one outage too many.
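For a sense of scale, here is a rough sketch of what a given number of nines allows in downtime per year. The function name and the 365-day year are my own assumptions for illustration, not anything taken from the tweet or from a product specification.

```python
# Illustrative only: converts "N nines" of availability into allowed downtime.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year, ignoring leap years


def downtime_seconds_per_year(nines):
    """Seconds of downtime per year permitted by `nines` nines of availability."""
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001
    return unavailability * SECONDS_PER_YEAR


for n in (5, 6, 7):
    seconds = downtime_seconds_per_year(n)
    print(f"{n} nines: about {seconds:.1f} seconds of downtime per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why a single outage of any real length can wipe out years of margin.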

Wednesday, February 5, 2014

Availability Metrics

How reliable is your product?

Really reliable.

How often does it suffer an outage?

Never ever ever ever! It's super-duper reliable!

How do you know?

Back when I first began my career, in another millennium, Six Sigma was becoming a big buzzword. As a co-op at Pratt & Whitney I took my first class in it, and I've received training in it throughout my career.

The process has its detractors, some pointing out that it's just a method of formalizing some common-sense techniques. In my mind its two endpoints - the final goal of predictable results and the starting point of needing data - are both critical when it comes to availability. Now, being down all the time is certainly one way to achieve predictable results, but I think we can safely assume that we'd rather have a whole bunch of nines when it comes to availability.

To get to that point you need to know where you are and where you've been. You don't know how available your product is unless you are measuring it. You don't know areas of vulnerability unless you keep track of what areas experience failures. You may develop a "gut feel" for problem areas but one should have data to back that up.

Various availability metrics require you to measure the performance of your product. Off the top of my head, here are some of the things one would want to track in an enterprise storage product when measuring and improving availability:

Number of installations
Number of downtime events
Dates of downtime events
Duration of downtime events
Hours of run-time
Versions of software/hardware being used at downtime events
Trigger/stimulus that caused downtime events
Component responsible for downtime events
Fixes that went into various releases
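As a minimal sketch, the items above could be captured in a record like the following. Every class name, field name, and sample value here is hypothetical and chosen only for illustration; a real product would have its own schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class DowntimeEvent:
    """One outage at one installation (all field names are illustrative)."""
    installation_id: str
    started_at: datetime      # date of the downtime event
    duration: timedelta       # how long the outage lasted
    software_version: str     # software in use at the time
    hardware_model: str       # hardware in use at the time
    trigger: str              # stimulus that caused the outage
    component: str            # component held responsible


@dataclass
class Installation:
    """One deployed system and the outages recorded against it."""
    installation_id: str
    run_time_hours: float
    events: list = field(default_factory=list)


# Fixes that went into each release, keyed by version (made-up sample data).
release_fixes = {"1.2.1": ["network interface failover fix"]}

site = Installation("site-001", run_time_hours=8760.0)
site.events.append(DowntimeEvent(
    installation_id="site-001",
    started_at=datetime(2014, 1, 15, 3, 30),
    duration=timedelta(minutes=2),
    software_version="1.2.0",
    hardware_model="model-a",
    trigger="firmware upgrade",
    component="network interface",
))
```

The number of installations and the number of downtime events then fall out of counting these records rather than being tracked separately.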

When you're performing your own testing internally, getting these numbers is not particularly difficult. However, the true measure of availability comes when a product is being used by customers in their actual use cases. This is where it behooves you to have a good relationship with your customers so you can get this valuable data from them. Of course, in the enterprise it is quite likely that a vendor and customer will have service agreements that can make much of this automatic.

The first thing these numbers can give us is a snapshot of our product's quality. We can get raw numbers for availability, mean times between failures, outage durations, etc.
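Here is a rough sketch of how those raw numbers might be computed from the collected data. It assumes that "hours of run-time" means total elapsed hours across all installations, downtime included. The function name and the sample fleet are made up for illustration.

```python
import math


def summarize(total_hours, outage_minutes):
    """Basic availability figures from total fleet hours and outage durations.

    Assumes total_hours is elapsed time across all installations,
    including any time spent down.
    """
    downtime_hours = sum(outage_minutes) / 60.0
    failures = len(outage_minutes)
    unavailability = downtime_hours / total_hours
    return {
        "availability": 1.0 - unavailability,
        # Number of nines, e.g. 0.999999 -> 6.0; infinite if never down.
        "nines": -math.log10(unavailability) if unavailability > 0 else math.inf,
        # Mean time between failures: uptime divided by number of outages.
        "mtbf_hours": (total_hours - downtime_hours) / failures if failures else math.inf,
        "mean_outage_minutes": sum(outage_minutes) / failures if failures else 0.0,
    }


# Hypothetical fleet: 100 installations running for a year, three short outages.
fleet_hours = 100 * 8760.0
report = summarize(fleet_hours, [2.0, 5.0, 1.5])
for name, value in report.items():
    print(f"{name}: {value:.7g}")
```

Computing unavailability directly from downtime, rather than subtracting availability from one, avoids losing precision when the product is up nearly all of the time.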

We can also leverage this to help make decisions internally. If patterns emerge in typical triggers and/or responsible components, they can point the way to where improvements are needed. If a new release sees a spike in outages, it will point to the need for decisions like pulling a release and/or releasing a patch. And knowing where various fixes have been made, coupled with knowing customer use cases, can indicate which customers will benefit most from performing a software upgrade.
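Spotting these patterns can start out very simply. The sketch below tallies outages by responsible component and compares outage rates between releases. The events, version numbers, and run-hour totals are all invented sample data.

```python
from collections import Counter

# Made-up outage records; a real list would come from field data.
events = [
    {"version": "1.1", "component": "network interface", "trigger": "cable pull"},
    {"version": "1.1", "component": "disk", "trigger": "drive failure"},
    {"version": "1.2", "component": "network interface", "trigger": "firmware upgrade"},
    {"version": "1.2", "component": "network interface", "trigger": "firmware upgrade"},
    {"version": "1.2", "component": "network interface", "trigger": "cable pull"},
]
# Hours of run-time accumulated on each release (also made up).
run_hours_by_version = {"1.1": 500000.0, "1.2": 100000.0}

by_component = Counter(e["component"] for e in events)
by_version = Counter(e["version"] for e in events)

# Normalize by run-time so a widely deployed release isn't unfairly penalized.
rate_per_million_hours = {
    version: by_version[version] * 1e6 / hours
    for version, hours in run_hours_by_version.items()
}

worst_component, worst_count = by_component.most_common(1)[0]
print(f"Most frequent culprit: {worst_component} ({worst_count} outages)")
for version, rate in sorted(rate_per_million_hours.items()):
    print(f"Release {version}: {rate:.1f} outages per million run-hours")
```

Normalizing by run-time matters here: a raw count would make the older release look nearly as bad as the newer one, even though it has been running five times as long.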

Though it is a delicate matter, sharing metrics with customers helps them make their own decisions in setting up their datacenters. Knowing what components are most vulnerable will illustrate where redundancy is required. And of course such knowledge will play a large part in what products they choose to purchase and what they are willing to pay.