Sunday, March 23, 2014

Mea Culpa, Mea Culpa, Mea Maxima Culpa - Handling Outages

Mea culpa, mea culpa, mea maxima culpa
Through my fault, through my fault, through my most grievous fault

I'm a post-Vatican II Catholic boy, but this sort of declaration has been creeping back into the English translation of the Mass in its confession of sinfulness.

I've been in industry long enough to see bugs in products I've been involved with have major impacts. In a previous life working on Frame Relay, ATM, and SONET/SDH, I saw major communications backbones go down as a result of software defects. In the storage world I've seen similar results when data becomes unavailable (or worse, lost). I've seen public web pages go down, stores unable to sell their products, and university networks knocked offline.

Several years ago I made the move from development to a more customer-facing role, a group dedicated to providing engineering support. (It occurs to me that in reality development is the ultimate customer-facing role...) In other words, when the official support personnel needed to talk to engineering, we were the people they'd talk to. We'd be the engineers who would be on calls with customers. In one of the first calls I was on I witnessed an understandably angry customer dealing with the fact that it was a peak time of the year for their business and their store was offline as a result of a problem with our product. I watched a fairly senior person on the call demonstrate a very important trait - empathy. We were far from having a root cause for the incident and there was no way to know at that early stage if it was a misconfiguration, a problem with a connected product, or a problem with our product. (I seem to recall it was a combination of all three.) But the most important thing that person did that early in the incident was ensure the customer was aware that we knew, that we grokked, just how serious the problem was for them and that we were doing everything we could to get to a root cause. And to back that up. To have a communication plan and to follow up on it.

Communication is tricky when an incident is still going on. Often when figuring out a problem you go down several false leads until you uncover the true root cause. I know that in my time in development I was always very loath to go public with theories. However, this is the sort of thing that does need to be communicated to a customer, obviously with the caveat that it is an ongoing investigation.

One thing that needs to be kept in mind - the customer is going to want a root cause, but in nearly all cases, the customer's highest priority is going to be getting back up and running. If it comes to a decision between gathering and analyzing data over several hours and performing an administrative action that will end the outage in minutes, most customers are going to want to be back up. Obviously this is something to discuss with customers; the point, though, is that this is the customer's time and money. It is during development and testing that a vendor has the luxury of performing detailed analysis and experiments on a system suffering from an outage - and this points to how critical it is to "shift left" the discovery of problems. A customer discovering a problem is a very expensive escape. Now, in some situations a full root cause will be necessary in order to end the outage, but this is not always the case.

Typically after an outage is over a customer is going to want to know more about why the problem occurred, when it will be fixed, how they can avoid hitting it again, etc. These are topics I'll be covering at a later point.

Image credit: ewe / 123RF Stock Photo

Tuesday, February 25, 2014

Too Much Availability

Late last year I picked up a few Christmas presents at Target, paying with my Visa Card. Recently I, like many others, received a replacement Visa Card from my bank. This is due to the massive theft of customer credit card numbers from Target.

While this blog naturally assumes that maximizing availability is a good thing, it carries with it the understood caveat that this availability is for authorized users only. Data unavailability is definitely desired when an unauthorized user attempts to download credit card information.

In the same way that one must track defects that can cause loss of availability, one must also be aware of what can allow security vulnerabilities.

In his blog, Brian Krebs has shone some light on just what apparently happened at Target, with the caveat that much of this is unconfirmed (i.e., it might all be wrong; I will correct this as required). Essentially, an HVAC vendor used by Target was the victim of a hacker attack and unknowingly had its credentials stolen. This allowed the hackers access to Target's external vendor billing system. From there they were able to gain access to customer credit card numbers (like mine, apparently). It is still unclear how it was even possible to make the leap from the vendor billing system to individual consumers' credit card data.

At this time I'm not going to dive into every possible improvement to be learned from this. The higher-level point is that while a product must have high availability, there are strong negative consequences to being available to the wrong people. The Target breach is a high-profile example, but there are plenty of smaller ones - stories of certain people's cloud-based pictures or emails briefly becoming publicly visible, hackers selling lists of social security numbers, etc. Just as faith in a product can be damaged by lack of availability, so too can it be damaged if the wrong people have access to data.




Thursday, February 13, 2014

Hey Look We're Famous

In my first post I mentioned that sometimes six nines of availability isn't enough. So it was neat to see the primary product I work on mentioned in the following tweet:

That said, as one of the many people who obsesses about maximizing availability, I work with the realization that one outage is one outage too many.

Wednesday, February 5, 2014

Availability Metrics

How reliable is your product?

Really reliable.

How often does it suffer an outage?

Never ever ever ever! It's super-duper reliable!

How do you know?


Back when I first began my career, in another millennium, six sigma was becoming a big buzzword. As a co-op at Pratt & Whitney I took my first class in it, and I've received training in it throughout my career.

The process has its detractors, some pointing out it's just a method of formalizing some common sense techniques. In my mind its two endpoints - the final goal of predictable results and the starting point of needing data - are both critical when it comes to availability. Now being down all the time is certainly one way to achieve predictable results but I think we can safely assume that we'd rather have a whole bunch of nines when it comes to availability.

To get to that point you need to know where you are and where you've been. You don't know how available your product is unless you are measuring it. You don't know areas of vulnerability unless you keep track of what areas experience failures. You may develop a "gut feel" for problem areas but one should have data to back that up.

Various availability metrics require you to measure the performance of your product. Off the top of my head, here are some of the things one would want to track for an enterprise storage product when measuring and improving availability.
  • Number of installations
  • Number of downtime events
  • Dates of downtime events
  • Duration of downtime events
  • Hours of run-time
  • Versions of software/hardware being used at downtime events
  • Trigger/stimulus that caused downtime events
  • Component responsible for downtime events
  • Fixes that went into various releases
When you're performing your own testing internally, getting these numbers is not particularly difficult. However, the true measure of availability comes when a product is being used by customers in their actual use cases. This is where it behooves you to have a good relationship with your customers so you can get this valuable data from them. Of course, in the enterprise it is quite likely that a vendor and customer will have service agreements which can make much of this automatic.

The first thing these numbers can give us is a snapshot of our product's quality. We can get raw numbers for availability, mean times between failures, outage durations, etc.
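To make that concrete, here's a minimal sketch (in Python, with entirely made-up numbers and field names - nothing here reflects any real product's data) of how the raw measurements listed above can be boiled down into availability, mean time between failures, and mean outage duration:

```python
from dataclasses import dataclass

@dataclass
class DowntimeEvent:
    # One outage record, mirroring the bullet list above (all fields hypothetical).
    date: str              # when the outage began
    duration_hours: float  # how long the product was down
    version: str           # software/hardware version in use at the time
    trigger: str           # stimulus that caused the event
    component: str         # component held responsible

def summarize(events, fleet_runtime_hours):
    """Boil raw measurements down to availability, MTBF, and mean outage duration."""
    total_down = sum(e.duration_hours for e in events)
    availability = 1.0 - total_down / fleet_runtime_hours
    mtbf = fleet_runtime_hours / len(events) if events else float("inf")
    mean_outage = total_down / len(events) if events else 0.0
    return availability, mtbf, mean_outage

# Made-up fleet: 1,000 installations each running a full year (8,760 hours).
events = [
    DowntimeEvent("2013-06-02", 0.5, "5.1", "power loss", "director"),
    DowntimeEvent("2013-11-17", 2.0, "5.2", "bad zoning change", "front-end port"),
]
availability, mtbf, mean_outage = summarize(events, fleet_runtime_hours=1000 * 8760)
print(f"Availability: {availability:.6%}  MTBF: {mtbf:,.0f} h  Mean outage: {mean_outage:.1f} h")
```

The arithmetic is trivial; the hard part, as noted above, is actually getting honest, complete numbers out of the field.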

We can also leverage this data to help make decisions internally. If patterns emerge in typical triggers and/or responsible components, they can point the way to where improvements are needed. If a new release sees a spike in outages, that will point to the need for decisions like pulling the release and/or issuing a patch. And knowing where various fixes have been made, coupled with knowledge of customer use cases, can indicate which customers will benefit most from performing a software upgrade.

Though it is a delicate matter, sharing metrics with customers helps them make their own decisions in setting up their datacenters. Knowing what components are most vulnerable will illustrate where redundancy is required. And of course such knowledge will play a large part in what products they choose to purchase and what they are willing to pay.

Tuesday, January 28, 2014

Boom - How Do You Protect Data if the Datacenter Ceases to Exist?


Sometimes corporate videos can be a bit on the cheesy side, but I really like the one I'm posting above, as it allows me to explain to my family one of the main things the product I work on, VPLEX, does. It also gets some extra credit: early in my career with EMC I had the opportunity to work with Steve Todd, one of the people in the video.

One of the ultimate problems in protecting data is what to do if the datacenter actually ceases to exist. In my career I've seen this happen for many reasons. On the low end, there is the problem of power outages or maintenance windows which temporarily take a datacenter offline - the datacenter ceases to exist for a finite amount of time. On the more extreme end, the datacenter can cease to exist permanently. Extreme weather, often accompanied by flooding, is one culprit that takes out datacenters. Similarly, bolts of lightning have destroyed datacenters. Though the loss of data pales next to the loss of life, the September 11 terrorist attacks did illustrate how deliberate acts of destruction could also affect data. And as I mentioned in my first post, loss of data can in some cases lead to loss of life. While researching this topic, I found an article at PC World quoting Leib Lurie, CEO of One Call Now:
The 9/11 attacks "geared people toward a completely different way of thinking," Lurie said. "Everyone has always had backup and colocation and back-up plans, every large company has. After 9/11 and [Hurricane] Katrina and the string of other things, even a three-person law firm, a three-person insurance agency, a doctor with his files, if your building gets wiped out and you have six decades of files, not only is your business gone, not only is your credibility gone, but you're putting hundreds of lives at risk." 
The loss of a doctor's records could be fatal in some cases, and with the loss of a law firm's records, "you could have people tied in knots legally until you find alternative records, if you find them," Lurie said.

There are various options a storage administrator can take to protect a datacenter. At the very least there would need to be a backup stored off-site, with backup methods ranging from periodic snapshots to continuous data replication. And, amazingly enough, EMC provides solutions for both of these options, with Mozy and RecoverPoint. (Hey, I warned you that while I'm not writing on behalf of my employer, I'm still a fan.) Mozy is geared more for the PC and Mac environment, taking periodic snapshots and backing them up to the Mozy-provided cloud, whereas RecoverPoint uses journaling to keep track of every single write, allowing for very precise rollbacks.
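I'm obviously not going to describe how RecoverPoint is actually built, but the underlying idea - journal every write with a timestamp so the data can be reconstructed as of any precise point in time - can be sketched in a few lines of purely illustrative Python:

```python
class WriteJournal:
    """Toy write journal: every write is recorded with a timestamp so the data
    can be reconstructed as of any point in time. Purely illustrative - a real
    product journals at the block level with vastly more care and efficiency."""

    def __init__(self):
        self.entries = []  # (timestamp, block, data), appended in time order

    def record(self, timestamp, block, data):
        self.entries.append((timestamp, block, data))

    def image_as_of(self, timestamp):
        # Replay every write up to and including the requested time.
        image = {}
        for t, block, data in self.entries:
            if t <= timestamp:
                image[block] = data
        return image

journal = WriteJournal()
journal.record(100, block=7, data="good data")
journal.record(200, block=7, data="corrupted by a bad application run")
print(journal.image_as_of(150))  # {7: 'good data'} - rolled back to before the corruption
```

The price of that precision is journal space and the bookkeeping of every single write, which is part of why this style of protection lives in the enterprise rather than on a home PC.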

These options allow for disaster recovery. If the disaster occurs you will experience an outage, albeit one you can recover from. As my family's IT manager I find that suits our needs very well - when my wife replaced her laptop we simply told Mozy to restore to a new laptop. I myself tend to be more in the cloud full-time and use Google Drive as my main storage, giving me replication and the ability to recover.

However, while recovering from an outage without loss of data (or with minimal loss with some solutions) is fantastic, many enterprise solutions need continuous availability. Recovery from a disaster is not sufficient. I know I would have been rather annoyed if my bank had told me my information was not available while they recovered from backup in the aftermath of one of the many storms that have hit us here in Massachusetts over the past several years. That's where a solution which allows for continuous availability, even in the event of the destruction of a datacenter, becomes essential. That's one of the things that my product, VPLEX, does - you are able to mirror writes to two sites separated by substantial distances. And just as importantly, it is possible to read the same data from either datacenter. If one datacenter ceases to exist, the other one is still up and able to continue operating (as the rather dramatic video at the beginning of this post illustrates). And for even more protection, many customers combine the RecoverPoint and VPLEX products, allowing for both rollback and continuous availability.

All of this comes at a cost, making users balance their availability needs vs. their budget. Not all applications need continuous availability. But those that do need to be able to endure a wide variety of potential problems, ranging from maintenance windows all the way to the datacenter destruction described here. And providing these solutions presents challenges a vendor must address. For example, some of the most obvious include:

  • When does an availability solution tell a host that a data write has completed? This question becomes more and more critical as the latency between sites increases. (See the sketch after this list.)
  • What does an availability solution do if a remote site cannot be seen? How does it determine whether the remote site has had an outage or the communications link has been severed?
  • How is re-synchronization handled when two or more separated sites are brought back together?
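None of these questions have a single right answer, and I'm not describing how VPLEX actually behaves, but the first question can be illustrated with a toy synchronous mirror that acknowledges a write only after both legs have confirmed it (all names here are made up):

```python
import concurrent.futures

class Site(dict):
    """Stand-in for the storage at one datacenter."""
    def write(self, block, data):
        self[block] = data

class SyncMirror:
    """Toy synchronous mirror across two sites: a write is acknowledged to the
    host only after BOTH legs confirm it, so either site alone holds a complete
    copy. Purely illustrative - not a description of any real product."""

    def __init__(self, site_a, site_b):
        self.sites = [site_a, site_b]
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

    def write(self, block, data):
        # Issue the write to both legs in parallel. The host-visible latency is
        # roughly that of the slower leg, which is why inter-site distance matters.
        futures = [self.pool.submit(site.write, block, data) for site in self.sites]
        for f in futures:
            f.result(timeout=5.0)  # raises if a leg fails or times out
        return "ack"               # only now does the host see the write complete

mirror = SyncMirror(Site(), Site())
print(mirror.write(block=42, data="payload"))  # "ack" once both sites hold the data
```

The design trade-off is visible in that last line of write(): the host waits on the slower of the two legs, which is exactly why the question gets harder as the distance between sites grows.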


Monday, January 27, 2014

What is an Outage?

One of the topics I'm going to be talking about is that of metrics. In order to have any hope of knowing a product's availability, you've got to measure how often it goes down and for how long. You've got to know what the product's install base is at any given time. These are topics I'm going to delve into as this blog matures.

But let's start with what might seem simple - the question "what is an outage?" Sometimes the answer is very obvious. If a SAN switch just goes offline then that switch has clearly experienced an outage.

On the other hand, sometimes the answer is less obvious. For example, consider the following typical arrangement:


You've got your host connected to a storage array via two switches, one of which has failed. On the host is some multipathing software which routes any I/O request to the storage array through one of the two switches. If your product is responsible for providing the multipathing on the host, then responsibility for the outage does not rest solely with the switch vendor - indeed, a customer may view it as resting solely with the provider of the multipathing software if they had planned their environment around the fact that these hypothetical switches have a lower availability.

Let's add some complexity to the mix. What if the multipathing software is working fine but the customer did not configure it correctly? For example, suppose it was set up to use only one path and would only fail over to the other path upon a manual request. This brings to mind one of my least favorite phrases - "customer error". But one must be extremely unwilling to make that the root cause for any outage. Was the multipathing documentation clear? Did it alert the user to the fact that one failure could cause an outage?

And consider taking it to a greater extreme. Suppose the multipathing software is configured perfectly and after the switch fails all I/O is routed to the other switch. But then suppose a few hours later the other switch fails too. Is the multipathing software absolved? Not necessarily. Did the multipathing software make it clear to the user that it was one failure away from unavailability? And how it makes that clear is vital. Is that information buried in some log, or is it an alarm that screams for an administrator's attention?
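To make the point concrete, here's a toy sketch (hypothetical names, nothing resembling any real multipathing product) of the behavior I'm arguing for: fail over automatically, and make noise the moment the system is one failure away from an outage.

```python
class MultipathDevice:
    """Toy multipath driver: route I/O down any healthy path, and raise a loud
    alert the moment only one healthy path remains. Hypothetical names only."""

    def __init__(self, paths, alert):
        self.paths = dict.fromkeys(paths, True)  # path name -> healthy?
        self.alert = alert

    def mark_failed(self, path):
        self.paths[path] = False
        healthy = [p for p, ok in self.paths.items() if ok]
        if len(healthy) == 1:
            # Don't bury this in a log - the customer is one failure from an outage.
            self.alert(f"WARNING: only one healthy path remains ({healthy[0]})")
        elif not healthy:
            self.alert("ALL PATHS DOWN: I/O has stopped - this is the outage")

    def submit_io(self, request):
        for path, healthy in self.paths.items():
            if healthy:
                return f"sent '{request}' via {path}"
        raise IOError("no healthy paths available")

dev = MultipathDevice(["switch-A", "switch-B"], alert=print)
dev.mark_failed("switch-A")            # prints the 'one failure away' warning
print(dev.submit_io("read block 7"))   # I/O continues via switch-B
```

Whether that alert goes to a log file or to something that actually gets an administrator's attention is precisely the difference I'm talking about.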

At the end of the day, an outage really is defined as "whatever the customer says it is". And those who are truly working at maximizing availability will go beyond even that definition. You want to provide a product that a customer not only doesn't have to worry about, but that actively relieves worry - the customer is secure in the knowledge that your product is there. When you are providing a product at the enterprise level, any failure of your product has consequences. At the very least, the person who signed off on purchasing your product may have his or her job jeopardized. Beyond that, money can be lost, power grids can go offline, organizations can be unable to operate, and people can actually die.

Thursday, January 23, 2014

Welcome

Welcome to my professional corner of the internet. And who am I?

I'm Dan Stack, a software engineer, currently employed by EMC Corporation. And first things first, while I intend this blog to deal with "professional" topics in the world of storage, virtualization, etc., the opinions in this blog are my own. I am not representing my employer and am not posting on my employer's behalf. Though I'm clearly going to be somewhat biased. I like the products I work on, the people I work with, and the job that I do.

So, to quote Office Space, "What would ya say...ya do here?" I've got a few roles at present. I'm a software engineer on our VPLEX platform. Specifically, I am both our Defect Czar and our Software DU/DL Lead. I'll likely go into more detail over time about what that means, but the short form is that one of my main jobs is to serve as our conscience, to be the voice of the customer, and to help drive software architecture decisions with that in mind. I sit in an unusual space between software development, quality engineering, and customer support. Bringing me to this point, I've had years and years of experience in software development along with experience in advanced customer support.

Over the years I've become very passionate about the quality, reliability, and availability of the products I've worked on. I've grown to feel that one of my least favorite phrases is "customer error". For some in the development world it's a difficult journey - sometimes you look at an error that occurs in the field and you wonder "what were they thinking?" And typically, the answer is "something you didn't account for". I imagine it's impossible to make a 100% bulletproof product, but I do believe that it's a worthy goal to aspire towards. An example of this in another industry can be seen in improvements made to automobiles. One could argue pretty reasonably that it is not the responsibility of a vehicle to keep its driver awake. But given the danger of drivers falling asleep, there have been advances in automotive safety to detect drowsy drivers. Sure, this is correcting the driver's "fault". But a quick search of the internet will find several studies indicating that driver drowsiness is at fault for a significant number of automobile accidents.

Can someone die from a datacenter outage? Almost certainly. Consider how advances in technology have given rise to computerized monitoring of patients. Now consider all the monitors in a hospital intensive care unit or neonatal care unit going offline.

Without even trying very hard I could think of tons of examples of real-world impacts from datacenter outages, ranging from the inconvenient (an MMO game going offline) to the financially disastrous (lost or erroneous transactions) to the deadly (hospital outages). And this doesn't even begin to cover malicious intent, as can be seen in the many news stories of stolen credit cards.

In this blog I hope to talk about these kinds of issues. What can go wrong. What can be done to protect users. What can be learned from these sorts of events. What are some success stories. My own area of expertise is in virtualization but I'm going to try to not limit myself to that. One thing I won't be doing is divulging any proprietary information that I have access to. If I ever go beyond the hypothetical in issues that I've worked with personally I'll be certain to use only publicly available information.

I don't view this as a blog likely to be updated on a daily basis. I'm going to shoot for an update every few weeks, though I imagine early on I'll be updating a bit more frequently. There might be a little jumpiness early on as well as I find my voice in this blog. (I've done, and continue to do, other blogs, though those have been more hobby/personal-based.)

Why the title six-nines? What that means is that your system will be up (and available!) 99.9999% of the time. In a typical year you could expect about 30 seconds of downtime. It's often considered a golden number, but in some applications even that is insufficient. There are other measures I'll be discussing, such as mean time between failures, outage duration, etc. (For example, if there's 30 seconds of downtime in a typical year, is that all at once, six five-second outages, one minute over two years, etc.?)
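For the curious, the arithmetic behind that "about 30 seconds" is simple enough to sketch; here's a quick Python snippet showing the downtime budget for a given number of nines (the exact figure for six nines works out to roughly 31.5 seconds per 365-day year):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (ignoring leap years)

def downtime_budget_seconds(nines):
    """Seconds of downtime allowed per year at a given number of nines."""
    unavailability = 10 ** -nines       # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability

for n in range(3, 7):
    print(f"{n} nines: {downtime_budget_seconds(n):>9,.1f} seconds of downtime per year")
# 3 nines:  31,536.0  (~8.8 hours)
# 4 nines:   3,153.6  (~53 minutes)
# 5 nines:     315.4  (~5.3 minutes)
# 6 nines:      31.5  (about half a minute)
```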