Monday, August 18, 2014

Failing Safe

How does a system handle an "impossible" scenario? My background is originally in software development, and it's a question the teams I've worked on have often been faced with.

I'm going to do a little software explaining here, but I'll endeavor to keep things pretty general. Software functions are designed to perform certain tasks with different parameters given to them each time. For example, a device driver writing to a traditional hard disk drive will need to know things like the sector to write to, where the data to be written is located, etc. One of the first things any function will do is validate the parameters it has been given. Often this is done with assertion statements. Basically you assert that a parameter needs to meet certain conditions - for example, the data being written has to come from a valid place. If it doesn't, the program halts execution. That's great in a development environment - it stops things at the moment a problem occurs, allowing a developer to fix it right away. Sometimes when software ships the assertions are turned off; other times they are left on, under the assumption that these conditions are impossible and that if the software reaches them it is better to shut down than to continue operating.
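To make that concrete, here's a minimal sketch in C of the kind of parameter validation described above. The structure, the function, and the capacity constant are all hypothetical stand-ins for whatever a real driver would use; the point is that the same checks that halt a debug build can instead fail a single operation in a production build.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical write request handed to a disk driver. */
struct write_request {
    unsigned long sector;   /* destination sector on the disk      */
    const void   *data;     /* source buffer holding data to write */
    size_t        length;   /* number of bytes to write            */
};

#define DISK_SECTOR_COUNT 1048576UL  /* assumed capacity, for illustration */

int disk_write(const struct write_request *req)
{
    /* Development builds: halt the instant an "impossible" argument shows up. */
    assert(req != NULL);
    assert(req->data != NULL);
    assert(req->length > 0);
    assert(req->sector < DISK_SECTOR_COUNT);

    /* Production builds (compiled with NDEBUG, so the asserts vanish):
       the same conditions fail just this one I/O instead of the system. */
    if (req == NULL || req->data == NULL || req->length == 0 ||
        req->sector >= DISK_SECTOR_COUNT)
        return -EINVAL;

    /* ... perform the actual write here ... */
    return 0;
}
```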

This is something that is difficult to balance. The example I gave above is trivial - handling errors in that case is very easy. What is often the case, however, is that you are dealing with a complicated system where function A calls function B, which calls function C, which calls function D, all the way down to function Z. And there are numerous ways to get to function Z. And the same parameter is checked in all the preceding functions - and was valid every time. Suddenly it is invalid. Often this happens due to a coding or design error - for example, some asynchronous process stamps on the parameter, dodging the protective measures designed to stop that from taking place.
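Here's a contrived C sketch of how that kind of clobbering can happen. Every name here is invented for illustration; the idea is simply that an asynchronous path that skips the lock can invalidate a field that every caller up the chain already validated, so the last function in the chain still has to decide what to do when it sees the "impossible" value.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical context validated at each layer of the call chain. */
struct io_context {
    void           *buffer;  /* functions A through Y all checked this is non-NULL */
    pthread_mutex_t lock;    /* the protection the buggy path below forgets to take */
};

/* Function Z, the bottom of the chain.  The buffer was non-NULL in every
   caller, yet it can be NULL here if the asynchronous teardown ran in
   between.  Failing this one operation beats halting the whole system.  */
bool function_z(struct io_context *ctx)
{
    if (ctx->buffer == NULL)   /* the "impossible" condition */
        return false;
    /* ... use ctx->buffer ... */
    return true;
}

/* Buggy asynchronous teardown: it frees and clears the buffer without
   taking ctx->lock, dodging the protective measure everyone else uses. */
void async_teardown(struct io_context *ctx)
{
    free(ctx->buffer);
    ctx->buffer = NULL;        /* stamps on the parameter mid-flight */
}
```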

I don't want to trivialize how difficult these problems are to discover. These "impossible" issues typically require a very convoluted sequence of events to manifest and are even more difficult to reproduce. There are many methodologies, in existence or being developed, designed to prevent these issues from being introduced or to catch them if they are.

This long-winded prelude isn't about the effort of preventing and catching these types of issues - something any product team should absolutely be investing in. What I am talking about here is what to do in those cases where you miss some permutation and end up in an "impossible" situation. What is the best thing to do in those situations?

Under such circumstances, the most reliable systems will fail in the safest possible way. In a best-case scenario, the system will simply fail a single I/O operation. If that is not possible, it should fail in the least destructive manner possible. It would be better, for example, to take a single volume down than to take down an entire array. It would be better to fail a single blade than to fail an entire blade server. A good metaphor (and one now used more in computer networking than in its original sense) is that of a firewall. A physical firewall, while not able to put out a fire, is designed to stop it from spreading further.

While this may seem self-evident, realizing it typically means developing additional software whose main purpose is to fence off a problem that was often introduced by the software itself - a concept that may seem a bit odd. Essentially, the software enters a mode designed to get back to a point of stability while doing the least possible damage.

One way of mitigating this challenge is not asking the software to do the impossible, or even to fully recover without outside intervention. For example, if a storage volume needs to be taken offline due to an "impossible" situation, it would be nice if it could be brought back up automatically, but the main goal in failing safe is to minimize the damage - for example, not taking other volumes down or failing the entire system. In such cases it may take manual support intervention to bring that volume back online. Not great, but far preferable to no storage being available.
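A minimal sketch of what such a fail-safe path might look like, again in C with entirely hypothetical names: on an "impossible" condition the affected volume is fenced off and left for support to recover, rather than the whole array being brought down.

```c
#include <stdio.h>

enum volume_state { VOLUME_ONLINE, VOLUME_OFFLINE };

/* Hypothetical per-volume record within a multi-volume array. */
struct volume {
    int               id;
    enum volume_state state;
};

/* Invoked when a volume hits an "impossible" internal condition.
   Rather than halting the entire array, fence off just this volume,
   record why, and leave recovery to a later support action.         */
void volume_fail_safe(struct volume *vol, const char *reason)
{
    vol->state = VOLUME_OFFLINE;   /* contain the damage to one volume */
    fprintf(stderr,
            "volume %d taken offline (%s); other volumes remain online\n",
            vol->id, reason);
    /* ... notify monitoring and persist diagnostic state for support ... */
}
```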

Some imagination is required when developing such protective measures. To again use a physical metaphor, consider the loss of RMS Titanic. She did have bulkheads designed to handle the possibility of flooding. However, they were inadequate - they did not extend high enough, so water from one compartment was able to spill over and flood the next.

As I wrote earlier, it is far better for these situations to never arise and that should be the goal. However, in my many years in the software industry I have yet to find a perfect product. As such, systems should be designed to fail in the least destructive way.


Image Credits:
Nuke Image - Copyright: Eraxion / 123RF Stock Photo

"Titanic side plan annotated English" by Anonymous - Engineering journal: 'The White Star liner Titanic', vol.91.. Licensed under Public domain via Wikimedia Commons.

Friday, June 27, 2014

Can You Hear Me Now? Being on the Receiving End of a Data Unavailability




Oddly, I find myself today (and yesterday) on the customer side of a Data Unavailability situation, and it seemed too good an opportunity not to reflect upon it. Unlike much of what I write about, I don't know what the "true" story is. In some ways that makes it easier to write - as someone with some amount of inside knowledge, I have to be very mindful to be respectful of company secrets and proprietary information. On the other hand, it's somewhat frustrating, both because it's an annoyance stopping me from doing something I want to do and because I'd really like to know what is going on.

Anyway, some background. I'm a techie. I love cycling through new phones, notebooks, tablets, etc. when finances allow and I've gotten good at navigating Gazelle, eBay, Amazon, etc. when it comes to selling my gently used gadgets to get the newest stuff. On the other hand my wife, while far from a technophobe, is perfectly content running a device into the ground, getting every bit of use out of it and upgrading only when she has a genuine need to do so. (Hmm, she's coming out a lot smarter than me...)

Anyways, my wife had finally reached the point where she found her phone, a Motorola Droid Razr Max, no longer meeting her needs. Given that I'm the family IT department, she gave me her requirements, I came up with a few phones that seemed right for her, and she decided on upgrading to a Samsung Galaxy S5. So I went ahead and placed the order on Verizon, even splurging on overnight shipping.

(It's probably worth noting that despite my annoyance with Verizon right now, in general I've been happy with our service from Verizon Wireless - though if they really want to make it up to me they can open up to Nexus phones...)

So the phone arrives yesterday. We try to activate it but no luck. I call Verizon customer support and they tell me they are upgrading their systems and to try again in about an hour. No problem. But an hour later, still no luck. Reading about other ways to activate a phone, I see there's a way to do so on My Verizon. But then I discover that's not working either. (As you can see by the screen captures.) Some Googling reveals that this has been a problem since yesterday morning, around 5 AM-ish. So it had been going on for over 14 hours by that point.

Next morning arrives. Still the same problem. And as I write this it's around 9:30 PM on day 2 of the outage. I did learn a bit more about what's going on, but mainly through perusing Twitter and other sources - and not from Verizon, which has not issued any statements beyond a note upon logging on to "My Verizon" that some services are currently unavailable and a pair of tweets issued quite recently....


The bad thing I've learned about the billing system being down, beyond the obvious inability to pay bills, is that one cannot activate a new phone. Similarly, people trying to put in insurance claims for broken phones are unable to verify they own Verizon phones.

On one hand, it's a First World Problem. For my family, it's a minor annoyance, delaying our ability to activate a shiny new phone. On the other hand, for some people it is proving a major inconvenience - those who rely on their Verizon phones as important parts of their personal or business lives and are unable to replace a broken phone or perform similar activities. And I've read it is agony for people trying to sell Verizon phones.

What I am observing from the outside is just how frustrating it is not knowing what's going on. Being unable to do what you've paid for (or even to make a payment) is frustrating enough. However, the lack of communication adds to the frustration. The latest tweet (indicating the issue has been identified) is a good communication, but similar tweets should have been issued periodically - e.g. "we are working to identify the issue", "we have called in additional support", etc. I understand there are certain things they are not at liberty to disclose, but there is a certain responsibility to paying customers. Indeed, it might not even be their "fault" in the sense that it may be an outside vendor's equipment that is failing (and if so, I'm crossing my fingers that it's nothing I've ever worked on in my career), but at the end of the day (well, two days in this case) the service they provide to their customers is compromised.

I'll also admit I'm super-curious what happened.

[Update - and the services are now back up.]

Sunday, May 25, 2014

Don't Press the Jolly Candy-Like Button


How can he possibly resist the maddening urge to eradicate history at the mere push of a single button? The beautiful, shiny button? The jolly, candy-like button? Will he hold out, folks? Can he hold out?

- Ren & Stimpy - "The Button"




So a customer has had a Data Unavailability Event and you have a root cause. Or better yet, you discover it internally before any customer encounters the problem. (Better still would have been to discover and fix the problem before shipping.)

This brings you to a difficult question - what do you tell your customer base about the problems you know about? Note that nothing I write in the following paragraphs should be construed to mean you should ever keep information from your customers - rather this represents some thoughts as to how you should communicate with your customers.

First and foremost, the need to do this sort of communication is indicative of a failure somewhere in your released product. I've worked for many companies and have yet to find one that doesn't experience these failures. The better ones, however, have processes in place to learn from these mistakes and improve their quality going forward.

So back to the original premise. What do you tell your customers? And how do you communicate with them? And the answer to both of these is the infamous "it depends".

I've seen the following types of criteria used:
  • Is it avoidable? Certain types of problems are consistently triggered by a certain type of use case under certain (presumably unusual) conditions. There may be a workaround to fulfill the use case without risking the Data Unavailability condition. Since this is indicative of a preventable problem, the communication should be loud and clear - in these cases you really should consider pushing it aggressively.
  • Is it an upgrade issue? I've seen some issues that happen during or after an upgrade. Sometimes this is due to a regression, sometimes due to something in the upgrade process, sometimes due to a new feature with an unintended effect. In these conditions you'd want to make certain that customers are aware of the potential issues prior to performing the upgrade. Including a list of these known issues in your upgrade documentation (and being able to update this documentation as new issues are discovered) allows your customers to make an intelligent decision regarding an upgrade. Some customers may be better off staying on an older version until a patch with a fix is available. Others may want to take a preventative action prior to performing the upgrade.
  • Is it a likely issue? This is not to say you should not tell customers about unlikely issues. Rather, likelihood indicates whether or not you should use aggressive broadcasts to make certain they are aware of the issue. If there's no workaround for a likely issue but a patch is available, you absolutely want to issue an alert to your customers.
  • Is this a severe issue? Any outage is severe. But if there is potential for a catastrophic outage with a long recovery time, then you need to give serious consideration to broadcasting the problem instead of just release-noting it.
  • Is there a recovery process? If there is some process to recover after a problem is hit, there should be documentation, easily located by customers, telling them how to get out of the situation. Knowledge base articles are a popular means of doing this.
Even an unlikely problem which has no avoidance mechanism and will automatically recover should be documented. An up-to-date set of release notes is a good place to put such information (often in addition to the above possibilities). What would a customer want to know? Typical items include:
  • What happens if the problem is hit?
  • Under what conditions will the problem be hit?
  • What is the likelihood of hitting the problem? Is it deterministic?
  • What can be done to avoid the problem?
  • What needs to be done to recover from the problem?
  • What versions of software are vulnerable to the problem?
The worst thing you could find yourself doing is explaining to a customer how you knew about the potential for a problem but didn't tell them.

Sunday, March 23, 2014

Mea Culpa, Mea Culpa, Mea Maxima Culpa - Handling Outages

Mea culpa, mea culpa, mea maxima culpa
Through my fault, through my fault, through my most grievous fault

I'm a post-Vatican II Catholic boy, but this sort of declaration is creeping back into the English translation of the Mass in its confession of sinfulness.

I've been in the industry long enough to see bugs in products I've been involved in have major impacts. In a previous life working on Frame Relay, ATM, and SONET/SDH, I saw major communications backbones go down as a result of software defects. In the storage world I've seen similar results when data becomes unavailable (or worse, lost). I've seen public webpages go down, stores unable to sell their products, and university networks go offline.

Several years ago I made the move from development to a more customer-facing role, in a group dedicated to providing engineering support. (It occurs to me that in reality development is the ultimate in customer-facing roles...) In other words, when the official support personnel needed to talk to engineering, we were the people they'd talk to. We'd be the engineers who would be on calls with customers. On one of the first calls I was on, I witnessed an understandably angry customer dealing with the fact that it was a peak time of the year for their business and their store was offline as a result of a problem with our product. I watched a fairly senior person on the call demonstrate a very important trait - empathy. We were far from having a root cause for the incident, and there was no way to know at that early stage if it was a misconfiguration, a problem with a connected product, or a problem with our product. (I seem to recall it was a combination of all three.) But the most important thing that person did that early in the incident was ensure the customer was aware that we knew, that we grokked, just how serious the problem was for them and that we were doing everything we could to get to a root cause. And then to back that up - to have a communication plan and to follow up on it.

Communication is tricky when an incident is still going on. Often when figuring out a problem you go down several false leads until you uncover the true root cause. I know that in my time in development I was always very loath to go public with theories. However, this is the sort of thing that does need to be communicated to a customer, obviously with the caveat that it is an ongoing investigation.

One thing that needs to be kept in mind: the customer is going to want a root cause, but in nearly all cases the customer's highest priority is to be back up and running. If it comes to a decision between gathering and analyzing data over several hours and performing an administrative action that will end the outage in minutes, most customers are going to want to be back up. Obviously this is something to discuss with customers. The point, though, is that this is the customer's time and money. It is during development and testing that a vendor has the luxury of performing detailed analysis and experiments on a system suffering from an outage - and this points to how critical it is to "shift left" the discovery of problems. A customer discovering a problem is a very expensive escape. Now, in some situations a full root cause will be necessary in order to end the outage, but this is not always the case.

Typically after an outage is over a customer is going to want to know more about why the problem occurred, when it will be fixed, how they can avoid hitting it again, etc. These are topics I'll be covering at a later point.

Image credit: ewe / 123RF Stock Photo

Tuesday, February 25, 2014

Too Much Availability

Late last year I picked up a few Christmas presents at Target, paying with my Visa Card. Recently I, like many others, received a replacement Visa Card from my bank. This is due to the massive theft of customer credit card numbers from Target.

While this blog naturally assumes that maximizing availability is a good thing, it carries with it the understood caveat that this availability is for authorized users only. Data unavailability is definitely desired when an unauthorized user attempts to download credit card information.

In the same way that one must track defects that can cause loss of availability, one must also be aware of what can introduce security vulnerabilities.

In his blog, Brian Krebs has shone some light on just what apparently happened at Target, with the caveat that much of this is unconfirmed (i.e. it might be all wrong; I will correct this as required). Essentially, an HVAC vendor used by Target was the victim of a hacker attack and had its credentials stolen without its knowledge. This allowed the hackers access to Target's external vendor billing system. From there they were able to gain access to customer credit card numbers (like mine, apparently...). It is still unclear how it was even possible to make the leap from the vendor billing system to individual consumers' credit card data.

At this time I'm not going to dive into every possible improvement to be learned from this. The higher-level point is that while a product must have high availability, there are strong negative consequences to being available to the wrong people. The Target breach is a high-profile example. However, there are tons of little ones - stories of people's cloud-based pictures or emails briefly becoming publicly visible, hackers selling lists of social security numbers, etc. Just as faith in a product can be damaged by lack of availability, so too can it be damaged if the wrong people have access to data.




Thursday, February 13, 2014

Hey Look We're Famous

In my first post I mentioned that sometimes six nines of availability isn't enough. So it was neat to see the product I primarily work on mentioned in the following tweet:

That said, as one of the many people who obsesses about maximizing availability, I work with the realization that one outage is one outage too many.

Wednesday, February 5, 2014

Availability Metrics

How reliable is your product?

Really reliable.

How often does it suffer an outage?

Never ever ever ever! It's super-duper reliable!

How do you know?


Back when I first began my career, in another millennium, Six Sigma was becoming a big buzzword. As a co-op at Pratt & Whitney I took my first class in it, and I've received training in it throughout my career.

The process has its detractors, some pointing out it's just a method of formalizing some common-sense techniques. In my mind its two endpoints - the final goal of predictable results and the starting point of needing data - are both critical when it comes to availability. Now, being down all the time is certainly one way to achieve predictable results, but I think we can safely assume that we'd rather have a whole bunch of nines when it comes to availability.

To get to that point you need to know where you are and where you've been. You don't know how available your product is unless you are measuring it. You don't know areas of vulnerability unless you keep track of what areas experience failures. You may develop a "gut feel" for problem areas but one should have data to back that up.

Various availability metrics require you to measure the performance of your product. Off the top of my head, here are some of the things one would want to measure in an enterprise storage product when measuring and improving availability:
  • Number of installations
  • Number of downtime events
  • Dates of downtime events
  • Duration of downtime events
  • Hours of run-time
  • Versions of software/hardware being used at downtime events
  • Trigger/stimulus that caused downtime events
  • Component responsible for downtime events
  • Fixes that went into various releases
When you're performing your own testing internally, getting these numbers is not particularly difficult. However, the true measure of availability comes when a product is being used by customers in their actual use cases. This is where it behooves you to have a good relationship with your customers so you can get this valuable data from them. Of course, in the enterprise it is quite likely that a vendor and customer will have service agreements which can make much of this automatic.

The first thing these numbers can give us is a snapshot of our product's quality. We can get raw numbers for availability, mean times between failures, outage durations, etc.
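As a rough illustration of what those raw numbers look like, here's a small C sketch that computes availability and mean time between failures from the kinds of counters listed above. The structure and the sample figures are invented purely for illustration.

```c
#include <stdio.h>

/* Hypothetical fleet-wide counters gathered from customer installations. */
struct fleet_stats {
    double        run_time_hours;  /* total hours of run-time across installs */
    double        downtime_hours;  /* total hours spent in outages            */
    unsigned long outage_count;    /* number of downtime events               */
};

/* Availability: fraction of total time the fleet was actually up. */
static double availability(const struct fleet_stats *s)
{
    return s->run_time_hours / (s->run_time_hours + s->downtime_hours);
}

/* Mean time between failures, in run-time hours per outage. */
static double mtbf_hours(const struct fleet_stats *s)
{
    return s->outage_count ? s->run_time_hours / s->outage_count : 0.0;
}

int main(void)
{
    /* Made-up numbers: a thousand systems running for a year (8,760,000
       hours of run-time) with three outages totaling six hours of downtime. */
    struct fleet_stats s = { 8760000.0, 6.0, 3 };

    printf("availability: %.7f\n", availability(&s));   /* roughly six nines */
    printf("MTBF: %.0f run-time hours per outage\n", mtbf_hours(&s));
    return 0;
}
```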

We can also leverage this to help make decisions internally. If patterns emerge in typical triggers and/or responsible components, they can point the way to where improvements are needed. If a new release sees a spike in outages, it will point to the need for decisions like pulling the release and/or issuing a patch. And knowing where various fixes have been made, coupled with knowing customer use cases, can indicate which customers will benefit most from performing a software upgrade.

Though it is a delicate matter, sharing metrics with customers helps them make their own decisions in setting up their datacenters. Knowing what components are most vulnerable will illustrate where redundancy is required. And of course such knowledge will play a large part in what products they choose to purchase and what they are willing to pay.