Monday, August 18, 2014

Failing Safe

How does a system handle an "impossible" scenario? My background is in software development, and it's a question the teams I've been on have often faced.

I'm going to do a little software explaining here, but I'll endeavor to keep things pretty general. Software functions are designed to perform certain tasks with different parameters given to them each time. For example, a device driver writing to a traditional hard disk drive will need to know things like which sector to write to, where the data to be written resides, and so on. One of the first things any function will do is validate the parameters it is dealing with. Often this is done with assertion statements. Basically, you assert that a parameter must meet certain conditions - for example, that the data being written comes from a valid place. If it doesn't, the program will halt execution. That's great in a development environment - it stops things at the moment a problem occurs, allowing a developer to fix it right away. Sometimes when software ships the assertions are turned off; other times they are left on, under the assumption that these conditions are impossible and that if the software reaches them it is better to shut down than to continue operating.
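To make that concrete, here's a minimal sketch in Python (the names and constants are hypothetical, and real driver code would of course live in a lower-level language) of a write routine validating its parameters with assertions before doing any work:

```python
SECTOR_SIZE = 512          # bytes per sector (assumed for illustration)
TOTAL_SECTORS = 1_000_000  # capacity of our hypothetical disk

def write_sector(sector: int, data: bytes) -> None:
    """Write one sector's worth of data to a hypothetical disk.

    The assertions document the caller's contract: a valid sector number
    and a buffer of exactly one sector. In a development build they halt
    the program the moment the contract is violated; in a shipped build
    they may be compiled out (in Python, running with -O disables them).
    """
    assert 0 <= sector < TOTAL_SECTORS, f"sector {sector} out of range"
    assert data is not None, "data buffer must not be None"
    assert len(data) == SECTOR_SIZE, f"expected {SECTOR_SIZE} bytes, got {len(data)}"

    # ... the actual I/O would happen here ...

# Valid call: passes the assertions.
write_sector(42, bytes(SECTOR_SIZE))

# Invalid call: trips the first assertion (uncomment to see it halt).
# write_sector(-1, bytes(SECTOR_SIZE))
```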

This is something that is difficult to balance. The example I gave above is trivial - handling errors in that case is very easy. However, what is often the case is that you are dealing with a complicated system where function A calls function B, which calls function C, which calls function D, all the way down to function Z. And there are numerous ways to get to function Z. And the same parameter is checked in all the preceding functions - and was valid every time. Suddenly it is invalid. Often this happens due to a coding or design error - for example, some asynchronous process stomps on the parameter, dodging the protective measures designed to stop that from happening.
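A toy illustration of the kind of thing that can go wrong - purely hypothetical, and compressed into a few lines of Python - is a background task quietly clearing a field that every function in the chain has already checked:

```python
import threading
import time

class Request:
    def __init__(self, data):
        self.data = data  # every layer assumes this stays non-None

def layer_a(req):
    assert req.data is not None   # valid here...
    layer_b(req)

def layer_b(req):
    assert req.data is not None   # ...and still valid here...
    time.sleep(0.1)               # a window where something else can interfere
    layer_z(req)

def layer_z(req):
    # ...but by the time we get here, an asynchronous actor has stomped on
    # the field, and the "impossible" assertion fires.
    assert req.data is not None, "impossible: data vanished mid-call-chain"

req = Request(data=b"payload")
# The rogue asynchronous process, dodging whatever protections were intended.
threading.Timer(0.05, lambda: setattr(req, "data", None)).start()
layer_a(req)
```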

I don't want to trivialize how difficult these problems are to discover. These "impossible" issues typically require a very convoluted sequence of events to manifest and are even more difficult to reproduce. There are many methodologies, existing and emerging, designed to prevent these issues from being introduced or to catch them when they are.

This long-winded prelude is not about the effort of preventing and catching these types of issues - something any product team should absolutely be investing in. What I am talking about here is what to do in those cases where you miss some permutation and end up in an "impossible" situation. What is the best thing to do then?

Under such circumstances, the most reliable systems will fail in the safest possible way. In a best-case scenario, the system will simply fail a single I/O operation. If that is not possible, it should fail in the least destructive manner possible. It would be better, for example, to take a single volume down than to take down an entire array. It would be better to fail a single blade than to fail an entire blade server. A good metaphor (and one now more commonly used in computer networking than in its original sense) is that of a firewall. A physical firewall, while not able to put out a fire, is designed to stop it from spreading further.

While this may seem self-evident, realizing it typically means developing additional software whose main purpose is to fence off a problem the software itself often introduced - a concept that may seem a bit odd. Essentially, the software enters a mode designed to get back to a point of stability while doing the least possible damage.

One way of mitigating this challenge is to not ask the software to do the impossible, or even to fully recover without outside intervention. For example, if a storage volume needs to be taken offline due to an "impossible" situation, it would be nice if it could be brought back up automatically, but the main goal in failing safe is to minimize the damage - for example, not taking other volumes down or failing the entire system. In such cases it may take manual support intervention to bring that volume back online. Not great, but far preferable to no storage being available.
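Here's a minimal sketch of that idea, assuming a hypothetical array made up of independent volumes: when an "impossible" condition is caught during an I/O, only the affected volume is fenced off rather than the whole system going down.

```python
class Volume:
    def __init__(self, name):
        self.name = name
        self.online = True

    def write(self, block, data):
        if not self.online:
            raise IOError(f"{self.name} is offline pending support intervention")
        if block < 0:                      # stand-in for an "impossible" condition
            raise AssertionError(f"impossible block number {block}")
        # ... the real write would happen here ...

class Array:
    def __init__(self, volumes):
        self.volumes = {v.name: v for v in volumes}

    def write(self, volume_name, block, data):
        vol = self.volumes[volume_name]
        try:
            vol.write(block, data)
        except AssertionError as err:
            # Fail safe: fence off only the affected volume. The rest of the
            # array keeps serving I/O; bringing this volume back may require
            # manual support intervention.
            vol.online = False
            print(f"fenced {volume_name}: {err}")

array = Array([Volume("vol1"), Volume("vol2")])
array.write("vol1", -5, b"oops")                  # "impossible" input: vol1 gets fenced
array.write("vol2", 10, b"still serving I/O")     # vol2 continues to operate normally
print([v.name for v in array.volumes.values() if v.online])   # -> ['vol2']
```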

Some imagination is required when developing such protective measures. To again use a physical metaphor, consider the loss of RMS Titanic. She did have bulkheads designed to handle the possibility of flooding. However, they were inadequate, not going high enough, such that the water from one compartment was able to flood other compartments.

As I wrote earlier, it is far better for these situations to never arise and that should be the goal. However, in my many years in the software industry I have yet to find a perfect product. As such, systems should be designed to fail in the least destructive way.


Image Credits:
Nuke Image - Copyright: Eraxion / 123RF Stock Photo

"Titanic side plan annotated English" by Anonymous - Engineering journal: 'The White Star liner Titanic', vol.91.. Licensed under Public domain via Wikimedia Commons.

Friday, June 27, 2014

Can You Hear Me Now? Being on the Receiving End of a Data Unavailability




Oddly, I find myself today (and yesterday) on the customer side of a Data Unavailability situation, and it seemed too good an opportunity not to reflect upon it. Unlike much of what I write about, I don't know what the "true" story is. In some ways that makes it easier to write - as someone with some amount of inside knowledge, I have to be very mindful to be respectful of company secrets and proprietary information. On the other hand, it's somewhat frustrating, both because it's an annoyance stopping me from doing something I want to do and because I'd really like to know what is going on.

Anyway, some background. I'm a techie. I love cycling through new phones, notebooks, tablets, etc. when finances allow and I've gotten good at navigating Gazelle, eBay, Amazon, etc. when it comes to selling my gently used gadgets to get the newest stuff. On the other hand my wife, while far from a technophobe, is perfectly content running a device into the ground, getting every bit of use out of it and upgrading only when she has a genuine need to do so. (Hmm, she's coming out a lot smarter than me...)

Anyways, my wife had finally reached the point where her phone, a Motorola Droid Razr Max, was no longer meeting her needs. Given that I'm the family IT department, she gave me her requirements, I came up with a few phones that seemed right for her, and she decided on upgrading to a Samsung Galaxy S5. So I went ahead and placed the order with Verizon, even splurging on overnight shipping.

(It's probably worth noting that despite my annoyance with Verizon right now, in general I've been happy with our service from Verizon Wireless - though if they really want to make it up to me they can open up to Nexus phones...)

So the phone arrives yesterday. We try to activate it but no luck. I call Verizon customer support and they tell me they are upgrading their systems and to try again in about an hour. No problem. But an hour later, still no luck. Reading about other ways to activate a phone, I see there's a way to do so on My Verizon. But then I discover that's not working either. (As you can see by the screen captures.) Some Googling reveals that this had been a problem since yesterday morning, around 5 AM-ish. So it had been going on for over 14 hours by this point.

Next morning arrives. Still the same problem. And as I write this it's around 9:30 PM on day 2 of the outage. I did learn a bit more about what's going on, but mainly through perusing Twitter and other sources - and not from Verizon, which has not issued any statements beyond a note upon logging on to "My Verizon" that some services are currently unavailable and a pair of tweets issued quite recently....


The bad thing I've learned about the billing system being down, beyond the obvious inability to pay bills, is that one cannot activate a new phone. Similarly, people trying to put in insurance claims for broken phones are unable to verify that they own Verizon phones.

On one hand, it's a First World Problem. For my family, it's a minor annoyance, delaying our ability to activate a shiny new phone. On the other hand, for some people it is proving a major inconvenience, such as those who depend on their Verizon phones for their personal or business lives and are unable to replace a broken phone or perform similar activities. And I've read it is agony for people trying to sell Verizon phones.

What I am observing from the outside is just how frustrating it is not knowing what's going on. Being unable to do what you've paid for (or to even make a payment) is frustrating enough. However, the lack of communication adds to the frustration. The latest tweet (indicating the issue has been identified) is a good communication, but similar tweets should have been issued periodically - i.e. "we are working to identify the issue", "we have called in additional support", etc. I understand there are certain things they are not at liberty to disclose, but at the end of the day there is a certain responsibility to paying customers. Indeed, it might not even be their "fault" in the sense that it may be an outside vendor's equipment that is failing (and if so, I'm crossing my fingers that it's nothing I've ever worked on in my career), but at the end of the day (well, two days in this case) the service they provide to their customers is compromised.

I'll also admit I'm super-curious what happened.

[Update - and the services are now back up.]

Sunday, May 25, 2014

Don't Press the Jolly Candy-Like Button


How can he possibly resist the maddening urge to eradicate history at the mere push of a single button? The beautiful, shiny button? The jolly, candy-like button? Will he hold out, folks? Can he hold out?

- Ren & Stimpy - "The Button"




So a customer has had a Data Unavailability Event and you have a root cause. Or better yet, you discover it internally before any customer encounters the problem. (Better still would have been to discover and fix the problem before shipping.)

This brings you to a difficult question - what do you tell your customer base about the problems you know about? Note that nothing I write in the following paragraphs should be construed to mean you should ever keep information from your customers - rather this represents some thoughts as to how you should communicate with your customers.

First and foremost, the need for this sort of communication is indicative of a failure somewhere in your released product. I've worked for many companies and have yet to find one that doesn't experience these failures. The better ones, however, have processes in place to learn from these mistakes and improve their quality going forward.

So back to the original premise. What do you tell your customers? And how do you communicate with them? And the answer to both of these is the infamous "it depends".

I've seen the following types of criteria used:
  • Is it avoidable? Certain types of problems are consistently triggered by a certain type of use case under certain (presumably unusual) conditions. There may be a workaround to fulfill the use case without risking the Data Unavailability condition. Since this is indicative of a preventable problem, the communication should be loud and clear - in these cases you really should consider pushing it aggressively.
  • Is it an upgrade issue? I've seen some issues that happen during or after an upgrade. Sometimes this is due to a regression, sometimes due to something in the upgrade process, and sometimes due to a new feature with an unintended effect. In these conditions you'd want to make certain that customers are aware of the potential issues prior to performing the upgrade. Including a list of these known issues in your upgrade documentation (and being able to update this documentation as new issues are discovered) allows your customers to make an intelligent decision regarding an upgrade. Some customers may be better off staying at an older version until a patch with a fix is available. Others may want to take a preventative action prior to performing the upgrade.
  • Is it a likely issue? This is not to say you should not tell customers about unlikely issues. Rather, likelihood informs whether or not you should use aggressive broadcasts to make certain they are aware of the issue. If there's no workaround for a likely issue but a patch is available, you absolutely want to issue an alert to your customers.
  • Is this a severe issue? Any outage is severe. But if there is something with the potential for a catastrophic outage with a long recovery time, then you need to give serious consideration to broadcasting the problem instead of merely release-noting it.
  • Is there a recovery process? After a problem is hit, if there is some process to recover there should be some documentation easily located by customers to tell them how to get out of the situation. Knowledgebase Articles are a popular means of doing this.
Even an unlikely problem which has no avoidance mechanism and will automatically recover should be documented. An up-to-date set of release notes is a good place to put such information (often in addition to the above possibilities). What would a customer want to know? Typical items include:
  • What happens if the problem is hit?
  • Under what conditions will the problem be hit?
  • What is the likelihood of hitting the problem? Is it deterministic?
  • What can be done to avoid the problem?
  • What needs to be done to recover from the problem?
  • Which versions of software are vulnerable to the problem?
The worst thing you could find yourself doing is explaining to a customer how you knew about the potential for a problem but didn't tell them.

Sunday, March 23, 2014

Mea Culpa, Mea Culpa, Mea Maxima Culpa - Handling Outages

Mea culpa, mea culpa, mea maxima culpa
Through my fault, through my fault, through my most grievous fault

I'm a post-Vatican II Catholic boy, but this sort of declaration is creeping back into the English translation of the Mass in its confession of sinfulness.

I've been in the industry long enough to see bugs in products I've been involved in have major impacts. In a previous life working on Frame Relay, ATM, and SONET/SDH, I saw major communications backbones go down as a result of software defects. In the storage world I've seen similar results when data becomes unavailable (or worse, lost). I've seen public webpages go down, stores unable to sell their products, and university networks knocked offline.

Several years ago I made the move from development to a more customer-facing role, in a group dedicated to providing engineering support. (It occurs to me that in reality development is the ultimate customer-facing role...) In other words, when the official support personnel needed to talk to engineering, we were the people they'd talk to. We'd be the engineers who would be on calls with customers. In one of the first calls I was on, I witnessed an understandably angry customer dealing with the fact that it was a peak time of the year for their business and their store was offline as a result of a problem with our product. I watched a fairly senior person on the call demonstrate a very important trait - empathy. We were far from having a root cause for the incident, and there was no way to know at that early stage whether it was a misconfiguration, a problem with a connected product, or a problem with our product. (I seem to recall it was a combination of all three.) But the most important thing that person did that early in the incident was ensure the customer was aware that we knew - that we grokked - just how serious the problem was for them, and that we were doing everything we could to get to a root cause. And then to back that up: to have a communication plan and to follow up on it.

Communication is tricky when an incident is still going on. Often when figuring out a problem you go down several false leads until you uncover the true root cause. I know that in my time in development I was always very loath to go public with theories. However, this is the sort of thing that does need to be communicated to a customer, obviously with the caveat that it is an ongoing investigation.

One thing that needs to be kept in mind: the customer is going to want a root cause, but in nearly all cases the customer's highest priority is going to be getting back up and running. If it comes to a decision between gathering and analyzing data over several hours and performing an administrative action that will end the outage in minutes, most customers are going to want to be back up. Obviously this is something to discuss with customers. The point, though, is that this is the customer's time and money. It is during development and testing that a vendor has the luxury of performing detailed analysis and experiments on a system suffering from an outage - and this points to how critical it is to "shift left" the discovery of problems. A customer discovering a problem is a very expensive escape. Now, in some situations a full root cause will be necessary in order to end the outage, but this is not always the case.

Typically after an outage is over a customer is going to want to know more about why the problem occurred, when it will be fixed, how they can avoid hitting it again, etc. These are topics I'll be covering at a later point.

Image credit: ewe / 123RF Stock Photo

Tuesday, February 25, 2014

Too Much Availability

Late last year I picked up a few Christmas presents at Target, paying with my Visa Card. Recently I, like many others, received a replacement Visa Card from my bank. This is due to the massive theft of customer credit card numbers from Target.

While this blog naturally assumes that maximizing availability is a good thing, that carries with it the understood caveat that availability is for authorized users only. Data unavailability is definitely desired when an unauthorized user attempts to download credit card information.

In the same way that one must track defects that can cause loss of availability, one must also be aware of what can allow security vulnerabilities.

In his blog, Brian Krebs has shone some light on just what apparently happened at Target, with the caveat that much of this is unconfirmed (i.e., it might be all wrong; I will correct this as required). Essentially, an HVAC vendor used by Target was the victim of a hacker attack and had its credentials stolen without its knowledge. This allowed the hackers access to Target's external vendor billing system. From there they were able to gain access to customer credit card numbers (like mine, apparently...). It is still unclear how it was even possible to make the leap from the vendor billing system to individual consumers' credit card data.

At this time I'm not going to dive into every possible improvement to be learned from this. The higher-level point is that while a product must have high availability, there are strong negative consequences to being available to the wrong people. The Target breach is a high-profile example. However, there are tons of little ones - stories of certain people's cloud-based pictures or emails briefly becoming publicly visible, hackers selling lists of social security numbers, etc. Just as faith in a product can be damaged by a lack of availability, so too can it be damaged if the wrong people have access to data.




Thursday, February 13, 2014

Hey Look We're Famous

In my first post I mentioned that sometimes six nines of availability isn't enough. So it was neat to see the product I primarily work on mentioned in the following tweet:

That said, as one of the many people who obsesses about maximizing availability, I work with the realization that one outage is one outage too many.

Wednesday, February 5, 2014

Availability Metrics

How reliable is your product?

Really reliable.

How often does it suffer an outage?

Never ever ever ever! It's super-duper reliable!

How do you know?


Back when I first began my career, in another millennium, six sigma was becoming a big buzzword. As a co-op at Pratt & Whitney I took my first class in it, and I've received training in it throughout my career.

The process has its detractors, some pointing out it's just a method of formalizing some common sense techniques. In my mind its two endpoints - the final goal of predictable results and the starting point of needing data - are both critical when it comes to availability. Now being down all the time is certainly one way to achieve predictable results but I think we can safely assume that we'd rather have a whole bunch of nines when it comes to availability.

To get to that point you need to know where you are and where you've been. You don't know how available your product is unless you are measuring it. You don't know areas of vulnerability unless you keep track of what areas experience failures. You may develop a "gut feel" for problem areas but one should have data to back that up.

Various availability metrics require you to measure the performance of your product. Off the top of my head, here are some of the things one would want to measure in an enterprise storage product when measuring and improving availability.
  • Number of installations
  • Number of downtime events
  • Dates of downtime events
  • Duration of downtime events
  • Hours of run-time
  • Versions of software/hardware being used at downtime events
  • Trigger/stimulus that caused downtime events
  • Component responsible for downtime events
  • Fixes that went into various releases
When you're performing your own testing internally, getting these numbers is not particularly difficult. However, the true measure of availability comes when a product is being used by customers in their actual use cases. This is where it behooves you to have a good relationship with your customers so you can get this valuable data from them. Of course, in the enterprise it is quite likely that a vendor and customer will have service agreements which can make much of this automatic.

The first thing these numbers can give us is a snapshot of our product's quality. We can get raw numbers for availability, mean times between failures, outage durations, etc.
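As a back-of-the-envelope illustration (with made-up numbers and a much-simplified model that ignores the fleet changing over time), turning a list of downtime events into availability and MTBF figures might look something like this:

```python
from collections import Counter

# Hypothetical fleet: 200 installed systems running for a full year.
installations = 200
hours_per_system = 24 * 365
total_run_hours = installations * hours_per_system

# Hypothetical downtime events: (duration in hours, responsible component)
events = [(0.5, "switch"), (2.0, "software"), (0.25, "software"), (1.0, "power")]

total_downtime = sum(duration for duration, _ in events)
availability = 1 - total_downtime / total_run_hours
mtbf_hours = total_run_hours / len(events)

print(f"availability: {availability:.6%}")     # e.g. 99.999786%
print(f"MTBF: {mtbf_hours:,.0f} hours")        # mean run-hours between failures

# A simple pattern check: which components are driving the outages?
print(Counter(component for _, component in events).most_common())
```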

We can also leverage this data to help make decisions internally. If patterns emerge in typical triggers and/or responsible components, they can point the way to where improvements are needed. If a new release sees a spike in outages, that points to the need for decisions like pulling the release and/or releasing a patch. And knowing where various fixes have been made, coupled with knowing customer use cases, can indicate which customers will benefit most from performing a software upgrade.

Though it is a delicate matter, sharing metrics with customers helps them make their own decisions in setting up their datacenters. Knowing what components are most vulnerable will illustrate where redundancy is required. And of course such knowledge will play a large part in what products they choose to purchase and what they are willing to pay.

Tuesday, January 28, 2014

Boom - How Do You Protect Data if the Datacenter Ceases to Exist?


Sometimes corporate videos can be a bit on the cheesy side, but I really like the one I'm posting above, as it allows me to explain to my family one of the main things the product I work on, VPLEX, does. It also gets some extra credit since, early in my career with EMC, I had the opportunity to work with Steve Todd, one of the people in the video.

One of the ultimate problems in protecting data is what to do if the datacenter actually ceases to exist. In my career I've seen this happen for many reasons. On the low end, there is the problem of power outages or maintenance windows which temporarily take a datacenter offline - the datacenter ceases to exist for a finite amount of time. On the more extreme end, the datacenter can cease to exist permanently. Extreme weather, often accompanied by flooding, is one culprit that takes out datacenters. Similarly, bolts of lightning have destroyed datacenters. Though the loss of data pales next to the loss of life, the September 11 terrorist attacks did illustrate how deliberate acts of destruction could also affect data. And as I mentioned in my first post, loss of data can in some cases lead to loss of life. While researching this topic, I found an article at PC World quoting Leib Lurie, CEO of One Call Now:
The 9/11 attacks "geared people toward a completely different way of thinking," Lurie said. "Everyone has always had backup and colocation and back-up plans, every large company has. After 9/11 and [Hurricane] Katrina and the string of other things, even a three-person law firm, a three-person insurance agency, a doctor with his files, if your building gets wiped out and you have six decades of files, not only is your business gone, not only is your credibility gone, but you're putting hundreds of lives at risk." 
The loss of a doctor's records could be fatal in some cases, and with the loss of a law firm's records, "you could have people tied in knots legally until you find alternative records, if you find them," Lurie said.

There are various options a storage administrator can employ to protect against the loss of a datacenter. At the very least there would need to be a backup stored off-site, with backup methods ranging from periodic snapshots to continuous data replication. And, amazingly enough, EMC provides solutions for both these options, with Mozy and RecoverPoint. (Hey, I warned you that while I'm not writing on behalf of my employer, I'm still a fan.) Mozy is geared more for the PC and Mac environment, taking periodic snapshots and backing them up to the Mozy-provided cloud, whereas RecoverPoint uses journaling to keep track of every single write, allowing for very precise rollbacks.

These options allow for disaster recovery. If the disaster occurs you will experience an outage, albeit one you can recover from. As my family's IT manager I find that suits our needs very well - when my wife replaced her laptop we simply told Mozy to restore to a new laptop. I myself tend to be more in the cloud full-time and use Google Drive as my main storage, giving me replication and the ability to recover.

However, while recovering from an outage without loss of data (or with minimal loss with some solutions) is fantastic, many enterprise solutions need continuous availability. Recovery from a disaster is not sufficient. I know I would have been rather annoyed if my bank told me my information was not available while they recovered from backup in the aftermath of one of the many storms that have hit us here in Massachusetts over the past several years. That's where a solution which allows for continuous availability, even in the event of the destruction of a datacenter, becomes essential. That's one of the things that my product, VPLEX, does - you are able to mirror writes to two sites separated by substantial distances. And just as importantly, it is possible to read the same data from either datacenter. If one datacenter ceases to exist, the other one is still up and is able to continue operating (as the rather dramatic video at the beginning of this post illustrates). And for even more protection, many customers combine the RecoverPoint and VPLEX products, allowing for both rollback and continuous availability.

All of this comes at a cost, making users balance their availability needs vs. their budget. Not all applications need continuous availability. But those that do need to be able to endure a wide variety of potential problems, ranging from maintenance windows all the way to the datacenter destruction described here. And providing these solutions presents challenges a vendor must address. For example, some of the most obvious include:

  • When does an availability solution tell a host that a data write has completed? This question becomes more and more critical as the latency between sites increases.
  • What does an availability solution do if a remote site cannot be seen? How does it determine if the remote site has had an outage or if the communications link has been severed?
  • How is re-synchronization handled when two or more separated sites are brought back together?
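To give a flavor of the first question, here's a hedged sketch (a hypothetical API, and not a description of how VPLEX or any other product actually does it) of a synchronous mirroring policy in which the host is acknowledged only after both sites have committed the write - which is exactly why inter-site latency matters so much:

```python
import concurrent.futures
import time

class Site:
    def __init__(self, name, latency_s):
        self.name = name
        self.latency_s = latency_s   # round trip to this site

    def commit_write(self, block, data):
        time.sleep(self.latency_s)   # stand-in for link latency plus disk commit
        return f"{self.name} committed block {block}"

def mirrored_write(sites, block, data, timeout_s=5.0):
    """Acknowledge the host only when every site has committed the write.

    If a site does not respond within the timeout, we cannot tell from here
    whether the site is down or the link is severed - that decision usually
    falls to an independent witness/arbiter, which this sketch omits.
    """
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(s.commit_write, block, data) for s in sites]
        done, not_done = concurrent.futures.wait(futures, timeout=timeout_s)
        if not_done:
            raise TimeoutError("a site did not respond; invoke arbitration")
    return "ack to host"

sites = [Site("site-A", 0.001), Site("site-B", 0.040)]  # 40 ms to the remote site
print(mirrored_write(sites, block=123, data=b"payload"))
```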


Monday, January 27, 2014

What is an Outage?

One of the topics I'm going to be talking about is that of metrics. In order to have any hope of knowing a product's availability, you've got to measure how often it goes down and for how long. You've got to know what the product's install base is at any given time. These are topics I'm going to delve into as this blog matures.

But let's start with what might seem simple - the question "what is an outage?" Sometimes the answer is very obvious. If a SAN switch just goes offline then that switch has clearly experienced an outage.

On the other hand, sometimes the answer is less obvious. For example, consider the following typical arrangement:


You've got your host connected to a storage array via two switches, one of which has failed. On the host is some multipathing software which routes any I/O request to the storage array through one of the two switches. If your product is responsible for providing the multipathing on the host, then responsibility for an outage does not rest solely with the switch vendor - indeed, a customer may view it as resting solely with the provider of the multipathing software, particularly if they had planned their environment around the fact that these hypothetical switches have lower availability.

Let's add some complexity to the mix. What if the multipathing software is working fine but the customer did not configure it correctly? For example, suppose it was set up to use only one path and would only fail over to the other path upon a manual request. This brings to mind one of my least favorite phrases - "customer error". But one must be extremely unwilling to make that the root cause for any outage. Was the multipathing documentation clear? Did it alert the user to the fact that one failure could cause an outage?

And consider taking it to a greater extreme. Suppose the multipathing software is configured perfectly and, after the switch fails, all I/O is routed to the other switch. But then suppose a few hours later the other switch fails too. Is the multipathing software absolved? Not necessarily. Did the multipathing software make it clear to the user that it was one failure away from unavailability? And how it is made clear is vital. Is that information buried in some log, or is it an alarm that screams for an administrator's attention?
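A toy sketch of that idea (hypothetical, and far simpler than real multipathing software): route I/O down any healthy path, and escalate loudly the moment only one path remains.

```python
class Path:
    def __init__(self, name):
        self.name = name
        self.healthy = True

class Multipather:
    def __init__(self, paths, alert):
        self.paths = paths
        self.alert = alert          # callback that "screams" at an administrator

    def send_io(self, request):
        healthy = [p for p in self.paths if p.healthy]
        if not healthy:
            self.alert("OUTAGE: no paths remain to the storage array")
            raise IOError("all paths down")
        if len(healthy) == 1:
            # Don't bury this in a log: the system is now one failure away
            # from data unavailability.
            self.alert(f"DEGRADED: only path {healthy[0].name} remains")
        return f"sent {request} via {healthy[0].name}"

paths = [Path("switch-1"), Path("switch-2")]
mp = Multipather(paths, alert=print)

print(mp.send_io("write #1"))       # both paths healthy, no alert
paths[0].healthy = False            # the first switch fails
print(mp.send_io("write #2"))       # I/O still flows, but an alert fires
```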

At the end of the day, an outage really is defined as "whatever the customer says it is". And those who are truly working at maximizing availability will go beyond even that definition. You want to provide a product that a customer not only doesn't worry about but that actively relieves worry, secure in the knowledge that your product is there. When you are providing a product at the enterprise level, any failure of your product has consequences. At the very least, the person who signed off on purchasing your product may have his or her job jeopardized. Beyond that, money can be lost, power grids can go offline, organizations can be unable to operate, and people can actually die.

Thursday, January 23, 2014

Welcome

Welcome to my professional corner of the internet. And who am I?

I'm Dan Stack, a software engineer, currently employed by EMC Corporation. And first things first, while I intend this blog to deal with "professional" topics in the world of storage, virtualization, etc., the opinions in this blog are my own. I am not representing my employer and am not posting on my employer's behalf. Though I'm clearly going to be somewhat biased. I like the products I work on, the people I work with, and the job that I do.

So, to quote Office Space, "What would ya say...ya do here?" I've got a few roles at present. I'm a software engineer on our VPLEX platform. Specifically, I am both our Defect Czar and our Software DU/DL Lead. I'll likely go into more detail over time about what that means. But the short form is that one of my main jobs is to serve as our conscience, to be the voice of the customer, and to help drive software architecture decisions with that in mind. I sit in an unusual space between software development, quality engineering, and customer support. Bringing me to this point are years and years of experience in software development along with experience in advanced customer support.

Over the years I've become very passionate about the quality, reliability, and availability of the products I've worked on. I've grown to feel that one of my least favorite phrases is "customer error". For some in the development world it's a difficult journey - sometimes you look at an error that occurs in the field and you wonder "what were they thinking?" And typically, the answer is "something you didn't account for". I imagine it's impossible to make a 100% bulletproof product, but I do believe that's a worthy goal to aspire towards. An example of this in another industry can be seen in improvements made to automobiles. One could argue pretty reasonably that it is not the responsibility of a vehicle to keep its driver awake. But given the danger of drivers falling asleep, there have been advances in automotive safety to detect drowsy drivers. Sure, this is correcting the driver's "fault". But a quick search of the internet will find several studies indicating that driver drowsiness is at fault for a significant number of automobile accidents.

Can someone die from a datacenter outage? Almost certainly. Consider how advances in technology have given rise to computerized monitoring of patients. Now consider all the monitors in a hospital intensive care unit or neonatal care unit going offline.

Without even trying very hard I could think of tons of examples of real-world impacts from datacenter outages, ranging from inconvenient (an MMO game going offline) to financially disastrous (lost or erroneous transactions) to deadly (hospital outages). And this doesn't even begin to cover malicious intent, as can be seen in the many news stories of stolen credit cards.

In this blog I hope to talk about these kinds of issues. What can go wrong. What can be done to protect users. What can be learned from these sorts of events. What are some success stories. My own area of expertise is in virtualization but I'm going to try to not limit myself to that. One thing I won't be doing is divulging any proprietary information that I have access to. If I ever go beyond the hypothetical in issues that I've worked with personally I'll be certain to use only publicly available information.

I don't view this as a blog likely to be updated on a daily basis. I'm going to shoot for an update every few weeks, though I imagine I'll be updating a bit more frequently early on. There might be a little jumpiness at first as well, as I find my voice in this blog (I've done, and continue to do, other blogs, though those have been more hobby/personal-based).

Why the title six nines? What that means is your system will be up (and available!) 99.9999% of the time. In a typical year you could expect about 30 seconds of downtime. It's often considered a golden number, but in some applications even that is insufficient. There are other measures that I'll be discussing, such as mean time between failures, outage duration, etc. (For example, if there's 30 seconds of downtime in a typical year, is that all at once, six five-second outages, one minute over two years, etc.?)
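For the curious, the arithmetic behind that 30-second figure (and its neighbors) is simple enough to sketch out:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

for nines in range(2, 7):
    availability = 1 - 10 ** -nines           # e.g. six nines -> 0.999999
    downtime = SECONDS_PER_YEAR * (1 - availability)
    print(f"{availability:.6%} -> {downtime:,.1f} seconds of downtime per year")

# 99.999000% -> ~315.6 seconds (about five minutes)
# 99.999900% -> ~31.6 seconds  (the "six nines" figure quoted above)
```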