Six Nines: January 2014

Tuesday, January 28, 2014

Boom - How Do You Protect Data if the Datacenter Ceases to Exist?

Sometimes corporate videos can be a bit on the cheesy side but I really like the one I'm posting above as it allows me to explain to my family one of the main things the product I work on, VPLEX, does. It also gets some extra-credit as early in my career with EMC I had the opportunity to work with Steve Todd, one of the people in the video.

One of the ultimate problems in protecting data is what to do if the datacenter actually ceases to exist. In my career I've seen this happen for many reasons. On the low-end, there is the problem of power outages or maintenance windows which temporarily take a datacenter offline - the datacenter ceases to exist for a finite amount of time. On the more extreme end the datacenter can cease to exist permanently. Extreme weather, often accompanied by flooding, is one culprit that takes out datacenters. Similarly, bolts of lightning have destroyed datacenters. Though the loss of data pales next to the loss of life, the September 11 terrorist attacks did illustrate how deliberate acts of destruction could also affect data. And as I mentioned in my first post, loss of data can in some cases lead to loss of life. While researching this topic, I found an article at PC World quoting Leib Lurie, CEO of One Call Now:

The 9/11 attacks "geared people toward a completely different way of thinking," Lurie said. "Everyone has always had backup and colocation and back-up plans, every large company has. After 9/11 and [Hurricane] Katrina and the string of other things, even a three-person law firm, a three-person insurance agency, a doctor with his files, if your building gets wiped out and you have six decades of files, not only is your business gone, not only is your credibility gone, but you're putting hundreds of lives at risk."

The loss of a doctor's records could be fatal in some cases, and with the loss of a law firm's records, "you could have people tied in knots legally until you find alternative records, if you find them," Lurie said.

There are various options that a storage administrator can take to protect a datacenter. At the very least there would need to be a backup stored off-site, with backup methods ranging from periodic snapshots to continuous data replication. And, amazingly enough, EMC provides solutions for both these options, with Mozy and RecoverPoint. (Hey, I warned you that while I'm not writing on behalf of my employer I'm still a fan.) Mozy is geared more for the PC and Mac environment, taking periodic snapshots and backing them up to the Mozy-provided cloud, whereas RecoverPoint uses journaling to keep track of every single write, allowing for very precise rollbacks.

These options allow for disaster recovery. If the disaster occurs you will experience an outage, albeit one you can recover from. As my family's IT manager I find that suits our needs very well - when my wife replaced her laptop we simply told Mozy to restore to a new laptop. I myself tend to be more in the cloud full-time and use Google Drive as my main storage, giving me replication and the ability to recover.

However, while recovering from an outage without loss of data (or minimal loss with some solutions) is fantastic, many enterprise solutions need continuous availability. Recovery from a disaster is not sufficient. I know I would have been rather annoyed if my bank told me my information was not available while they recovered from backup in the aftermath of one of the many storms that have hit us here in Massachusetts over the past several years. That's where a solution which allows for continuous availability, even in the event of the destruction of a datacenter, becomes essential. That's one of the things that my product, VPLEX, does - you are able to mirror writes to two sites separated by substantial distances. And just as importantly, it is possible to read the same data from either datacenter. If one datacenter ceases to exist, the other one is still up and is able to continue operating (as the rather dramatic video at the beginning of this post illustrates). And for even more protection, many customers combine the RecoverPoint and VPLEX products, allowing for both rollback and continuous data protection.

All of this comes at a cost, making users balance their availability needs vs. their budget. Not all applications need continuous availability. But those that do need to be able to endure a wide variety of potential problems, ranging from maintenance windows all the way to the datacenter destruction described here. And providing these solutions presents challenges a vendor must address. For example, some of the most obvious include:

When does an availability solution tell a host that a data write has completed? This question becomes more and more critical as the latency between sites increases.
What does an availability solution do if a remote site cannot be seen? How does it determine if the remote site has had an outage or if the communications link has been severed?
How is re-synchronization handled when two or more separated sites are brought back together?
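
The first question above can be made a little more concrete with a toy model. This is only a sketch under stated assumptions, not a description of how VPLEX or any other product behaves: the function name and all of its parameters are hypothetical, and it assumes the local and remote writes are issued in parallel. A synchronous mirror acknowledges the host only once both sites hold the data, while an asynchronous one acknowledges after the local write alone.

```python
def ack_latency_ms(local_write_ms: float,
                   remote_write_ms: float,
                   round_trip_ms: float,
                   synchronous: bool) -> float:
    """Toy estimate of how long a host waits for a write acknowledgement.

    Hypothetical model: local and remote writes start at the same time.
    """
    if not synchronous:
        # Asynchronous: acknowledge once the local copy is written.
        # The remote copy lags, so a site loss can lose recent writes.
        return local_write_ms
    # Synchronous: acknowledge only when both copies are written.
    # The remote leg pays the inter-site round trip on top of its own write.
    return max(local_write_ms, round_trip_ms + remote_write_ms)


if __name__ == "__main__":
    for rtt in (1.0, 5.0, 20.0):
        sync = ack_latency_ms(1.0, 1.0, rtt, synchronous=True)
        print(f"round trip {rtt:>4} ms -> synchronous ack after {sync} ms")
```

Even in this simplified form, the synchronous acknowledgement time grows directly with the round trip between sites, which is why the question gets harder as distance increases.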

Monday, January 27, 2014

What is an Outage?

One of the topics I'm going to be talking about is that of metrics. In order to have any hope of knowing a product's availability you've got to measure how often it goes down and for how long. You've got to know what the product's install base is at any given time. These are topics I'm going to delve into as this blog matures.
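
As a rough illustration of how those two measurements combine, here is a minimal sketch. It makes simplifying assumptions that are not from the post: the install base is treated as constant over the measurement period, and the `Outage` record and `fleet_availability` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    system_id: str         # hypothetical identifier for one installed system
    duration_hours: float  # how long that system was unavailable


def fleet_availability(installed_systems: int,
                       period_hours: float,
                       outages: Iterable[Outage]) -> float:
    """Fraction of all system-hours in the period during which systems were up.

    Assumes a constant install base for the whole period (a simplification).
    """
    if installed_systems <= 0 or period_hours <= 0:
        raise ValueError("install base and period must be positive")
    total_hours = installed_systems * period_hours
    down_hours = sum(o.duration_hours for o in outages)
    return 1.0 - down_hours / total_hours


if __name__ == "__main__":
    # 1,000 systems observed for 30 days, with two outages in the field.
    observed = [Outage("sys-17", 0.5), Outage("sys-402", 0.25)]
    print(f"{fleet_availability(1000, 30 * 24, observed):.7%}")
```

A real measurement would also have to track systems entering and leaving the install base during the period, which is exactly why knowing the install base "at any given time" matters.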

But let's start with what might seem simple - the question "what is an outage?" Sometimes the answer is very obvious. If a SAN switch just goes offline then that switch has clearly experienced an outage.

On the other hand, sometimes the answer is less obvious. For example, consider the following typical arrangement:

You've got your host connected to a storage array via two switches, one of which has failed. On the host is some multipathing software which routes any I/O request to the storage array through one of the two switches. If your product was responsible for providing multipathing on the host, then responsibility for the outage does not lie solely with the switch vendor - indeed, a customer may view it as lying solely with the provider of the multipathing software if they had planned their environment around the fact that these hypothetical switches have lower availability.

Let's add some complexity to the mix. What if the multipathing software is working fine but the customer did not configure it correctly? For example, suppose that it was set up to use only one path and would fail over to the other path only upon a manual request. This brings to mind one of my least favorite phrases - "customer error". But one must be extremely unwilling to make that the root cause for any outage. Was the multipathing documentation clear? Did it alert the user to the fact that one failure could cause an outage?

And consider taking it to a greater extreme. Suppose the multipathing software is configured perfectly and after the switch fails all I/O is routed to the other switch. But then suppose a few hours later the other switch fails too. Is the multipathing software absolved? Not necessarily. Did the multipathing software make it clear to the user that it is one failure away from unavailability? And making it clear is vital. Is that information embedded in some log or is it in some alarm that screams for an administrator's attention?
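
The scenario above can be sketched in a few lines. This is a toy model rather than any real multipathing product: the `MultipathRouter` class, its method names, and the alert wording are all hypothetical. The point it illustrates is that failing over quietly is not enough, and that the loss of redundancy should itself raise an alert.

```python
class NoPathAvailable(Exception):
    """Raised when every path to the storage array has failed."""


class MultipathRouter:
    """Toy multipathing layer that routes I/O and reports lost redundancy."""

    def __init__(self, paths):
        self.healthy = {path: True for path in paths}
        self.alerts = []  # stand-in for an alarm an administrator would see

    def healthy_paths(self):
        return [path for path, ok in self.healthy.items() if ok]

    def mark_failed(self, path):
        if path not in self.healthy:
            raise KeyError(f"unknown path: {path}")
        self.healthy[path] = False
        remaining = self.healthy_paths()
        if len(remaining) == 1:
            # Still serving I/O, but one failure away from unavailability.
            self.alerts.append(
                f"CRITICAL: only {remaining[0]} remains; "
                "one more failure causes an outage")
        elif not remaining:
            self.alerts.append("OUTAGE: no paths remain to the storage array")

    def route_io(self):
        """Return the path the next I/O request would use."""
        remaining = self.healthy_paths()
        if not remaining:
            raise NoPathAvailable("all paths to the storage array have failed")
        return remaining[0]


if __name__ == "__main__":
    router = MultipathRouter(["switch_a", "switch_b"])
    router.mark_failed("switch_a")
    print(router.route_io())  # I/O continues through the surviving switch
    print(router.alerts[-1])  # and the administrator is told about the risk
```

Whether that alert lands in a log nobody reads or in an alarm somebody acts on is the difference the paragraph above is pointing at.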

At the end of the day, an outage really is defined as "whatever the customer says it is". And those who are truly working at maximizing availability will go beyond even that definition. You want to provide a product that a customer not only doesn't worry about but that actually relieves worry, leaving them secure in the knowledge that your product is there. When you are providing a product at the enterprise level, any failures of your product have consequences. At the very least, the person who signed off on purchasing your product may have his or her job jeopardized. Beyond that, money can be lost, power grids can go offline, organizations can be unable to operate, and people can actually die.

Thursday, January 23, 2014

Welcome

Welcome to my professional corner of the internet. And who am I?

I'm Dan Stack, a software engineer, currently employed by EMC Corporation. And first things first, while I intend this blog to deal with "professional" topics in the world of storage, virtualization, etc., the opinions in this blog are my own. I am not representing my employer and am not posting on my employer's behalf. Though I'm clearly going to be somewhat biased. I like the products I work on, the people I work with, and the job that I do.

So, to quote Office Space, "What would ya say...ya do here?" I've got a few roles at present. I'm a software engineer on our VPLEX platform. Specifically I am both our Defect Czar and our Software DU/DL Lead. I'll likely go into more detail over time about what that means. But the short form is that one of my main jobs is to serve as our conscience, to be the voice of the customer, and to help drive software architecture decisions with that in mind. I sit in an unusual space between software development, quality engineering, and customer support. Bringing me to this point, I've had years and years of experience in software development along with experience in advanced customer support.

Over the years I've become very passionate about the quality, reliability, and availability of the products I've worked on. I've grown to feel that one of my least favorite phrases is "customer error". For some in the development world it's a difficult journey - sometimes you look at an error that occurs in the field and you wonder "what were they thinking?" And typically, the answer is "something you didn't account for". I imagine it's impossible to make a 100% bulletproof product, but I do believe that that's a worthy goal to aspire towards. An example of this in another industry can be seen in improvements made to automobiles. One could argue pretty reasonably that it is not the responsibility of a vehicle to keep its driver awake. But given the danger of drivers falling asleep, there have been advances in automotive safety to detect drowsy drivers. Sure, this is correcting the driver's "fault". But a quick search of the internet will find several studies that indicate driver drowsiness/sleeping is at fault for a significant number of automobile accidents.

Can someone die from a datacenter outage? Almost certainly. Consider how advances in technology have given rise to computerized monitoring of patients. Now consider all the monitors in a hospital intensive care unit or neonatal care unit going offline.

Without even trying very hard I could think of tons of examples of real-world impacts from datacenter outages, ranging from inconvenient (an MMO game going offline) to a financial disaster (lost or erroneous transactions) to deadly (hospital outages). And this doesn't even begin to cover malicious intent, as can be seen in many news stories of stolen credit cards.

In this blog I hope to talk about these kinds of issues. What can go wrong. What can be done to protect users. What can be learned from these sorts of events. What are some success stories. My own area of expertise is in virtualization but I'm going to try to not limit myself to that. One thing I won't be doing is divulging any proprietary information that I have access to. If I ever go beyond the hypothetical in issues that I've worked with personally I'll be certain to use only publicly available information.

I don't view this as a blog likely to be updated on a daily basis. I'm going to shoot for an update every few weeks, though I imagine early on I'll be updating a bit more frequently. I imagine there might be a little jumpiness early on as well as I find my voice in this blog (I've done and continue to do other blogs, though those have been more hobby/personal-based).

Why the title six-nines? What that means is your system will be up (and available!) 99.9999% of the time. In a typical year you could expect about 30 seconds of downtime. It's often considered a golden number, but in some applications that too is insufficient. There are other measures that I'll be discussing such as mean time between failures, outage duration, etc. (For example, if there's 30 seconds of downtime in a typical year, is that all at once, six five-second outages, one minute over two years, etc.?)
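
The arithmetic behind that "about 30 seconds" figure is simple enough to sketch. The snippet below assumes an average year of 365.25 days; the function name is just an illustrative choice.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, including leap days


def downtime_seconds_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of N nines.

    For example, 6 nines means 99.9999% availability, so the system may be
    unavailable for a fraction 10**-6 of the year.
    """
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 7):
        seconds = downtime_seconds_per_year(n)
        print(f"{n} nines: {seconds:,.1f} seconds of downtime per year")
```

For six nines this works out to roughly 31.6 seconds per year. Note that the figure is only a yearly budget and says nothing about how that downtime is distributed, which is where the other measures come in.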