Thursday, January 23, 2014

Welcome

Welcome to my professional corner of the internet. And who am I?

I'm Dan Stack, a software engineer, currently employed by EMC Corporation. And first things first, while I intend this blog to deal with "professional" topics in the world of storage, virtualization, etc., the opinions in this blog are my own. I am not representing my employer and am not posting on my employer's behalf. Though I'm clearly going to be somewhat biased. I like the products I work on, the people I work with, and the job that I do.

So, to quote Office Space, "What would ya say...ya do here?" I've got a few roles at present. I'm a software engineer on our VPLEX platform. Specifically I am both our Defect Czar and our Software DU/DL Lead. I'll likely go into more detail over time about what that means. But the short form is that one of my main jobs is to serve as our conscience, to be the voice of the customer, and to help drive software architecture decisions with that in mind. I sit in an unusual space between software development, quality engineering, and customer support. What brought me to this point is years and years of experience in software development, along with experience in advanced customer support.

Over the years I've become very passionate about the quality, reliability, and availability of the products I've worked on. One of my least favorite phrases has become "customer error". For some in the development world it's a difficult journey - sometimes you look at an error that occurs in the field and you wonder "what were they thinking?" And typically, the answer is "something you didn't account for". I imagine it's impossible to make a 100% bulletproof product, but I do believe that it's a worthy goal to aspire towards. An example of this in another industry can be seen in improvements made to automobiles. One could argue pretty reasonably that it is not the responsibility of a vehicle to keep its driver awake. But given the danger of drivers falling asleep, there have been advances in automotive safety to detect drowsy drivers. Sure, this is correcting the driver's "fault". But a quick search of the internet will find several studies that indicate driver drowsiness/sleeping is at fault for a significant number of automobile accidents.

Can someone die from a datacenter outage? Almost certainly. Consider how advances in technology have given rise to computerized monitoring of patients. Now consider all the monitors in a hospital intensive care unit or neonatal care unit going offline.

Without even trying very hard I could think of tons of examples of real-world impacts from datacenter outages, ranging from inconvenient (an MMO game going offline) to a financial disaster (lost or erroneous transactions) to deadly (hospital outages). And this doesn't even begin to cover malicious intent, as can be seen in many news stories of stolen credit cards.

In this blog I hope to talk about these kinds of issues. What can go wrong. What can be done to protect users. What can be learned from these sorts of events. What are some success stories. My own area of expertise is in virtualization but I'm going to try to not limit myself to that. One thing I won't be doing is divulging any proprietary information that I have access to. If I ever go beyond the hypothetical in issues that I've worked with personally I'll be certain to use only publicly available information.

I don't view this as a blog likely to be updated on a daily basis. I'm going to shoot for an update every few weeks, though I imagine early on I'll be updating a bit more frequently. I imagine there might be a little jumpiness early on as well, as I find my voice in this blog (I've done and continue to do other blogs, though those have been more hobby/personal-based).

Why the title six-nines? What that means is your system will be up (and available!) 99.9999% of the time. In a typical year you could expect about 30 seconds of downtime. It's often considered a golden number, but in some applications even that is insufficient. There are other measures that I'll be discussing, such as mean time between failures, outage duration, etc. (For example, if there are 30 seconds of downtime in a typical year, is that all at once, six five-second outages, one minute over two years, etc.?)
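
For the curious, here's a quick back-of-the-envelope sketch in Python of where that "about 30 seconds" comes from. The function name and the choice of a 365.25-day year are my own assumptions for illustration; a different year length shifts the result slightly. With these assumptions six nines works out to roughly 31.6 seconds per year.

```python
# Assumption: an "average" year of 365.25 days, to account for leap years.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Return the downtime per year allowed by an availability of N nines.

    Three nines means 99.9% available, six nines means 99.9999%, and so on.
    """
    unavailability = 10.0 ** -nines  # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


# Print a small table from two nines (99%) through six nines (99.9999%).
for n in range(2, 7):
    availability_pct = 100 * (1 - 10.0 ** -n)
    seconds = downtime_seconds_per_year(n)
    print(f"{availability_pct:.{n - 2}f}% available: {seconds:,.1f} seconds of downtime per year")
```

Note that this only gives the total budget. It says nothing about how that downtime is distributed, which is exactly why the other measures matter.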

1 comment:

  1. Great "hello world" post, Dan! I'm looking forward to seeing what you add here in the future. I see great potential :-)
