Monday, August 18, 2014

Failing Safe

How does a system handle an "impossible" scenario? My background is originally in software development, and it's a question the teams I've worked on have often faced.

I'm going to do a little software explaining here, but I'll endeavor to keep things fairly general. Software functions are designed to perform certain tasks with different parameters passed to them each time. For example, a device driver writing to a traditional hard disk drive will need to know things like the sector to write to, where the data to be written resides, and so on. One of the first things any function will do is validate the parameters it has been given. Often this is done with assertion statements: you assert that a parameter must meet certain conditions - for example, that the data being written comes from a valid place. If it doesn't, the program halts execution. That's great in a development environment - it stops things at the moment a problem occurs, allowing a developer to fix it right away. Sometimes when software ships the assertions are turned off; other times they are left on under the assumption that these conditions are impossible, and that if the software somehow reaches them it is better to shut down than to continue operating.
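As a rough sketch of what that looks like in practice, here is a minimal C example (the function and constant names are hypothetical, not from any real driver). The standard assert macro from <assert.h> halts a debug build the moment a bad parameter appears, and compiling with NDEBUG defined removes the checks entirely - which is how assertions are commonly "turned off" for shipping builds:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SECTOR 1000000u   /* hypothetical disk size, in sectors */

/* Illustrative driver-level write routine. */
int hdd_write(uint64_t sector, const void *buf, size_t len)
{
    /* Validate parameters up front. In a debug build a violation
     * halts execution right here; with -DNDEBUG these lines
     * compile away to nothing. */
    assert(buf != NULL);         /* data must come from a valid place */
    assert(len > 0);
    assert(sector < MAX_SECTOR); /* target must actually be on the disk */

    /* ... perform the actual write ... */
    return 0;
}
```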

This is something that is difficult to balance. The example I gave above is trivial - handling errors in that case is easy. What is often the case, however, is that you are dealing with a complicated system where function A calls function B, which calls function C, which calls function D, all the way down to function Z. And there are numerous ways to get to function Z. The same parameter has been checked in all the preceding functions - and it was valid. Suddenly it is invalid. Often this happens due to a coding or design error - for example, some asynchronous process stomps on the parameter, dodging the protective measures designed to prevent exactly that.
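To make that failure mode concrete, here is a hedged C sketch (every name is hypothetical) of an asynchronous path stomping on a field that the main call chain believed it had already validated:

```c
#include <pthread.h>
#include <stddef.h>

/* A request whose buffer pointer is supposed to be protected by lock. */
struct request {
    pthread_mutex_t lock;
    void *buf;              /* checked non-NULL at the top of the chain */
};

void function_z(struct request *req)
{
    /* Functions A through Y all saw a valid req->buf. But if the
     * timeout handler below ran in the meantime, req->buf is now
     * NULL - the "impossible" state. */
    (void)req;
}

/* Runs on another thread. The bug: it clears the field without
 * taking req->lock, dodging the protective measure. */
void timeout_handler(struct request *req)
{
    req->buf = NULL;
}
```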

I don't want to trivialize how difficult these problems are to discover. These "impossible" issues typically require a very convoluted sequence of events to manifest and are even more difficult to reproduce. There are many methodologies, both established and still being developed, designed to prevent these issues from being introduced or to catch them when they are.

This long-winded prelude is not about the effort of preventing and catching these types of issues - something any product team should absolutely be making. What I am talking about here is what to do in those cases where you miss some permutation and land in an "impossible" situation. What is the best thing to do then?

Under such circumstances, the most reliable systems fail in the safest possible way. In a best-case scenario, the system will simply fail a single I/O operation. If that is not possible, it should fail in the least destructive manner available. It would be better, for example, to take a single volume down than to take down an entire array, and better to fail a single blade than an entire blade server. A good metaphor (one now more associated with computer networking than with its original meaning) is the firewall. A physical firewall, while not able to put out a fire, is designed to stop it from spreading further.
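As a minimal sketch of that principle at the smallest scope (names again hypothetical, continuing the earlier disk-write example), a release build can reject a single bad I/O and report it, rather than asserting and taking the whole system down:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_SECTOR 1000000u   /* hypothetical disk size, in sectors */

int hdd_write_safe(uint64_t sector, const void *buf, size_t len)
{
    if (buf == NULL || len == 0 || sector >= MAX_SECTOR) {
        /* Log loudly so the defect remains discoverable... */
        fprintf(stderr, "hdd_write_safe: rejecting invalid I/O (sector=%llu)\n",
                (unsigned long long)sector);
        /* ...but contain the damage to this one operation. */
        return -EIO;
    }
    /* ... perform the actual write ... */
    return 0;
}
```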

While this may seem self-evident, realizing it typically means developing additional software whose main purpose is to fence off problems introduced by the software itself - a concept that may seem a bit odd. Essentially, the software enters a mode designed to get back to a point of stability while doing the least possible damage.

One way of mitigating this challenge is to not ask the software to do the impossible, or even to fully recover without outside intervention. For example, if a storage volume needs to be taken offline due to an "impossible" situation, it would be nice if it could be brought back up automatically, but the main goal in failing safe is to minimize the damage - for example, not taking other volumes down or failing the entire system. In such cases it may take manual support intervention to bring that volume back online. Not great, but far preferable to no storage being available at all.
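A hedged sketch of that "fence it off and wait for a human" response might look like this in C (the types and functions are illustrative, not from any real storage stack):

```c
#include <stdbool.h>
#include <stdio.h>

struct volume {
    int  id;
    bool offline;   /* sticky: set on an impossible state, cleared only
                     * by an explicit administrative action */
};

/* Called when a volume hits an "impossible" state: fence off just
 * this volume so the rest of the array keeps serving I/O. */
void volume_fail_safe(struct volume *v, const char *reason)
{
    v->offline = true;
    fprintf(stderr, "volume %d taken offline: %s (manual intervention required)\n",
            v->id, reason);
}

/* Invoked only by a support/admin tool once the problem is understood. */
void volume_admin_online(struct volume *v)
{
    v->offline = false;
}

int volume_submit_io(struct volume *v)
{
    if (v->offline)
        return -1;  /* this volume is fenced; other volumes are unaffected */
    /* ... normal I/O path ... */
    return 0;
}
```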

Some imagination is required when developing such protective measures. To use another physical metaphor, consider the loss of RMS Titanic. She did have bulkheads designed to handle the possibility of flooding. However, they were inadequate: they did not extend high enough, so water from one flooded compartment was able to spill over into the next.

As I wrote earlier, it is far better for these situations never to arise, and that should be the goal. However, in my many years in the software industry I have yet to encounter a perfect product. As such, systems should be designed to fail in the least destructive way possible.


Image Credits:
Nuke Image - Copyright: Eraxion / 123RF Stock Photo

"Titanic side plan annotated English" by Anonymous - Engineering journal: 'The White Star liner Titanic', vol.91.. Licensed under Public domain via Wikimedia Commons.
