Sunday, May 25, 2014

Don't Press the Jolly Candy-Like Button


How can he possibly resist the maddening urge to eradicate history at the mere push of a single button? The beautiful, shiny button? The jolly, candy-like button? Will he hold out, folks? Can he hold out?

- Ren & Stimpy - "The Button"




So a customer has had a Data Unavailability Event and you have a root cause. Or better yet, you discover it internally before any customer encounters the problem. (Better still would have been to discover and fix the problem before shipping.)

This brings you to a difficult question - what do you tell your customer base about the problems you know about? Note that nothing I write in the following paragraphs should be construed to mean you should ever keep information from your customers. Rather, these are some thoughts on how you should communicate with them.

First and foremost, the need to do this sort of communication is indicative of a failure somewhere in your release process. I've worked for many companies and have yet to find one that doesn't experience these failures. The better ones, however, have processes in place to learn from these mistakes and improve their quality going forward.

So back to the original premise. What do you tell your customers? And how do you communicate with them? And the answer to both of these is the infamous "it depends".

I've seen the following types of criteria used:
  • Is it avoidable? Certain types of problems are consistently triggered by a certain type of use case under certain (presumably unusual) conditions. There may be a workaround to fulfill the use case without risking the Data Unavailability condition. Since a problem like this is preventable, the communication about it should be loud and clear. In these cases you really should consider pushing the information to customers aggressively.
  • Is it an upgrade issue? I've seen some issues that happen during or after an upgrade. Sometimes this is due to a regression, sometimes due to something in the upgrade process, sometimes due to a new feature which may have an unintended effect. In these conditions you'd want to make certain that customers are aware of the potential issues prior to performing the upgrade. Including a list of these known issues in your upgrade documentation (and being able to update this documentation as new issues are discovered) allows your customers to make an intelligent decision regarding an upgrade. Some customers may be better off staying at an older version until a patch with a fix is available. Others may want to take a preventative action prior to performing the upgrade.
  • Is it a likely issue? This is not to say you should not tell customers about unlikely issues. Rather, likelihood indicates whether or not you should use aggressive broadcasts to make certain they are aware of the issue. If there's no workaround for a likely issue but a patch is available, you absolutely want to issue an alert to your customers.
  • Is this a severe issue? Any outage is severe. But if there is something with potential for a catastrophic outage with a long recovery time then you need to give serious consideration to broadcasting the problem instead of release noting it.
  • Is there a recovery process? After a problem is hit, if there is some process to recover there should be some documentation easily located by customers to tell them how to get out of the situation. Knowledgebase Articles are a popular means of doing this.
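The criteria above can be read as a rough decision procedure. The following is only a sketch of one way to encode them, not a policy anyone should adopt as-is: the `KnownIssue` fields, the `channels` function and the channel names are all hypothetical, and in practice "it depends" will involve more judgment than a handful of booleans.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class KnownIssue:
    # Hypothetical fields, one per criterion discussed above.
    avoidable: bool             # a workaround or avoidance step exists
    upgrade_related: bool       # hit during or after an upgrade
    likely: bool                # customers are likely to hit it
    patch_available: bool       # a patch with a fix is available
    catastrophic: bool          # catastrophic outage, long recovery time
    has_recovery_process: bool  # there are steps to get out of the situation


def channels(issue: KnownIssue) -> list[str]:
    """Return the suggested communication channels, in a fixed order."""
    out = ["release notes"]  # assumption: every known issue is release noted
    if issue.upgrade_related:
        out.append("upgrade documentation")
    if issue.has_recovery_process:
        out.append("knowledgebase article")
    # Broadcast when customers can act on the news (avoid it or patch it)
    # or when the outage would be catastrophic.
    if (
        issue.avoidable
        or issue.catastrophic
        or (issue.likely and issue.patch_available)
    ):
        out.append("customer alert")
    return out
```

Returning a list rather than a single channel reflects that these options are not exclusive: an issue can be release noted, listed in the upgrade documentation and broadcast all at once.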
Even an unlikely problem which has no avoidance mechanism and will automatically recover should be documented. An up-to-date set of release notes is a good place to put such information (often in addition to the above possibilities). What would a customer want to know? Typical items include:
  • What happens if the problem is hit?
  • Under what conditions will the problem be hit?
  • What is the likelihood of hitting the problem? Is it deterministic?
  • What can be done to avoid the problem?
  • What needs to be done to recover from the problem?
  • What versions of software are vulnerable to the problem?
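One way to keep release notes honest about these questions is to give each known issue a fixed set of fields and flag the ones nobody has answered yet. This is a minimal sketch under that assumption; the `ReleaseNoteEntry` class and its field names are hypothetical and simply mirror the list above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReleaseNoteEntry:
    # Hypothetical fields, one per question a customer would ask.
    symptom: str            # what happens if the problem is hit
    trigger: str            # conditions under which it is hit
    likelihood: str         # how likely it is, and whether it is deterministic
    avoidance: str          # what can be done to avoid it
    recovery: str           # what needs to be done to recover
    affected_versions: str  # which software versions are vulnerable

    def missing(self) -> list[str]:
        """Names of the questions that still have no answer."""
        return [name for name, value in vars(self).items() if not value.strip()]
```

An entry whose `missing()` list is not empty is a prompt to go find the answer before the notes are published, not a reason to hold the entry back.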
The worst thing you could find yourself doing is explaining to a customer how you knew about the potential for a problem but didn't tell them.