There, but for the grace of God …
There are few incidents that can truly be called disasters; things like the World Trade Center bombing, the Boxing Day tsunami, and the Ash Wednesday bushfires. Whether or not you think a major failure in IT infrastructure can be called a disaster in the same breath as those true tragedies, the recent and very public failure of the EMC storage infrastructure in the state of Virginia is the kind of event none of us should wish on anyone.
While we all like to see the mighty taken down a peg or two, there’s a little too much schadenfreude over this incident from the storage community for my taste. Most of us have had at least one incident in our careers that we are very glad never got the coverage this has, and my heart goes out to everyone involved… “There, but for the grace of God, go I…”
I must say though, I’m a little surprised at what appears to be a finger-pointing exercise with a focus on operator error, even though it would confirm my belief that Mean Time Before Cock-up (MTBC) is a more important metric than Mean Time Between Failures (MTBF). Based on the nature of the outage and some subsequent reports, it looks like there was a failure of not just one but two memory cache boards in the DMX3. If so, I’d have to agree with statements I’ve seen in the press saying the chances of that happening are incredibly unlikely, or even unheard of. In a FAS array this would be equivalent to an NVRAM failure followed by a second NVRAM failure during takeover, though even then, at worst, the array would be back up with no loss of data consistency within a few hours. Having said that, the chances of either of these kinds of double-failure events happening are almost unimaginable, but certainly, as recently shown, not impossible. How “an operator [using] an out-of-date procedure to execute a routine service operation during a planned outage” could cause that kind of double failure is kind of beyond me, and has changed my opinion on the DMX’s supposedly rock-solid architecture.
What I believe happened was not a failure of EMC engineering (which I highly respect), or even a failure of the poor tech who followed the “outdated procedure”, but rather a “failure of imagination”.
In this case the unimaginable happened: a critical component that “never failed” did fail, which provides a valuable lesson to everyone who builds, operates and funds mission-critical IT infrastructure. Regardless of whether the problem was caused by faulty hardware, tired technicians, or stray meteorites, there really is no substitute for defense in depth. As a customer / integrator that means:
- redundant components
- redundant copies of data on both local and remote hardware
- well-rehearsed D/R plans
It doesn’t matter what the MTBF figures say; you have to design on the assumption that something will fail, and then do everything within your time, skill and budget to mitigate that failure. If exposures remain, then everyone at risk from them needs to be aware of them, and of what risks you’re taking on their behalf. We wouldn’t expect anything less from our doctors; I don’t see why we shouldn’t hold ourselves to that same high standard.
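A quick back-of-envelope calculation shows why designing on the assumption of failure matters even for “almost unimaginable” double failures. All the figures below are illustrative assumptions of mine (not vendor data): a hypothetical component MTBF, repair window, and fleet size. Per array the double failure looks vanishingly rare; across a large installed base over a service life, it becomes something you should expect to see.

```python
# Back-of-envelope estimate of fleet-wide double-failure events.
# Every number here is an assumption for illustration only,
# not a real figure for any vendor's hardware.

MTBF_HOURS = 100_000      # assumed MTBF of a single redundant component
REPAIR_HOURS = 24         # assumed window before a failed component is replaced
FLEET_SIZE = 10_000       # assumed number of arrays in the field
SERVICE_YEARS = 5         # service life considered

HOURS_PER_YEAR = 24 * 365

# Probability a component fails in any given hour (simple exponential approximation).
p_fail_per_hour = 1 / MTBF_HOURS

# Expected first failures per array over the service life.
first_failures = p_fail_per_hour * HOURS_PER_YEAR * SERVICE_YEARS

# Probability the redundant partner also fails inside the repair window.
p_second_in_window = p_fail_per_hour * REPAIR_HOURS

# Expected double-failure events across the whole fleet.
expected_double_failures = first_failures * p_second_in_window * FLEET_SIZE
print(f"Expected fleet-wide double failures: {expected_double_failures:.2f}")
```

With these made-up numbers the per-array odds of a double failure are roughly one in a hundred thousand, yet the fleet as a whole can expect about one such event over the service life; which is why “it will never happen” is a dangerous design assumption.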
As vendors, it’s our responsibility to make features like snapshots, mirroring, and rapid recovery affordable and easy to use, and to do everything we can to encourage our customers to implement them effectively. From my perspective NetApp does a good job of this, and that’s one of the reasons I like working there.
As more infrastructure gets moved into external clouds, I think it’s inevitable we’re going to hear a lot more about incidents like this as they become more public in their impact. Practices that were OK in the 1990s no longer work in large publicly hosted infrastructures, where many of the old assumptions about deploying infrastructure don’t hold true.
Hopefully everyone responsible for this kind of multi-tenant infrastructure is reviewing their deployment to make sure they’re not going to be next week’s front page news.