There are few incidents that can truly be called disasters: things like the World Trade Center bombing, the Boxing Day tsunami, and the Ash Wednesday bushfires. Whether or not you think a major failure in IT infrastructure can be called a disaster just like those true tragedies, the recent and very public failure of the EMC storage infrastructure in the state of Virginia is the kind of event none of us should wish on anyone.
While we all like to see the mighty taken down a peg or two, there’s a little too much schadenfreude over this incident from the storage community for my taste. Most of us have had at least one incident in our careers that we are very glad never got the coverage this one has, and my heart goes out to everyone involved … “There, but for the grace of God, go I …”
I must say though, I’m a little surprised by what appears to be a finger-pointing exercise focused on operator error, even though it would confirm my belief that Mean Time Before Cock-up (MTBC) is a more important metric than Mean Time Before Failure (MTBF). Based on the nature of the outage and some subsequent reports, it looks like there was a failure of not just one but two memory cache boards in the DMX3. If so, I’d have to agree with statements I’ve seen in the press saying that this is incredibly unlikely or even unheard of. In a FAS array this would be equivalent to an NVRAM failure followed by a second NVRAM failure during takeover, though even then, at worst, the array would be back up with no loss of data consistency within a few hours. Having said that, the chances of either of these kinds of double failure happening are almost unimaginably small but, as recently shown, certainly not zero. How “an operator [using] an out-of-date procedure to execute a routine service operation during a planned outage” could cause that kind of double failure is beyond me, and it has changed my opinion of the DMX’s supposedly rock-solid architecture.
What I believe happened was not a failure of EMC engineering (which I highly respect), or even a failure of the poor tech who followed the “outdated procedure”, but rather a “failure of imagination”.
In this case the unimaginable happened: a critical component that “never failed” did fail, which provides a valuable lesson to everyone who builds, operates and funds mission-critical IT infrastructure. Regardless of whether the problem was caused by faulty hardware, tired technicians, or stray meteorites, there really is no substitute for defense in depth. As a customer / integrator that means:
- redundant components,
- redundant copies of data on both local and remote hardware,
- well-rehearsed D/R plans.
It doesn’t matter what the MTBF figures say: you have to design on the assumption that something will fail, and then do everything within your time, skill and budget to mitigate that failure. If there are still exposures, then everyone who is at risk from those exposures needs to be aware of them and of the risks you’re taking on their behalf. We wouldn’t expect anything less from our doctors, and I don’t see why we shouldn’t hold ourselves to the same high standard.
As vendors it’s our responsibility to make features like snapshots, mirroring, and rapid recovery affordable and easy to use, and to do everything we can to encourage our customers to implement them effectively. From my perspective NetApp does a good job of this, and that’s one of the reasons I like working there.
As more infrastructure gets moved into external clouds, I think it’s inevitable that we’re going to hear a lot more about incidents like this as they become more public in their impact. Practices that were OK in the 1990s no longer work in large publicly hosted infrastructures, where many of the old assumptions about deploying infrastructure don’t hold true.
Hopefully everyone responsible for this kind of multi-tenant infrastructure is reviewing their deployment to make sure they’re not going to be next week’s front page news.
There’s been an interesting conversation thread about the availability implications of scale up vs scale out in server virtualisation, initially at yellowbricks, then over yonder, and most recently at Scott Lowe’s blog.
What I find interesting is that the question focussed almost entirely on the impact of server failure and, with the exception of two comments (one of them mine), so far none of them have mentioned storage, or indeed any other part of the IT infrastructure.
OK, so maybe worrying about complete data center outages is not generally part of a system engineer’s brief, and there is an assumption that the networking and storage layers are configured with sufficient levels of redundancy that they can be thought of as 100% reliable. This might account for the lack of concern over how many VMs are hosted on a particular storage array or LUN, or via a set of switches. Most of the discussions around how many virtual machines should be put within a given datastore seem to focus on performance, LUN queue depths, and the efficiency of distributed vs centralised lock managers.
From a personal perspective, I don’t think this level of trust is entirely misplaced, but while most virtualisation savvy engineers seem to be on top of the reliability characteristics of high-end servers, there doesn’t seem to be a corresponding level of expertise in evaluating the reliability characteristics of shared storage systems.
Working for NetApp gives me access to a body of research material that most people aren’t aware of. A lot of it is publicly available, but it can be a little hard to get your head around.
Part of the problem is that there have been few formal studies published analyzing the reliability of storage system components. Early work done in 1989 presented a reliability model based on the formula-derived and datasheet-specified MTTF of each component, assuming that component failures follow exponential distributions and are independent of each other.
Models based on these assumptions, and on the idea that systems should be modeled using homogeneous Poisson processes, remain in common use today. However, research sponsored by NetApp shows that these models may severely underestimate the annual failure rates for important subsystems such as RAID and Disk Shelves/Disk Access Enclosures and their associated interconnects.
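To see why the classic approach looks so reassuring, here is a minimal sketch of the textbook MTTDL approximation for a single-parity RAID group, which rests on exactly the exponential, independent-failure assumptions described above. This is the commonly quoted formula, not the model from the NetApp-sponsored research, and the MTTF, rebuild time and group size used here are illustrative assumptions rather than field data.

```python
HOURS_PER_YEAR = 8760.0


def mttdl_single_parity(mttf_hours: float, mttr_hours: float, disks: int) -> float:
    """Textbook MTTDL approximation for one single-parity RAID group.

    Assumes exponentially distributed, independent disk failures and
    ignores latent sector defects, interconnects and shelves:
        MTTDL ~= MTTF^2 / (N * (N - 1) * MTTR)
    where N is the total number of disks in the group.
    """
    if disks < 2:
        raise ValueError("a parity group needs at least two disks")
    if mttf_hours <= 0 or mttr_hours <= 0:
        raise ValueError("MTTF and MTTR must be positive")
    return mttf_hours ** 2 / (disks * (disks - 1) * mttr_hours)


# Illustrative inputs only (assumptions, not figures from any study):
datasheet_mttf_hours = 1_000_000.0  # a typical datasheet-style MTTF claim
rebuild_hours = 24.0                # assumed time to rebuild a failed disk
disks_in_group = 8                  # assumed RAID group size

mttdl_years = (
    mttdl_single_parity(datasheet_mttf_hours, rebuild_hours, disks_in_group)
    / HOURS_PER_YEAR
)
print(f"Naive MTTDL: roughly {mttdl_years:,.0f} years")
```

With inputs like these the formula suggests tens of thousands of years between data loss events, which is the kind of optimistic figure that models built on these assumptions tend to produce.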
Two NetApp-sponsored studies contain sophisticated models, supported by field data, for evaluating the reliability of various storage array configurations: “A Comprehensive Study of Storage Subsystem Failure Characteristics” by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou and Arkady Kanevsky (April 2008, http://media.netapp.com/documents/dont-blame-disks-for-every-storage-subsystem-failure.pdf), and “A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)” by Jon G. Elerath and Michael Pecht (IEEE Transactions on Computers, Vol. 58, No. 3, March 2009, http://media.netapp.com/documents/rp-0046.pdf). These reports are a little dense, so I’ll summarise some of the key findings below.
- Physical interconnect failures make up the largest part (27-68%) of storage subsystem failures; disk failures make up the second largest part (20-55%).
- Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect.
- Spanning the disks of a RAID group across multiple shelves provides a more resilient solution for storage subsystems than keeping them within a single shelf.
- State of the art disk reliability models yield estimates of dual drive failures that are as much as 4,000 times greater than the commonly used Mean Time to Data Loss (MTTDL) based estimates.
- Latent defects are inevitable, and scrubbing latent defects is imperative to RAID N + 1 (RAID-4, RAID-5, RAID-1, RAID-10) reliability. As HDD capacity increases, the number of latent defects will also increase and render the MTTDL method less accurate.
- Although scrubbing is a viable method to eliminate latent defects, there is a trade-off between serving data and scrubbing. As the demand on the HDD increases, less time will be available to scrub. If scrubbing is given priority, then system response to demands for data will be reduced. An alternative is to accept latent defects and increase system reliability by increasing redundancy to N + 2 (RAID-6). Configurations that use RAID-6 allow RAID scrubs to be deferred to times when their performance impact will not affect production workloads.
Another interesting thing you find once you start using these more sophisticated reliability models is that most RAID-5 raid sets have an availability percentage of around “three nines”, RAID-10 comes in at about “four nines”, and only RAID-6 gets close to the magical figure of “five nines” of availability. Don’t get me wrong, the array as an entire entity may have well over “five nines”, which is important because the failure of a single array can impact tens, if not hundreds, of servers, but at the individual RAID group level the availability percentages are way below that.
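To put those “nines” into more tangible terms, the following minimal sketch converts an availability figure into expected downtime per year. The mapping of RAID levels to nines simply repeats the rough figures quoted above; the function and variable names are my own, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(nines: int) -> float:
    """Expected unavailable minutes per year for an availability of N nines.

    Three nines means 99.9% available, i.e. unavailable 10**-3 of the time.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines
    return unavailability * MINUTES_PER_YEAR


# Rough per-RAID-group figures as quoted in the text above.
raid_group_nines = {"RAID-5": 3, "RAID-10": 4, "RAID-6": 5}

for raid_level, nines in raid_group_nines.items():
    minutes = downtime_minutes_per_year(nines)
    print(f"{raid_level}: about {nines} nines, ~{minutes:.1f} minutes/year")
```

Three nines works out to a little under nine hours a year, four nines to a little under an hour, and five nines to roughly five minutes.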
In the good old days, when the failure of a single RAID group generally affected a single server, these kinds of availability percentages were probably OK. But when a LUN/RAID group is being used for a VMware datastore, where the failure of a single RAID group may impact tens, or possibly hundreds, of virtual machines, the reliability of the RAID group becomes as important as the availability of the whole array used to be.
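A back-of-the-envelope way to see this is to multiply the expected downtime of a RAID group by the number of machines that depend on it. The sketch below does just that; the VM counts are hypothetical, and it makes the simplifying assumption that every VM on the group is down for the full duration of any RAID group outage.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a 365-day year


def expected_vm_hours_lost(availability: float, vms_affected: int) -> float:
    """Expected machine-hours of outage per year behind one RAID group.

    Simplification: all dependent machines are unavailable for the whole
    time the RAID group is unavailable.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if vms_affected < 0:
        raise ValueError("vms_affected cannot be negative")
    return (1.0 - availability) * HOURS_PER_YEAR * vms_affected


# Hypothetical consolidation levels for a "three nines" RAID group.
scenarios = (("single server", 1), ("small datastore", 20), ("large datastore", 200))

for label, vms in scenarios:
    lost = expected_vm_hours_lost(0.999, vms)
    print(f"{label}: {vms} machine(s), ~{lost:.0f} machine-hours lost per year")
```

The availability of the RAID group is the same in every scenario; what changes is how much of the business is standing behind it when it fails.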
If you’re going to put all your eggs in one basket, then you better be sure that your basket is clad in titanium with pneumatic padding. This applies not just to getting the most out of your servers, but needs to go all the way through the infrastructure stack down to the LUN and RAID groups that store all that critical data.