There’s been an interesting conversation thread about the availability implications of scale-up vs scale-out in server virtualisation, initially at yellowbricks, then over yonder, and most recently at Scott Lowe’s blog.
What I find interesting is that the discussion has focussed almost entirely on the impact of server failure, and with the exception of two comments (one of them mine), so far nobody has mentioned storage, or indeed any other part of the IT infrastructure.
OK, so maybe worrying about complete data center outages is not generally part of a systems engineer’s brief, and there is an assumption that the networking and storage layers are configured with sufficient redundancy that they can be thought of as 100% reliable. This might account for the lack of concern over how many VMs are hosted on a particular storage array or LUN, or behind a given set of switches. Most discussions of how many virtual machines should be placed in a given datastore seem to focus on performance, LUN queue depths, and the efficiency of distributed vs centralised lock managers.
From a personal perspective, I don’t think this level of trust is entirely misplaced, but while most virtualisation-savvy engineers seem to be on top of the reliability characteristics of high-end servers, there doesn’t seem to be a corresponding level of expertise in evaluating the reliability characteristics of shared storage systems.
Working for NetApp gives me access to a body of research material that most people aren’t aware of. A lot of it is publicly available, but it can be a little hard to get your head around.
Part of the problem is that few formal studies analysing the reliability of storage system components have been published. Early work done in 1989 presented a reliability model based on the formula-derived, datasheet-specified MTTF of each component, assuming that component failures follow exponential distributions and that failures are independent.
Models based on these assumptions, in which systems are modelled as homogeneous Poisson processes, remain in common use today. However, research sponsored by NetApp shows that these models may severely underestimate the annual failure rates of important subsystems such as RAID groups, disk shelves/disk access enclosures, and their associated interconnects.
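To make the exponential-failure assumption concrete, here’s a minimal sketch of what that classic model predicts for a rebuild window. The MTTF and rebuild-time figures are illustrative assumptions, not vendor or field data, and the independence assumption is exactly what the research below calls into question.

```python
# Under the classic model, failure inter-arrival times are exponential,
# so the probability that a disk fails within a window t is 1 - exp(-t/MTTF).
from math import exp

def p_fail_within(t_hours: float, mttf_hours: float) -> float:
    """Probability an exponentially failing disk dies within t_hours."""
    return 1 - exp(-t_hours / mttf_hours)

# Assumed figures: 1,000,000-hour datasheet MTTF, 24-hour RAID-5 rebuild.
p_one = p_fail_within(24, 1_000_000)

# Chance that any of the 7 surviving disks in an 8-disk RAID-5 group
# fails during the rebuild (failures assumed independent).
p_any = 1 - (1 - p_one) ** 7
print(p_any)  # a small number, which is why the model looks reassuring
```

The point of the studies below is that real failures are neither exponential nor independent, so numbers like this can be badly optimistic.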
Two NetApp-sponsored studies, “A Comprehensive Study of Storage Subsystem Failure Characteristics” by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou and Arkady Kanevsky (April 2008, http://media.netapp.com/documents/dont-blame-disks-for-every-storage-subsystem-failure.pdf) and “A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)” by Jon G. Elerath and Michael Pecht (IEEE Transactions on Computers, Vol. 58, No. 3, March 2009, http://media.netapp.com/documents/rp-0046.pdf), contain sophisticated models supported by field data for evaluating the reliability of various storage array configurations. These reports are a little dense, so I’ll summarise some of the key findings below.
- Physical interconnect failures make up the largest share (27-68%) of storage subsystem failures; disk failures make up the second largest (20-55%).
- Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect.
- Spanning the disks of a RAID group across multiple shelves provides a more resilient configuration than keeping them all within a single shelf.
- State-of-the-art disk reliability models yield estimates of dual drive failure rates that are as much as 4,000 times greater than the commonly used Mean Time to Data Loss (MTTDL) based estimates.
- Latent defects are inevitable, and scrubbing latent defects is imperative to the reliability of RAID N + 1 schemes (RAID-4, RAID-5, RAID-1, RAID-10). As HDD capacity increases, the number of latent defects will also increase and render the MTTDL method less accurate.
- Although scrubbing is a viable method of eliminating latent defects, there is a trade-off between serving data and scrubbing. As the demand on the HDD increases, less time is available to scrub; if scrubbing is given priority, system response to demands for data suffers. An alternative is to accept latent defects and increase system reliability by moving to N + 2 redundancy (RAID-6). RAID-6 configurations allow RAID scrubs to be deferred to times when their performance impact will not affect production workloads.
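To put rough numbers on the N + 1 vs N + 2 trade-off, here’s a sketch of the textbook MTTDL formulas (the very estimates the Elerath and Pecht paper argues are optimistic). The disk MTTF and MTTR values are illustrative assumptions only.

```python
# Textbook MTTDL for a RAID group with m parity disks: data is lost only
# when m+1 disks fail within overlapping rebuild windows, giving
#   MTTDL = MTTF^(m+1) / (n*(n-1)*...*(n-m) * MTTR^m)
# This inherits the exponential/independent-failure assumptions criticised
# in the article; treat the outputs as upper bounds, not predictions.
from math import perm

def mttdl_hours(n_disks: int, parity: int, mttf: float, mttr: float) -> float:
    return mttf ** (parity + 1) / (perm(n_disks, parity + 1) * mttr ** parity)

# Assumed figures: 8 disks, 1,000,000-hour MTTF, 24-hour rebuild.
raid5 = mttdl_hours(8, 1, 1_000_000, 24)   # N + 1
raid6 = mttdl_hours(8, 2, 1_000_000, 24)   # N + 2
print(raid6 / raid5)  # extra margin RAID-6 buys under this simple model
```

Even this optimistic model shows N + 2 improving MTTDL by orders of magnitude, which is why RAID-6 can afford to defer scrubs.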
Another interesting thing you find once you start using these more sophisticated reliability models is that most RAID-5 raid sets have an availability percentage of around “three nines”, RAID-10 comes in at about “four nines”, and only RAID-6 gets close to the magical figure of “five nines” of availability. Don’t get me wrong, the array as an entire entity may have well over “five nines”, which is important because the failure of a single array can impact tens, if not hundreds, of servers, but at the individual RAID group level the availability percentages are way below that.
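If you prefer to think in downtime rather than nines, the conversion is straightforward (the availability figures here are just the approximate ones quoted above):

```python
# Translate an availability percentage into expected downtime per year.
HOURS_PER_YEAR = 24 * 365  # 8,760

def annual_downtime_hours(availability_pct: float) -> float:
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

print(annual_downtime_hours(99.9))         # "three nines": ~8.8 hours/year
print(annual_downtime_hours(99.99))        # "four nines": ~53 minutes/year
print(annual_downtime_hours(99.999) * 60)  # "five nines": ~5 minutes/year
```

Nearly nine hours a year of potential unavailability for a RAID-5 group reads very differently once hundreds of VMs sit behind it.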
In the good old days, when the failure of a single RAID group generally affected a single server, these kinds of availability percentages were probably OK. But when a LUN/RAID group is being used for a VMware datastore, where the failure of a single RAID group may impact tens, or possibly hundreds, of virtual machines, the reliability of the RAID group becomes as important as the availability of the whole array used to be.
If you’re going to put all your eggs in one basket, then you’d better be sure that your basket is clad in titanium with pneumatic padding. This applies not just to getting the most out of your servers; it needs to go all the way through the infrastructure stack, down to the LUNs and RAID groups that store all that critical data.