Measuring Array Reliability
In a Register article http://www.theregister.co.uk/2013/02/11/storagebod_8feb13/ @storagebod asked vendors to disclose all their juicy reliability figures. This post is in response to that, though most of it comes from a preamble I wrote almost two years ago to an RFP response around system reliability, so it highlights a number of NetApp-specific technologies. It's kind of dense, and some of the supporting information is getting a little old now; even so, I still think it's accurate, and it helps to explain why vendors are careful about giving out single reliability metrics for disk arrays.
There have been few formal studies published analyzing the reliability of storage system components. Early work done in 1989 presented a reliability model based on formulae and the datasheet-specified MTTF of each component, assuming that component failures follow exponential distributions and that failures are independent. Models based on these assumptions, which treat the system as a homogeneous Poisson process, remain in common use today; however, research sponsored by NetApp shows that these models may severely underestimate the annual failure rates of important subsystems such as RAID and disk shelves/disk access enclosures and their associated interconnects.
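To make that concrete, here is the kind of calculation those traditional models produce. This is a minimal sketch of the classic MTTDL formula for a single-parity RAID group under the exponential, independent-failure assumptions described above; the disk count, MTTF and rebuild time are invented for illustration, and it is exactly this style of estimate that the studies below call into question.

```python
# Classic MTTDL estimate for an N-disk single-parity RAID group, assuming
# independent, exponentially distributed (memoryless) disk failures.
# Data is lost when a second disk fails while the first is still rebuilding.

def mttdl_single_parity(n_disks: int, mttf_hours: float, mttr_hours: float) -> float:
    """MTTF^2 / (N * (N - 1) * MTTR) -- the textbook approximation."""
    return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)

# Hypothetical numbers: 14 disks, 1.2M-hour datasheet MTTF, 24-hour rebuild.
hours = mttdl_single_parity(14, 1.2e6, 24)
print(f"MTTDL ~ {hours / 8760:,.0f} years")  # a reassuringly enormous number
```

Numbers like these look wonderful on a datasheet, which is precisely the problem.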
Two NetApp-sponsored studies, “A Comprehensive Study of Storage Subsystem Failure Characteristics” by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou and Arkady Kanevsky, April 2008 (http://media.netapp.com/documents/dont-blame-disks-for-every-storage-subsystem-failure.pdf), and “A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)” by Jon G. Elerath and Michael Pecht, IEEE Transactions on Computers, Vol. 58, No. 3, March 2009 (http://media.netapp.com/documents/rp-0046.pdf), contain sophisticated models supported by field data for evaluating the reliability of various storage array configurations. These findings, and their impact on how NetApp designs its systems, are summarized below.
- Physical interconnect failures make up the largest part (27-68%) of storage subsystem failures; disk failures make up the second largest part (20-55%). This is addressed via redundant shelf interconnects and dual-parity RAID techniques.
- Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect. This is the underlying reason for including redundant interconnects.
- Spanning the disks of a RAID group across multiple shelves provides a more resilient solution than keeping them within a single shelf. Data ONTAP's default RAID creation policies follow this model; SyncMirror provides a further level of redundancy and protection for the most critical data.
- State-of-the-art disk reliability models yield estimates of dual drive failures that are as much as 4,000 times greater than the commonly used Mean Time to Data Loss (MTTDL) based estimates (see the sketch after this list).
- Latent defects are inevitable, and scrubbing latent defects is imperative to RAID N + 1 reliability. As HDD capacity increases, the number of latent defects will also increase and render the MTTDL method less accurate.
- Although scrubbing is a viable method to eliminate latent defects, there is a trade-off between serving data and scrubbing. As the demand on the HDD increases, less time will be available to scrub; if scrubbing is given priority, then system response to demands for data will be reduced. The alternative, which accepts that latent defects will occur but still increases system reliability, is to increase redundancy to N + 2, i.e. RAID 6.
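The sketch below is my own deliberately simplified toy, not the Elerath/Pecht model: it estimates the probability of data loss in an N + 1 group when loss can come either from a second whole-disk failure inside the rebuild window, or from a surviving disk carrying a latent defect when the rebuild reads it. Every rate in it is invented for illustration, but it shows why MTTDL-style arithmetic, which only counts the first path, can be off by orders of magnitude, and why scrub frequency (which drives p_latent down) matters so much.

```python
# Toy Monte Carlo: data loss in an N-disk single-parity group over `years`.
# Whole-disk failures are drawn as exponential for simplicity; the latent
# defect path is the one the MTTDL formula ignores. All parameters are made up.
import random

def p_data_loss(n_disks=14, afr=0.03, rebuild_h=24.0,
                p_latent=0.05, years=5, trials=200_000):
    """afr      -- annual whole-disk failure rate per drive
       p_latent -- chance a surviving disk holds an unscrubbed latent defect
                   at rebuild time (lower with more frequent scrubs)"""
    fail_rate_h = afr / 8760.0
    horizon_h = years * 8760.0
    losses = 0
    for _ in range(trials):
        times = sorted(random.expovariate(fail_rate_h) for _ in range(n_disks))
        if times[0] > horizon_h:
            continue                          # no failure, no exposure
        if times[1] - times[0] < rebuild_h:
            losses += 1                       # classic double-disk failure
        elif any(random.random() < p_latent for _ in range(n_disks - 1)):
            losses += 1                       # rebuild trips over a latent defect
    return losses / trials

print("frequent scrubs (p_latent=0.5%):", p_data_loss(p_latent=0.005))
print("rare scrubs     (p_latent=5%):  ", p_data_loss(p_latent=0.05))
```

Even with made-up numbers, the latent-defect path dominates the double-failure path, which is the intuition behind both aggressive scrubbing and RAID 6.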
Because of the difficulty of creating a readily understood model that accurately reflects the complex interrelations of component reliability in systems with a mixture of exponential and Weibull component failure distributions, NetApp instead publishes independently audited reliability metrics based on a rolling six-month audit.
Run hours and downtime are collected via AutoSupport reports over a rolling six-month period, from customer systems with active NetApp support agreements.
– Availability data is automatically reported for >15,000 FAS systems (FAS6000, FAS3000, FAS2000, FAS900 & FAS200)
System downtime is counted when caused by the NetApp system:
– Hardware failures (e.g., controller, expansion cards, shelves, disks)
– Software failures
– Planned outages associated with replacing a failed component (FRU)
System downtime is not counted as a result of:
– Power and other environmental failures (e.g., excessive ambient temp)
– Operator-initiated downtime
System Availability = 1 - (sum of all downtime / sum of total run time)
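As a worked example of that formula (with invented figures, not our audited results):

```python
# Worked example of the availability formula above -- figures are made up.
run_hours = 15_000 * 4_380          # ~15,000 systems over a six-month window
downtime_hours = 35.0               # sum of all qualifying downtime

availability = 1 - downtime_hours / run_hours
print(f"{availability:.7f}")        # 0.9999995 -> just over "six nines"

# For intuition, each extra "nine" shrinks the annual downtime budget tenfold:
for nines in (3, 4, 5, 6):
    print(f"{nines} nines: {8_760 * 60 * 10 ** -nines:8.2f} minutes/year")
```

Five nines allows roughly five minutes of qualifying downtime per system per year; six nines allows barely thirty seconds.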
The graph at the top of this post shows the availability range of all FAS models. The rising black line at the bottom represents the introduction of a new FAS array, which started out at better than “five nines”; over time, as a greater population of machines was deployed, the average reliability increased, trending towards the “six nines” of availability achieved by our most commonly deployed array models, shown in the blue line at the top.
The other interesting thing about the way we measure downtime is that it discounts operator-initiated downtime. Given that most hardware systems from reputable vendors are very reliable, this may well be the largest cause of overall system downtime. Clustered ONTAP was designed specifically to eliminate, or at the very least substantially mitigate, the need for planned downtime for storage operations, leaving data center outages as the only major cause of system downtime; with SnapMirror we can help mitigate that one too.
As always comments and criticisms are welcome.