
Measuring Array Reliability


In a Register article (http://www.theregister.co.uk/2013/02/11/storagebod_8feb13/), @storagebod asked vendors to disclose all their juicy reliability figures. This post is in response to that, though most of it comes from a preamble I wrote almost two years ago for an RFP response around system reliability, so it highlights a number of NetApp-specific technologies. It’s kind of dense, and some of the supporting information is getting a little old now; even so, I still think it’s accurate, and it helps to explain why vendors are careful about giving out single reliability metrics for disk arrays.

There have been few formal studies published analyzing the reliability of storage system components. Early work done in 1989 presented a reliability model based on formulas and the datasheet-specified MTTF of each component, assuming that component failures follow exponential distributions and that failures are independent. Models based on these assumptions, which treat systems as homogeneous Poisson processes, remain in common use today; however, research sponsored by NetApp shows that these models may severely underestimate the annual failure rates of important subsystems such as RAID groups, disk shelves/disk access enclosures, and their associated interconnects.
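To make that classical approach concrete, here is a minimal sketch (Python, using datasheet-style numbers I have invented purely for illustration, not vendor figures) of how an exponential, independent-failure model turns component MTTF figures into an annual failure rate and a series-system MTTF:

```python
import math

HOURS_PER_YEAR = 8760.0

def afr_from_mttf(mttf_hours):
    """Annual failure rate implied by an exponential (constant hazard) model.

    Under this assumption the probability of failing within one year is
    1 - exp(-t / MTTF); for large MTTF this is roughly 8760 / MTTF.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)

def series_mttf(component_mttfs):
    """Combine independent components in series: the system fails when any
    one component fails, so the failure rates (1/MTTF) simply add."""
    return 1.0 / sum(1.0 / m for m in component_mttfs)

# Illustrative datasheet-style numbers only (not measured values):
disk_mttf = 1_200_000              # hours, a typical HDD datasheet claim
shelf_interconnect_mttf = 500_000  # hours, assumed for the example

print(f"Disk AFR: {afr_from_mttf(disk_mttf):.4%}")
print(f"Series MTTF (24 disks + interconnect): "
      f"{series_mttf([disk_mttf] * 24 + [shelf_interconnect_mttf]):,.0f} hours")
```

The field studies discussed below show why these tidy figures can be badly misleading in practice.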

Two NetApp-sponsored studies, “A Comprehensive Study of Storage Subsystem Failure Characteristics” by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou and Arkady Kanevsky, April 2008 (http://media.netapp.com/documents/dont-blame-disks-for-every-storage-subsystem-failure.pdf), and “A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)” by Jon G. Elerath and Michael Pecht, IEEE Transactions on Computers, Vol. 58, No. 3, March 2009 (http://media.netapp.com/documents/rp-0046.pdf), contain sophisticated models, supported by field data, for evaluating the reliability of various storage array configurations. These findings, and their impact on how NetApp designs its systems, are summarized below.

Physical interconnect failures make up the largest part (27-68%) of storage subsystem failures, and disk failures make up the second largest part (20-55%). This is addressed via redundant shelf interconnects and dual-parity RAID techniques.

  • Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect. This is the underlying reason for including redundant interconnects.
  • Spanning the disks of a RAID group across multiple shelves provides a more resilient solution than keeping them within a single shelf. Data ONTAP’s default RAID creation policies follow this model; in addition, SyncMirror provides an extra level of redundancy and protection for the most critical data.
  • State-of-the-art disk reliability models yield estimates of dual drive failures that are as much as 4,000 times greater than the commonly used Mean Time to Data Loss (MTTDL) based estimates (a toy MTTDL calculation is sketched after this list).
  • Latent defects are inevitable, and scrubbing latent defects is imperative to RAID N + 1 reliability. As HDD capacity increases, the number of latent defects will also increase and render the MTTDL method less accurate.
  • Although scrubbing is a viable method to eliminate latent defects, there is a trade-off between serving data and scrubbing. As the demand on the HDD increases, less time is available to scrub; if scrubbing is given priority, then system response to demands for data is reduced. A second alternative is to accept latent defects and increase system reliability by increasing redundancy to N + 2 (RAID 6).
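For reference, the sketch below (Python, with assumed MTTF, MTTR and group-size values chosen purely for illustration) shows the textbook MTTDL formulas for N + 1 and N + 2 RAID groups that the field studies criticise; the point above is that real dual-drive-failure rates can be thousands of times higher than the N + 1 figure implies once latent defects and wear-out are accounted for:

```python
def mttdl_single_parity(n_disks, mttf, mttr):
    """Classic MTTDL estimate for an N+1 (single-parity) RAID group:
    data is lost if a second disk fails while the first is rebuilding."""
    return mttf ** 2 / (n_disks * (n_disks - 1) * mttr)

def mttdl_dual_parity(n_disks, mttf, mttr):
    """Classic MTTDL estimate for an N+2 (dual-parity) RAID group:
    three overlapping failures are needed before data is lost."""
    return mttf ** 3 / (n_disks * (n_disks - 1) * (n_disks - 2) * mttr ** 2)

# Assumed figures, for illustration only.
mttf_hours = 1_200_000   # datasheet disk MTTF
mttr_hours = 24          # time to rebuild onto a spare
group_size = 16          # disks in the RAID group

for name, fn in [("N+1", mttdl_single_parity), ("N+2", mttdl_dual_parity)]:
    years = fn(group_size, mttf_hours, mttr_hours) / 8760
    print(f"{name} MTTDL is roughly {years:,.0f} years (per the idealised model)")
```

These numbers look comfortingly enormous, which is exactly why the Elerath and Pecht work matters: the idealised model assumes away the latent defects and age-dependent failure rates that dominate real behaviour.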

Because of the difficulty of creating a readily understood model that accurately reflects the complex interrelations of component reliability in systems with a mixture of exponential and Weibull component failure distributions, NetApp instead publishes independently audited reliability metrics based on a rolling six-month audit.
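To illustrate why that mixture is awkward, here is a small sketch (Python, with made-up parameters, not fitted to any field data) showing that a Weibull wear-out component’s failure rate climbs with age while an exponential component’s rate stays flat, so any single quoted failure rate for the mixture depends on how old the installed base happens to be:

```python
def weibull_hazard(age_years, scale_years, shape):
    """Instantaneous failure rate of a Weibull component at a given age.
    With shape > 1 the rate keeps climbing (wear-out), so the annual
    failure rate you observe depends on how old the population is."""
    return (shape / scale_years) * (age_years / scale_years) ** (shape - 1)

# Purely illustrative parameters.
SHAPE, SCALE = 1.4, 30.0        # Weibull wear-out component
EXP_RATE = 1.0 / 30.0           # exponential component: constant by definition

for age in (1, 2, 3, 4, 5):
    print(f"Age {age}y: Weibull rate {weibull_hazard(age, SCALE, SHAPE):.4f}/yr, "
          f"exponential rate {EXP_RATE:.4f}/yr")
```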

Run hours and downtime are collected via AutoSupport reports over a six-month rolling time period, from customer systems with active NetApp support agreements.

–       Availability data is automatically reported for >15,000 FAS systems (FAS6000, FAS3000, FAS2000, FAS900 & FAS200)

System downtime is counted when it is caused by the NetApp system:

–       Hardware failures (e.g., controller, expansion cards, shelves, disks)

–       Software failures

–       Planned outages associated with replacing a failed component (FRU)

System downtime is not counted as a result of:

–       Power and other environmental failures (e.g., excessive ambient temp)

–       Operator-initiated downtime

System Availability = 1 − [sum of all downtime / sum of total run time]
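A minimal sketch of that calculation (Python; the per-system figures below are invented for illustration and have nothing to do with NetApp’s actual AutoSupport data), aggregated across a fleet and converted into “nines”:

```python
import math

def fleet_availability(systems):
    """Availability across a fleet: 1 - (total downtime / total run time),
    aggregated over every system rather than averaged per system."""
    total_runtime = sum(run for run, _ in systems)
    total_downtime = sum(down for _, down in systems)
    return 1.0 - total_downtime / total_runtime

def nines(availability):
    """Express availability as a 'number of nines' (e.g. 0.99999 -> ~5)."""
    return -math.log10(1.0 - availability)

# (run_hours, downtime_hours) per system over the rolling 6-month window;
# invented numbers for illustration only.
fleet = [(4380, 0.0), (4380, 0.05), (4380, 0.0), (4380, 0.2)]

a = fleet_availability(fleet)
print(f"Fleet availability: {a:.6f} ({nines(a):.1f} nines)")
```

Note that aggregating downtime over the whole fleet, rather than averaging per-system availability, is what lets a handful of long outages show up honestly in the headline figure.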

The graph at the top of this post shows the availability range of all FAS models. The rising black line at the bottom represents the introduction of a new FAS array, which started out at over “five nines”; over time, as a greater population of machines was deployed, the average reliability increased, trending towards the “six nines” of availability achieved by our most commonly deployed array models, shown in the blue line at the top.

The other interesting thing about the way we measure downtime is that it discounts operator-initiated downtime. Given that most hardware systems from reputable vendors are very reliable, this may be the largest cause of overall system downtime. Clustered ONTAP was designed specifically to eliminate, or at the very least substantially mitigate, the need for planned downtime for storage operations, leaving data center outages as the only major cause of system downtime, and with SnapMirror we can help mitigate that one too.

As always comments and criticisms are welcome.

Regards

John

  1. Nikolay
    March 26, 2013 at 10:20 pm

    Hi John,
    Very interesting and informative post (as usual), but after reading the rather deep NetApp doc “Don’t blame disks for every storage subsystem failure” I’ve got one question. The two major categories in the classification of storage subsystem failures are disk subsystem failures and physical interconnect failures. The fact that these two types of failure are directly compared in the document seems quite unfair to me, because the disk subsystem (especially disk FC or SAS connectors) is relatively isolated from potential external destructive human or environmental actions, while interconnection components such as cables and connectors between disk shelves and SAN switches are physically external to the rest of the storage system hardware. So I’m trying to say that the reason for interconnect failures might not be factory defects of storage system components or bad driver software (both of which are customer-independent reasons) but might be a consequence of customer actions such as server room maintenance, accidental cable damage by an administrator, or something like that.
    For me it would be much more interesting to see some stats for storage subsystem failures caused by controller or disk shelf internal component failures, excluding the disk subsystem components. I understand that such failures are extremely rare, but obviously they will eventually happen. I have heard several customers ask: “OK, we have a dual-controller storage system and dual-parity ultra-reliable RAID-DP, but are there some other things in our storage system that are not doubled and may lead to complete SAN failure (in other words, which turn the SAN into the notorious ‘single point of failure’)?” And unfortunately, as far as I understand after reading the mentioned NetApp document, the answer to this question is ‘yes’. Please correct me if I’m wrong :)

    • April 3, 2013 at 10:57 am

      I’ve been wondering about the interconnect issue since I read this report a few years ago; since then I’ve been aware of a number of issues caused directly or indirectly by the cables. What isn’t clear is whether these were due to a manufacturing fault, poor installation, subsequent damage caused by people fiddling about in the back of the racks, or environmental stress caused by temperature fluctuations. What does seem clear is that external cables are probably the most fragile physical aspect of the system. Even outside of the physical layer problems, the next two networking layers also tend to be a little fragile. Keep in mind that when the report was written the vast majority of interconnects between disk controllers and shelves (NetApp’s and other vendors’) were based on Fibre Channel Arbitrated Loop (FC-AL), which is a “built to a price” version of the reliable Fibre Channel infrastructure we all know and sometimes love. The only exception to this that I’m aware of was NetApp’s MetroCluster, which used a fully switched Fibre Channel infrastructure to attach the disks to the controllers.

      As I said, FC-AL was built to a price and had some serious architectural issues, including a nasty thing called a LIP storm. NetApp and others spent a lot of time and money on things like electronically switched hubs (e.g. NetApp ESH modules, and others), but only so much could be done, so a badly behaved disk, or even a marginally faulty cable with intermittent problems caused by environmental stress, could be the root cause of much nastier problems at layer 2 of the interconnect. This isn’t a cable issue so much as it is a cabling issue. NetApp found that having a redundant loop allowed us to identify and, more importantly, non-disruptively fix the problem. This is one of the main reasons NetApp insists on, or at least strongly recommends, multipath HA (MPHA) cabling between the controllers and the shelves for all of its systems. Even after we moved to the much more reliable switched SAS infrastructure for our shelf interconnects, over a large population (hundreds of thousands) of shelves we know that there is a statistically significant improvement in the reliability of an MPHA configuration, especially when coupled with an additional out-of-band management interface (ACP).

      Other than the cable and layer-2 challenges I mentioned above, there are also HBA cards, SFPs, firmware, drivers and a whole host of other things that can cause problems with the connection between the controller and the disks. If you lose access to a whole shelf of disks, then even dual-parity protection isn’t going to help you. If this is something you are really worried about, then you might also want to consider using SyncMirror, which additionally mirrors the shelves; however, outside of MetroCluster configurations I’ve never seen this implemented. I imagine if you’re sufficiently worried about shelf failure, you’d be more rationally worried about the statistically larger issue of data-center failure.

      So, to answer your question:

      we have a dual-controller storage system and dual-parity ultra-reliable RAID-DP, but are there some other things in our storage system that are not doubled and may lead to complete SAN failure (in other words, which turn the SAN into the notorious “single point of failure”)? And unfortunately, as far as I understand after reading the mentioned NetApp document, the answer to this question is ‘yes’

      You are correct, there are things that can affect the overall reliability of the controller pair, though the biggest single factor is administrator error, and the second biggest is the datacenter the array is sitting in. The third used to be cabling, but only on controllers without MPHA.

      The good news is that back in 2010 NetApp mandated the use of MPHA cabling in all NetApp HA systems, and even before that it was the default configuration (you had to deselect it, and nobody I dealt with ever did that for an HA system), so it’s extremely likely that any production implementation you are working with is already protected. I’d also recommend taking a look through TR-3838, the Storage Subsystem Configuration Guide, to make sure you are working within best practice. You may also want to take the additional step of implementing the ACP (Alternate Control Path) functionality built into all of our SAS shelves as an extra level of protection.

      As far as administrator error goes, nothing replaces good operational procedures, which should include hooking your system up to AutoSupport and checking the risks and warnings section of My AutoSupport on a regular basis. You should also read and act on any product bulletins which NetApp may send you. Finally, datacenter-level failures should be addressed with SnapMirror to a separate array and a good D/R plan.

      Regards
      John Martin

