How does capacity utilisation affect performance?

A couple of days ago I saw an email asking “what is the recommendation for maximum capacity utilization that will not cause performance degradation”. On the one hand this kind of question annoys me, because for the most part it’s born out of some of the usual FUD that gets thrown at NetApp on a regular basis. On the other hand, even though correctly engineering storage for consistent performance rarely, if ever, boils down to any single metric, understanding capacity utilisation and its impact on performance is an important aspect of storage design.

Firstly, for the record, I’d like to reiterate that the performance of every storage technology I’m aware of that is based on spinning disks degrades as the amount of capacity consumed increases.

With that out of the way, I have to say that, as usual, the answer to the question of how capacity utilisation affects performance is “it depends”. For the most part, though, when this question is asked it’s about high-performance, write-intensive applications such as VDI, some kinds of online transaction processing, and email systems.

If you’re looking at that kind of workload, you can always check out good old TR-3647, which talks specifically about write-intensive, high-performance workloads, where it says:

The Data ONTAP data layout engine, WAFL®, optimizes writes to disk to improve system performance and disk bandwidth utilization. WAFL optimization uses a small amount of free or reserve space within the aggregate. For write-intensive, high-performance workloads we recommend leaving available approximately 10% of the usable space for this optimization process. This space not only ensures high-performance writes but also functions as a buffer against unexpected demands of free space for applications that burst writes to disk

I’ve seen other benchmarks using synthetic workloads where a knee in the performance curve begins to appear at between 98% and 100% of the usable capacity after the WAFL reserve is taken away. I’ve also seen performance issues when people completely fill all the available space and then hit it with lots of small random overwrites (especially misaligned small random overwrites). This is not unique to WAFL, which is why it’s generally a bad idea to fill up all the space in any data structure that is subjected to heavy random write workloads.

Having said that, for the vast majority of workloads you’ll get more IOPS per spindle out of a NetApp array at all capacity points than you will out of any similarly priced and configured box from another vendor.

Leaving the FUD aside (the complete rebuttal of which requires a fairly deep understanding of ONTAP’s performance architecture), when considering capacity and its effect on performance on a NetApp FAS array it’s worth keeping the following points in mind.

  1. For any given workload and array type, you’re only ever going to get a fairly limited number of transactions per 15K RPM disk, usually fewer than 250
  2. Array performance is usually determined by how many disks you can throw at the workload
  3. Most array vendors bring more spindles to the workload by using RAID-10, which uses twice the number of disks for the same capacity; NetApp uses RAID-DP, which does not automatically double the number of spindles
  4. In most benchmarks (check out SPC-1), NetApp uses all but 10% of the available space (in line with TR-3647), which allows the user to use approximately 60% of the raw capacity while still achieving the same kind of IOPS per drive that most other vendors only achieve using 30% of the raw capacity. In other words, at the same performance per drive we offer 10 percentage points more usable capacity than the 50% of raw that the other vendors could theoretically attain using RAID-10.

The bottom line is that, even without dedupe, thin provisioning or anything else, you can store twice as much information in a FAS array for the same level of performance as most competing solutions using RAID-10.
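To put rough numbers on that, here is a minimal Python sketch of the arithmetic. The 100 TB raw figure is purely an assumption for illustration. The 60% and 30% fractions are the approximate SPC-1 figures mentioned above, and 50% is simply the ceiling that mirroring imposes on RAID-10.

```python
def usable_tb(raw_tb: float, usable_fraction: float) -> float:
    """Capacity available for data, given the fraction of raw capacity used."""
    return raw_tb * usable_fraction


RAW_TB = 100.0                # hypothetical raw capacity on the same spindles
NETAPP_FRACTION = 0.60        # ~60% of raw (approximate figure from the post)
RAID10_BENCH_FRACTION = 0.30  # ~30% of raw (approximate figure from the post)
RAID10_MAX_FRACTION = 0.50    # mirroring caps RAID-10 at half of raw

netapp_tb = usable_tb(RAW_TB, NETAPP_FRACTION)
raid10_tb = usable_tb(RAW_TB, RAID10_BENCH_FRACTION)
raid10_ceiling_tb = usable_tb(RAW_TB, RAID10_MAX_FRACTION)

# How much more data fits at the same IOPS per drive
ratio = netapp_tb / raid10_tb

print(f"RAID-DP usable:       {netapp_tb:.0f} TB")
print(f"RAID-10 as benchmarked: {raid10_tb:.0f} TB")
print(f"RAID-10 ceiling:      {raid10_ceiling_tb:.0f} TB")
print(f"Data stored at equal per-drive performance: {ratio:.1f}x")
```

Swap in your own raw capacity and fractions; the ratio only depends on the two fractions, not on the size of the array.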

While that is true, it’s worth mentioning that it does have one drawback. While the IOPS per spindle is more or less the same, the IOPS density measured in IOPS/GB on the NetApp SPC-1 results is about half that of the competing solutions (same IOPS, 2x as much data = half the density). While that is actually harder to do because you have a lower cache:data ratio, if you have an application that requires very dense IOPS/GB (like some VDI deployments, for example), then you might not be able to allocate all of that extra capacity to that workload. This, in my view, gives you three choices.

  1. Don’t use the extra capacity; just leave it as unused free space in the aggregate, which will make it easier to optimise writes
  2. Use that extra capacity for lower-tier workloads such as storing snapshots, a mirror destination or archives, and set those workloads to a low priority using FlexShare
  3. Put in a FlashCache card, which will double the effective number of IOPS per spindle (depending on workload, of course), and which is less expensive and less resource-consuming than doubling the number of disks

If you don’t do this, then you may run into a situation I’ve heard of in a few cases, where our storage efficiencies allowed the user to put too many hot workloads on not enough spindles. Unfortunately this is probably the basis for the “Anecdotal Evidence” that allows the NetApp capacity/performance FUD to be perpetuated. The FUD is inaccurate because the problem has less to do with the intricacies of ONTAP and WAFL, and far more to do with systems that were originally sized for a workload of X having a workload of 3X placed on them because there was still capacity available on Tier-1 disk, long after all the performance had been squeezed out of the spindles by other workloads.
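The density trade-off and the “sized for X, loaded with 3X” problem are both simple arithmetic, and a small Python sketch makes them concrete. The drive count, the 200 IOPS per drive and the capacities are assumptions chosen for illustration (the only figure from this post is “usually less than 250” per 15K drive), and the doubling from FlashCache is workload dependent, as noted in choice 3.

```python
def iops_density(total_iops: float, stored_gb: float) -> float:
    """IOPS available per GB of stored data."""
    return total_iops / stored_gb


DRIVES = 100           # hypothetical spindle count
IOPS_PER_DRIVE = 200   # hypothetical, below the ~250 per 15K drive noted earlier
total_iops = DRIVES * IOPS_PER_DRIVE

raid10_gb = 30_000     # hypothetical: ~30% of 100 TB raw
netapp_gb = 60_000     # hypothetical: ~60% of the same raw capacity

raid10_density = iops_density(total_iops, raid10_gb)
netapp_density = iops_density(total_iops, netapp_gb)

# Choice 3: assume the cache doubles effective IOPS per spindle
flashcache_density = iops_density(total_iops * 2, netapp_gb)

# A system sized for workload X that ends up serving 3X
sized_for_iops = total_iops
demanded_iops = 3 * sized_for_iops
oversubscription = demanded_iops / total_iops

print(f"RAID-10 density:      {raid10_density:.2f} IOPS/GB")
print(f"RAID-DP density:      {netapp_density:.2f} IOPS/GB")
print(f"With doubled IOPS:    {flashcache_density:.2f} IOPS/GB")
print(f"Demand vs. spindles:  {oversubscription:.1f}x")
```

Under these assumptions the density is halved by storing twice the data, restored if the effective IOPS really do double, and no amount of free capacity helps once demand is three times what the spindles can deliver.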

Keeping your storage users happy means not only managing the available capacity, but also managing the available performance. More often than not, you will run out of one before you run out of the other, and running an efficient IT infrastructure means balancing workloads between these two resources. Firstly, this means you have to spend at least some time measuring and monitoring both the capacity and the performance of your environment. Furthermore, you should also set your system up so that it’s easy to migrate and rebalance workloads across other resource pools, or to add performance to your existing workloads non-disruptively, which can be done via technologies such as Storage DRS in vSphere 5, or ONTAP’s DataMotion and Virtual Storage Tiering features.
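As a rough illustration of watching both resources at once, here is a minimal Python sketch. Everything in it is hypothetical: the `PoolStats` fields, the sample numbers and the 90% and 80% thresholds are assumptions for the example, not values taken from any NetApp tool or best-practice document.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    used_gb: float        # capacity consumed
    usable_gb: float      # capacity available to the administrator
    observed_iops: float  # measured peak workload
    max_iops: float       # what the spindles (plus cache) can sustain


def utilisation(stats: PoolStats) -> tuple:
    """Return (capacity utilisation, performance utilisation) as fractions."""
    return stats.used_gb / stats.usable_gb, stats.observed_iops / stats.max_iops


def limiting_resource(stats: PoolStats) -> str:
    """Name the resource that is closer to running out."""
    cap, perf = utilisation(stats)
    if perf > cap:
        return "performance"
    if cap > perf:
        return "capacity"
    return "balanced"


def needs_attention(stats: PoolStats, cap_threshold: float = 0.90,
                    perf_threshold: float = 0.80) -> list:
    """List the resources that have crossed their (hypothetical) thresholds."""
    cap, perf = utilisation(stats)
    alerts = []
    if cap >= cap_threshold:
        alerts.append("capacity")
    if perf >= perf_threshold:
        alerts.append("performance")
    return alerts


# A pool that is only a third full but nearly out of IOPS
pool = PoolStats(used_gb=20_000, usable_gb=60_000,
                 observed_iops=19_000, max_iops=20_000)
print(limiting_resource(pool), needs_attention(pool))
```

A pool like this one looks comfortable on a capacity report, which is exactly why the two numbers need to be tracked side by side.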

When it comes to measuring your environment so you can take action before the problems arise, NetApp has a number of excellent tools to monitor the performance of your storage environment. Performance Advisor gives you visualization and customised alerts and thresholds for the detailed inbuilt performance metrics available on every FAS Array, and OnCommand Insight Balance provides deeper reporting and predictive analysis of your entire virtualised infrastructure including non-NetApp hardware.

Whether you use NetApp’s tools or someone else’s, the important thing is that you use them, and take a little time out of your day to find out which metrics are important and what you should do when thresholds or high watermarks are breached. If you’re not sure about this for your NetApp environment, feel free to ask me here or, better still, open up a question in the NetApp communities, which have a broader constituency than this blog.

While I appreciate that it’s tempting to just fall back to old practices and overengineer Tier-1 storage so that there is little or no possibility of running out of IOPS before you run out of capacity, this is almost always incredibly wasteful. In my experience it has resulted in storage utilisation rates of less than 20%, and it drives the cost per GB for “Tier-1” storage to unsustainable and economically unjustifiable heights. The time may come when storage administrators are given the luxury of doing this again, or you may be in one of those rare industries where cost is no object, but unless you’ve got that luxury, it’s time to brush up on your monitoring, storage workload rebalancing and optimisation skills. Doing more with less is what it’s all about.

As always, comments and contrary views are welcomed.

Categories: Performance, Value
  1. bbk
    February 14, 2013 at 6:39 pm

    Hello Richard. The following phrase is not clear to me; could you please comment on it?
    >knee in the performance curve begins to be seen at between 98% and 100% of the usable capacity after WAFL reserve is taken away

    Does the knee in the performance curve begin to be seen between 98% and 100% of “Capacity-WAFL_reserve”, or between 98% and 100% of “Just Capacity”?

  2. February 18, 2013 at 10:35 am

    In the benchmark I saw, this was between 98% and 100% of the usable capacity of the aggregate from a storage administrator’s perspective. I’m not suggesting this is a high watermark you should be aiming for, just pointing out that different workloads will allow different levels of overall utilization.

    • February 18, 2013 at 5:04 pm

      Hi thanks for the reply.
      Can we put more than 100% of data into the aggregate (given that we have a 10% reserve for WAFL)?

      • February 23, 2013 at 9:08 pm

        The short answer is no, the WAFL reserve can’t be used to store user data; however, with compression and dedupe it’s quite possible to store more data than the volume size.

  3. bbk
    March 15, 2013 at 9:52 pm

    Again and again, competitors (usually HP and EMC) tell customers that NetApp has performance problems when it is filled up.

    Again and again we have to prove what we should not have to prove.

    Which result can you recommend to parry this once and for all?

    • March 16, 2013 at 12:35 pm

      Combat fear with facts

      1. All disk-based storage systems slow down as they fill up. This is due to a number of factors, but the most important one is that in most cases, as you fill your array, the size of your active data also increases, which means fewer cache hits, which means your disks have to work harder, and that means on average they take longer to service each IOP; i.e. as your array fills it gets slower. This is true for everyone.

      2. To overcome this issue, each vendor has recommendations around configuring their array for high performance. When we configure a system for performance, we recommend that you leave about 15% of the available space unused. This keeps the IOPS density at reasonable levels and gives us some space to make write optimisation easier. Other vendors use RAID10.

      3. NetApp arrays outperform other arrays in terms of iops per magnetic drive in all published benchmarks, at utilisation levels impossible to attain with RAID10. By following best practice you will get better performance and more usable space out of the same disks from NetApp than from any other competitive offering.

      4. There are some benchmarks of NetApp gear done by other vendors that show sudden performance degradation over time. EVERY one of these uses configs that go against our recommendations, e.g. lots of small aggregates and/or misaligned LUNs, and they also reference platforms that are very memory limited, with old versions of ONTAP that lack specific technologies like FSR and WAR that continuously optimise the disk layout. There are also some “do it yourself” tests that are meant to show the performance degradation the competitors talk about. These exploit the ability of ONTAP to hyper-optimise a nearly empty system. Personally I welcome these tests, but only if they run them for long enough to see the performance stabilise, which happens at a very high performance level, and if the competitor is happy to stand up an equivalent system and let the customer run those same tests on their equipment. In a side-by-side test with equivalent hardware, a FAS array will outperform the competition at all capacity points for almost any test. If we are 50% faster at the end of the test, do you think anyone will mind if we were 500% faster at the beginning?

  1. November 9, 2011 at 11:02 am
