How does capacity utilisation affect performance?
A couple of days ago I saw an email asking “what is the recommendation for maximum capacity utilization that will not cause performance degradation?”. On the one hand this kind of question annoys me, because for the most part it’s borne out of some of the usual FUD which gets thrown at NetApp on a regular basis. On the other hand, even though correctly engineering storage for consistent performance rarely, if ever, boils down to any single metric, understanding capacity utilisation and its impact on performance is an important aspect of storage design.
Firstly, for the record, I’d like to reiterate that the performance of every storage technology I’m aware of that is based on spinning disks decreases in proportion to the amount of capacity consumed.
With that out of the way, I have to say that, as usual, the answer to how capacity utilisation affects performance is “it depends”. For the most part, though, when this question is asked, it’s asked about high-performance, write-intensive applications such as VDI, some kinds of online transaction processing, and email systems.
If you’re looking at that kind of workload, then you can always check out good old TR-3647, which talks specifically about write-intensive, high-performance workloads, where it says:
The Data ONTAP data layout engine, WAFL®, optimizes writes to disk to improve system performance and disk bandwidth utilization. WAFL optimization uses a small amount of free or reserve space within the aggregate. For write-intensive, high-performance workloads we recommend leaving available approximately 10% of the usable space for this optimization process. This space not only ensures high-performance writes but also functions as a buffer against unexpected demands of free space for applications that burst writes to disk
I’ve seen other benchmarks using synthetic workloads where a knee in the performance curve begins to appear somewhere between 98% and 100% of the usable capacity left after the WAFL reserve is taken away. I’ve also seen performance issues when people completely fill all the available space and then hit it with lots of small random overwrites (especially misaligned small random overwrites). This is not unique to WAFL, which is why it’s generally a bad idea to fill up all the space in any data structure that is subjected to heavy random write workloads.
Having said that, for the vast majority of workloads you’ll get more IOPS per spindle out of a NetApp array at all capacity points than you will out of any similarly priced/configured box from another vendor.
Leaving the FUD aside (the complete rebuttal of which requires a fairly deep understanding of ONTAP’s performance architecture), when considering capacity and its effect on performance on a NetApp FAS array it’s worth keeping the following points in mind.
- For any given workload and array type, you’re only ever going to get a fairly limited number of transactions per 15K RPM disk, usually fewer than 250
- Array performance is usually determined by how many disks you can throw at the workload
- Most array vendors bring more spindles to the workload by using RAID-10, which uses twice as many disks for the same capacity; NetApp uses RAID-DP, which does not automatically double the spindle count
- In most benchmarks (check out SPC-1), NetApp uses all but 10% of the available space (in line with TR-3647), which allows the user to use approximately 60% of the RAW capacity while still achieving the same kinds of IOPS/drive that most other vendors are only able to achieve using 30% of the RAW capacity. I.e. at the same performance per drive we offer 10% more usable capacity than the other vendors could theoretically attain using RAID-10 (which tops out at 50% of RAW due to mirroring).
The bottom line is that, even without dedupe or thin provisioning or anything else, you can store twice as much information in a FAS array for the same level of performance as most competing solutions using RAID-10.
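As a back-of-envelope illustration of the capacity arithmetic behind these points — the parameters below are illustrative assumptions on my part, not official NetApp or benchmark figures:

```python
# Rough usable-capacity comparison for the RAID-DP vs RAID-10 claim above.
# All parameters are illustrative assumptions, not vendor specifications.

def usable_fraction_raid_dp(group_size=16, right_sizing=0.90,
                            fs_reserve=0.10, perf_reserve=0.10):
    """Fraction of raw capacity usable with RAID-DP: two parity disks per
    RAID group, minus drive right-sizing, a filesystem reserve, and the
    ~10% free-space buffer TR-3647 recommends for write optimisation."""
    data_disks = (group_size - 2) / group_size
    return data_disks * right_sizing * (1 - fs_reserve) * (1 - perf_reserve)

def usable_fraction_raid_10(fill_limit=0.60):
    """RAID-10 mirroring halves raw capacity outright; configurations
    chasing IOPS/drive often fill only part of the remainder."""
    return 0.5 * fill_limit

dp = usable_fraction_raid_dp()    # roughly 0.64 of raw with these assumptions
r10 = usable_fraction_raid_10()   # roughly 0.30 of raw with these assumptions
print(f"RAID-DP usable: {dp:.0%}, RAID-10 usable: {r10:.0%}, ratio: {dp / r10:.1f}x")
```

With these assumed parameters the usable fractions land near the ~60% and ~30% figures quoted above, i.e. roughly twice the usable capacity per raw terabyte at the same IOPS/drive.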
While that is true, it’s worth mentioning that it does have one drawback. While the IOPS/spindle is more or less the same, the IOPS density, measured in IOPS/GB, of the NetApp SPC-1 results is about half that of the competing solutions (same IOPS, twice as much data = half the density). Sustaining that is actually harder to do, because you have a lower cache:data ratio; even so, if you have an application that requires very dense IOPS/GB (some VDI deployments, for example), then you might not be able to allocate all of that extra capacity to that workload. This, in my view, gives you three choices.
- Don’t use the extra capacity; just leave it as unused free space in the aggregate, which will make it easier to optimise writes
- Use that extra capacity for lower-tier workloads such as snapshots, mirror destinations, or archives, and set those workloads to a low priority using FlexShare
- Put in a FlashCache card, which will double the effective number of IOPS per spindle (depending on workload, of course), and which is less expensive and less resource-hungry than doubling the number of disks
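To see why a read cache can substitute for spindles in the third option, here is a deliberately simplified model — the per-disk IOPS figure and hit rates are assumptions, and real behaviour depends heavily on the workload, as noted above:

```python
def effective_read_iops_per_spindle(disk_iops=220, cache_hit_rate=0.5):
    """With cache hit rate h, only (1 - h) of reads reach disk, so each
    spindle can front roughly disk_iops / (1 - h) read IOPS. A 50% hit
    rate therefore doubles effective read IOPS per spindle. The 220
    IOPS/disk figure is an assumed value for a 15K RPM drive."""
    return disk_iops / (1 - cache_hit_rate)

print(effective_read_iops_per_spindle())           # 440.0 at the assumed 50% hit rate
print(effective_read_iops_per_spindle(220, 0.75))  # 880.0 at a 75% hit rate
```

The same arithmetic also shows why the benefit tails off for write-dominated workloads: a read cache only removes read visits to disk.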
If you don’t do this, then you may run into a situation I’ve heard of in a few cases, where our storage efficiencies allowed the user to put too many hot workloads on not enough spindles, and unfortunately this is probably the basis for the “anecdotal evidence” that allows the NetApp capacity/performance FUD to be perpetuated. That FUD is inaccurate because these situations have less to do with the intricacies of ONTAP and WAFL, and far more to do with systems that were originally sized for a workload of X having a workload of 3X placed on them, because there was still Tier-1 disk capacity available long after all the performance had been squeezed out of the spindles by other workloads.
Keeping your storage users happy means not only managing the available capacity, but also managing the available performance. More often than not you will run out of one before you run out of the other, and running an efficient IT infrastructure means balancing workloads between these two resources. Firstly, this means you have to spend at least some time measuring and monitoring both the capacity and the performance of your environment. Furthermore, you should set your system up so that it’s easy to migrate and rebalance workloads across other resource pools, or to add performance to your existing workloads non-disruptively, which can be done via technologies such as Storage DRS in vSphere 5, or ONTAP’s Data Motion and Virtual Storage Tiering features.
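The “run out of one before the other” point can be made concrete with a trivial headroom check. This is a hypothetical sketch — the metric names and thresholds are mine, not from any NetApp tool:

```python
def headroom_check(capacity_used, perf_used, cap_limit=0.85, perf_limit=0.80):
    """Flag whichever resource is past its watermark. Inputs are fractions
    of usable capacity consumed and of measured peak performance consumed;
    the limits are hypothetical planning thresholds, not vendor defaults."""
    alerts = []
    if capacity_used >= cap_limit:
        alerts.append("capacity: time to rebalance or add disk")
    if perf_used >= perf_limit:
        alerts.append("performance: time to migrate workloads or add cache/spindles")
    return alerts

# A system at 70% capacity but 90% of its measured performance ceiling:
print(headroom_check(0.70, 0.90))  # performance exhausts first here
```

Crude as it is, tracking both numbers per aggregate is exactly what stops the “sized for X, loaded with 3X” scenario described above.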
When it comes to measuring your environment so you can take action before problems arise, NetApp has a number of excellent tools to monitor the performance of your storage environment. Performance Advisor gives you visualisation and customised alerts and thresholds for the detailed built-in performance metrics available on every FAS array, and OnCommand Insight Balance provides deeper reporting and predictive analysis of your entire virtualised infrastructure, including non-NetApp hardware.
Whether you use NetApp’s tools or someone else’s, the important thing is that you use them, and take a little time out of your day to find out which metrics are important and what you should do when thresholds or high watermarks are breached. If you’re not sure about this for your NetApp environment, feel free to ask me here, or better still open up a question in the NetApp Communities, which has a broader constituency than this blog.
While I appreciate that it’s tempting to just fall back on old practices and over-engineer Tier-1 storage so that there is little or no possibility of running out of IOPS before you run out of capacity, this is almost always incredibly wasteful. In my experience it results in storage utilisation rates of less than 20%, and drives the cost/GB for “Tier-1” storage to unsustainable and economically unjustifiable heights. The time may come when storage administrators are given the luxury of doing this again, or you may be in one of those rare industries where cost is no object, but unless you’ve got that luxury, it’s time to brush up on your monitoring and storage workload rebalancing and optimisation skills. Doing more with less is what it’s all about.
As always, comments and contrary views are welcomed.