Archive

Archive for the ‘Performance’ Category

Unfinished Business

February 15, 2013 5 comments

Update 11/4/13 – Darius approved the comment in question and said sorry, so I’m all happy now. Given how long and verbose my comments are, I suspect he simply didn’t have time to read through it and figure out if it was safe to publish on an Oracle-hosted website. While I disagree with Darius on a lot of points, I also like a lot of what he writes, and I suspect he’d be an interesting guy to have a beer with 🙂

Forgive me, social media, for I have sinned; it’s been four months since my last blog post. There are a bunch of reasons for this, mostly that I’ve had some real-world stuff that’s been way more important than blogging, and I’ve limited my technical writing to posting comments on other blogs or answering questions on LinkedIn. Before I start writing again, there’s something I’d like to get off my chest. It _really_ bugs me when people edit out relevant comments. As a case in point, I was having what I believed to be a reasonably constructive conversation with Darius Zanganeh of Oracle on his blog, but for some reason he never approved the final comment I submitted on December 7th 2012, the text of which follows. If you’re interested, head over to his blog and read the entire post; I think it’s pretty good, and it showcases some impressive benchmark numbers Oracle has been able to achieve with scale-up commodity hardware. From my perspective it’s a great example of how a deeper analysis of a good benchmark demonstrates far more than the top-line numbers and $/IOPS … and if you know me, then you know I just LOVE a good debate over benchmark results, so I couldn’t resist commenting even though I really had better/more important things to do at the time.

Thanks Darius, it’s nice to know exactly what we’re comparing this to. I didn’t read the press releases, nor was I replying to that release; I was replying to your post, which was primarily a comparison to the FAS6240.

 If you do want to compare the 7420 to the 3270, then I’ll amend the figures once again: to get a 240% better result you used a box with

  1.  More than eleven times as many CPU cores
  2. More than one hundred and sixty times as much memory

 I really wish you’d also published Power Consumption figures too 🙂

 Regarding disk efficiency, we’ve already demonstrated our cache effectiveness. On a previous 3160 benchmark, we used a modest amount of extended cache and reduced the number of drives required by 75%. By way of comparison, to get about 1,080 IOPS per 15K spindle we implemented a cache that was 7.6% of the capacity of the fileset. The Oracle benchmark got about 956 IOPS per drive with a cache roughly 22% of the fileset size.

 The 3250 benchmark, on the other hand, wasn’t done to demonstrate cache efficiency; it was done to allow a comparison to the old 3270. It’s also worth noting that the 3250 is not a replacement for the 3270, it’s a replacement for the 3240 with around 70% more performance. Every benchmark we do is generally done to create a fairly specific proof point; in the case of the 3250 benchmark, it shows almost identical performance to the 3270 from a controller that sells at a much lower price point.

 We might pick one of our controllers and do a “here’s a set config and here’s the performance across every known benchmark” exercise the way Oracle seems to have done with the 7420. It might be kind of interesting, but I’m not sure what it would prove. Personally I’d like to see all the vendors, including NetApp, do way more benchmarking of all their models, but it’s a time-consuming and expensive thing to do, and as you’ve already demonstrated, it’s easy to draw some pretty odd conclusions from them. We’ll do more benchmarking in the future, you’ll just have to wait to see the results 🙂

 Going forward, I think non-scale-out benchmark configs will still be valid to demonstrate things like model replacement equivalency and cache efficiency, but I’ll say it again: if you’re after “my number is the biggest” hero-number bragging rights, scale-out is the only game in town. But scale-out isn’t just about hero numbers; for customers who need to scale rapidly and without disruption as their needs change, scale-out is an elegant and efficient solution, and they need to know they can do that predictably and reliably. That’s why you see benchmark series like the ones done by NetApp and Isilon. Even though scale-out NFS is a relatively small market, and Clustered ONTAP has a good presence in that market, scale-out unified storage has much broader appeal and is doing really well for us. I can’t disclose numbers, but based on the information I have, I wouldn’t be surprised if the number of new clusters sold since March exceeds the number of Oracle 7420s sold in the same period; either way I’m happy with the sales of Clustered ONTAP.

 As a technology blogger, it’s probably worth pointing out that stock charts are a REALLY poor proxy for technology comparisons, but if you want to go there, you should also look at things like P/E multiples (an indication of how fast the market expects you to grow) and market share numbers. If you’ve got Oracle’s storage revenue and profitability figures on hand to do a side-by-side comparison with NetApp’s published financial reports, post them up; personally I would LOVE to see a comparison. Then again, maybe your readers would prefer us to stick to talking about the merits of our technology and how it can help them solve the problems they’ve got.

 In closing, while this has been fun, I don’t have a lot more time to spend on this. I have expressed my concerns about the amount of hardware you had to throw at the solution to achieve your benchmark results, and the leeway that gives you to be competitive with street pricing, but as I said initially, your benchmark shows you can get a great scale-up number, and you’re willing to do that at a keen list price. Nobody can take that away from you; kudos to you and your team.

Other than giving me the opportunity to have my final say, my comment also underlines some major shifts in the industry that I’ll be blogging about over the next few months.

1. If you’re after “my number is the biggest” hero number bragging rights, scale out is the only game in town

2. Scale-out unified storage and Clustered ONTAP are going really well. I can’t publish numbers, but the uptake has surprised me, and the level of interest I’ve seen from the briefings I’ve been doing has been really good.

3. Efficiency matters. Getting good results by throwing boatloads of commodity hardware at a problem is one way of solving it, but it usually causes problems and shifts costs elsewhere in the infrastructure (power, cooling, rackspace, labour, software, compliance, etc.)

I’ll also be writing a fair amount about Flash and Storage Class Memory, and why some of the Flash benchmarks and claimed performance figures are, in my opinion, silly enough to border on deceptive. Until then, be prepared to dig deeper when people start claiming IOPS measured in the millions, and have fun 🙂

John Martin (life_no_borders)

Categories: Hyperbole, Performance

More Records ??

November 14, 2011 8 comments

–This has been revised based on some comments I’ve received since the original posting, check the comment thread if you’re interested what/why–

I came in this morning with an unusually clear diary and took the liberty of checking the newsfeeds for NetApp and EMC, which is when I came across an EMC press release entitled “EMC VNX SETS PERFORMANCE DENSITY RECORD WITH LUSTRE —SHOWCASES “NO COMPROMISE” HPC STORAGE“.

I’ve been doing some research on Lustre and HPC recently, and that claim surprised me more than a little, so I checked it out; maybe there’s a VNX sweet spot for HPC that I wasn’t expecting. The one thing that stood out straight away was: “EMC® is announcing that the EMC® VNX7500 has set a performance density record with Lustre—delivering read performance density of 2GB/sec per rack” (highlight mine)

In the first revision of this post I had some fun pointing out the lameness of that particular figure (e.g. “From my perspective, measured on a GB/sec per rack basis, 2GB/sec/rack is pretty lackluster”), but EMC aren’t stupid (or at least their engineers aren’t, though I’m not so sure about their PR agency at this point). It turns out this was one of those cases where EMC’s PR people didn’t quite listen to what the engineers were saying and left out a small but important word, and that word is “unit”. This becomes apparent if you take a look at the other material in the press release: “8 GB/s read and 5.3 GB/s write sustained performance, as measured by XDD benchmark performed on a 4U dual storage processor”. That gives us 2GB/sec per rack unit, which actually sounds kind of impressive.

So let’s dig a little deeper. What we’ve got is a 4U dual storage processor that gets some very good raw throughput numbers, about 1.5x, or 150%, faster on a “per controller” basis than the figures used in the E5400 press release I referenced earlier, so on that basis I think EMC has done a good job. But this is where the PR department starts stretching the truth again by leaving out some fairly crucial information, notably that the 4U controller behind the 2GB/sec per rack unit figure is a 2U VNX7500 SPE plus a 2U standby power supply, which is required when the 60-drive dense shelves are used exclusively (as they are in the VNX Lustre proof-of-concept configurations shown in EMC’s brochures), and that this configuration doesn’t include any of the rack units required for the actual storage. Either that, or it’s a 2U VNX7500 SPE with a 2U shelf and no standby power supply, which seems to be a mandatory component of a VNX solution, and I can’t quite believe that EMC would do that.

If we compare the VNX to the E5400, you’ll notice that the controllers and standby power supplies alone consume 4U of rack space without adding any capacity, whereas the E5400 controllers are much smaller and fit directly into a 2U or 4U disk shelf (or DAE, in EMC terminology), which means a 4U E5400-based solution is something you can actually use, as the required disk capacity is already there in the 4U enclosure.

Let’s go through some worked calculations to show how this plays out. In order to add capacity in the densest possible EMC configuration, you’d need to add an additional 4RU shelf with 60 drives in it. Net result: 8RU, 60 drives, and up to 8 GB/s read and 5.3 GB/s write (the press release doesn’t make it clear whether a VNX7500 can actually drive that much performance from only 60 drives; my suspicion is that it cannot, otherwise we would have seen something like that in the benchmark). Measured on a GB/s per RU basis this ends up as only 1 GB/sec per rack unit, not the 2 GB/sec per rack unit which I believe was the point of the “record setting” configuration. And just for kicks, as you add more storage to the solution that number goes down, as shown for the “dual VNX7500/single rack solution that can deliver up to 16GB/s sustained read performance”, which works out to about 0.4 GB/sec per rack unit. Using the configurations mentioned in EMC’s proof-of-concept documents you end up with around 0.66 GB/sec per rack unit, all of which are a lot less than the 2 GB/sec/RU claimed in the press release.

If you wanted the highest performing configuration within those 8RU using an E5400-based “DenseStak” Lustre solution, you’d put in a second E5400 unit with an additional 60 drives. Net result: 8RU, 120 drives, and 10 GB/sec read and 7 GB/sec write (and yes, we can prove that we can get this kind of performance from 120 drives). Measured on a GB/s per RU basis this ends up as 1.25 GB/sec per rack unit. That’s good, but it’s still not the magic number mentioned in the EMC press release; however, if you were to use a “FastStak” solution, those numbers would pretty much double (thanks to using 2RU disk shelves instead of 4RU disk shelves), which would give you a controller performance density of around 2.5 GB/sec per rack unit.
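To make the arithmetic above a little easier to follow, here is a minimal sketch in Python of the GB/sec-per-rack-unit calculation; the throughput and rack-unit figures are the ones quoted above and in the press release, and the helper function itself is just my own illustration.

```python
# Rough sketch: performance density (GB/s per rack unit) for the
# configurations discussed above. Throughput and rack-unit figures come
# from the press release and the worked examples in this post.

def density_gb_per_ru(read_gb_per_sec, rack_units):
    """Sustained read throughput divided by the rack units consumed."""
    return read_gb_per_sec / rack_units

configs = {
    # EMC's controller-only figure: 8 GB/s from a 4U "dual storage processor"
    "VNX7500 controllers only (4U)":        density_gb_per_ru(8.0, 4),
    # Add one 4U 60-drive dense shelf so the config can actually hold data
    "VNX7500 + one 60-drive shelf (8U)":    density_gb_per_ru(8.0, 8),
    # Dual VNX7500 single-rack solution quoted at 16 GB/s sustained read
    "Dual VNX7500, full rack (40U)":        density_gb_per_ru(16.0, 40),
    # Two E5400 DenseStak building blocks (controller + 60 drives each) in 8U
    "2x E5400 DenseStak (8U, 120 drives)":  density_gb_per_ru(10.0, 8),
}

for name, value in configs.items():
    print(f"{name}: {value:.2f} GB/sec per rack unit")
```

Running this prints 2.00, 1.00, 0.40 and 1.25 GB/sec per rack unit respectively, which is exactly where the “record setting” claim falls apart once you include the storage.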

Bottom line: for actual usable configurations a NetApp solution has much better performance density using the same measurements EMC used for their so-called “Record Setting” benchmark result.

In case you think I’m making these numbers up, they are confirmed in the NetApp whitepaper wp-7142 which says

The FastStak reference configuration uses the NetApp E5400 scalable
storage system as a building block. The NetApp E5400 system is designed
to support up to 24 2.5-inch SAS drives, in a 2U form factor.
Up to 20 of these building blocks can be contained in an industry-standard
40U rack. A fully loaded rack delivers performance of up to 100GB/sec
sustained disk read throughput, 70GB/sec sustained disk write throughput,
and 1,500,000 sustained IOPS.
According to IDC, the average supercomputer produces 44GB/sec,
so a single FastStak rack is more than fast enough to meet the I/O
throughput needs of many installations.

While I’ll grant that this result is achieved with more hardware, it should be remembered that the key to good HPC performance is, in part, the ability to efficiently throw hardware at a problem. From a storage point of view this means being able to scale performance with capacity. In this area the DenseStak and FastStak solutions are brilliantly matched to the requirements of, and the prevailing technology used in, High Performance Computing. Rather than measuring on a GB/sec per rack unit basis, I think a better measure would be “additional sequential performance per additional gigabyte”. Measured on a full rack basis, the NetApp E5400-based solution ends up at around 27MB/sec/GB for the DenseStak, or 54MB/sec/GB for the FastStak. In comparison, the fastest EMC solution referenced in the “record setting” press release comes in at about 10MB/sec of performance for every GB of provisioned capacity, or about 22MB/sec/GB for the configuration in the proof-of-concept brochure. Any way you slice this, the VNX just doesn’t end up looking like a particularly viable or competitive option.

The key here is that Lustre is designed as a scale-out architecture. The E5400 solution is built as a scale-out solution by using Lustre to aggregate the performance of multiple carefully matched E5400 controllers, whereas the VNX7500 used in the press release is a relatively poorly matched scale-up configuration which is being shoe-horned into a use case it wasn’t designed for.

In terms of performance per rack unit, or performance per GB, there simply isn’t much out there that comes close to an E5400-based Lustre solution, certainly not from EMC, as even Isilon, their best Big Data offering, falls way behind. The only significant questions that remain are how much do they cost to buy, and how much power do they consume?

I’ve seen the pricing for EMC’s top-of-the-range VNX7500, and it’s not cheap, it’s not even a little bit cheap, and the ultra-dense stuff shown in the proof-of-concept documents is even less not cheap than their normal stuff. Now, I’m not at liberty to discuss our pricing strategy in any detail on this blog, but I can say that in terms of “bang per buck” the E5400 solutions are very, very competitive, and the power impact of the E5400 controller inside a 60-drive dense shelf is pretty much negligible. I don’t have the specs for the power draw on a VNX7500 and its associated external power units, but I’m guessing it adds around as much as a shelf of disks, the power costs of which add up over the three-year lifecycle typically seen in these kinds of environments.

From my perspective the VNX7500 is a good general purpose box, and EMC’s engineers have every right to be proud of the work they’ve done on it, but positioning this as a “record setting” controller for performance-dense HPC workloads on Lustre is stretching the truth just a little too far for my liking. While the 10GB/sec/rack mentioned in the press release might sound like a lot to those of us who’ve spent our lives around transaction processing systems, for HPC, 10GB/sec/rack simply doesn’t cut it. I know this, the HPC community knows this, and I suspect most of the reputable HPC-focussed engineers at EMC also know this.

It’s a pity though that EMC’s PR department is spinning this for all they’re worth; I struggle with how they can possibly assert that they’ve set any kind of performance density record for any kind of realistic Lustre implementation, when the truth is that they are so very, very far behind. Maybe their PR department has been reading 1984, because claiming record-setting performance in this context requires some of the most bizarre Orwellian doublespeak I’ve seen in some time.

Breaking Records … Revisited

November 3, 2011 10 comments

So today I found out that we’d broken a few records of our own a few days ago, which was, at least from my perspective, accompanied by surprisingly little fanfare, with the associated press release coming out late last night. I’d like to say that the results speak for themselves, and to an extent they do. NetApp now holds the top two spots, and four out of the top five results, on the ranking ladder. If this were the Olympics most people would agree that this represents a position of PURE DOMINATION. High fives all round, and much chest beating and downing of well deserved delicious amber beverages.

So, apart from having the biggest number (which is nice), what did we prove?

Benchmarks are interesting to me because they are the almost perfect intersection of my interests in both technical storage performance  and marketing and messaging. From a technical viewpoint, a benchmark can be really useful, but it only provides a relatively small number of proof points, and extrapolating beyond those, or making generalised conclusions is rarely a good idea.

For example, when NetApp released their SPC-1 benchmarks a few years ago, it proved a number of things

1. That under heavy load which involved a large number of random writes, a NetApp array’s performance remained steady over time

2. That this could be done while taking multiple snapshots, and more importantly while deleting and retiring them while under heavy load

3. That this could be done with RAID-6 and with a greater capacity efficiency as measured by  RAW vs USED than any other submission

4. That this could be done at better levels of performance than an equivalently configured, commonly used “traditional array”, as exemplified by EMC’s CX3-40

5. That the copy on write performance of the snapshots on an EMC array sucked under heavy load (and by implication similar copy on write snapshot implementations on other vendors arrays)

That’s a pretty good list of things to prove, especially in the face of considerable unfounded misinformation being put out at the time, and which, surprisingly, is still bandied about despite the independently audited proof to the contrary. Having said that, this was not a “my number is the biggest” exercise, which generally proves nothing more than how much hardware you had available in your testing lab at the time.

A few months later we published another SPC-1 result which showed that we could pretty much double the numbers we’d achieved in the previous generation, at a lower price per IOP, with what was at the time a very competitive submission.

About two years after that we published yet another SPC-1 result with the direct replacement for the controller used in the previous test (3270 vs 3170). What this test didn’t do was show how much more load could be placed on the system; what it did do was show that we could give our customers more IOPS at lower latency with half the number of spindles. This was also the first time we’d submitted an SPC-1e result, which focusses on energy efficiency. It showed, quite dramatically, how effective our FlashCache technology was under a heavy random write workload. It’s interesting to compare that submission with the previous one for a number of reasons, but for the most part, this benchmark was about FlashCache effectiveness.

We did a number of other benchmarks, including SPEC SFS benchmarks, that also proved the remarkable effectiveness of the FlashCache technology, showing how it could make SATA drives perform better than Fibre Channel drives, or dramatically reduce the number of Fibre Channel drives required to service a given workload. There were a couple of other benchmarks done which I’ll grant were “hey, look at how fast our shiny new boxes can run”, but for the most part these were all done with configurations we’d reasonably expect a decent number of our customers to actually buy (no all-SSD configurations).

In the meantime EMC released some “Lab Queen” benchmarks. At first I thought that EMC were trying to prove just how fast their new X-Blades were at processing CIFS and NFS traffic. They did this by configuring the back-end storage system in such a ridiculously over-engineered way as to remove any possibility that it could cause a bottleneck, either that or EMC’s block storage devices are way slower than most people would assume. From an engineering perspective I think the guys in Hopkinton who created those X-Blades did a truly excellent job; almost 125,000 IOPS per X-Blade using 6 CPU cores is genuinely impressive to me, even if all they were doing was processing NFS/CIFS calls. You see, unlike the storage processors in a FAS or Isilon array, the X-Blade, much like the Network Processor in a SONAS system or an Oceanspace N8500, relies on a back-end block processing device to handle RAID, block checksums, write cache coherency and physical data movement to and from the disks, all of which is non-trivial work. What I find particularly interesting is that in all the benchmarks I looked at for these kinds of systems, the number of back-end block storage systems was usually double that of the front end, which suggests to me either that the load placed on the back-end systems by these benchmarks is higher than the load on the front end, or, more likely, that the front-end/back-end architecture is very sensitive to any latency on the back-end systems, which means the back-end systems get over-engineered for benchmarks. My guess, after seeing the “All Flash DMX” configuration, is that Celerra’s performance is very adversely affected by even slight increases in back-end latency, and that we start seeing some nasty manifestations of Little’s Law in these architectures under heavy load.

A little while later, after being present at a couple of EMC presentations (one at Cisco Live, the other at a SNIA event, where EMC staff were fully aware of my presence), it became clear to me exactly why EMC did these “my number is bigger than yours” benchmarks. The marketing staff at corporate created a slide that compared all of the current SPEC SFS results in a way that was accurate, compelling and completely misleading all at the same time, at least as far as the VNX portion goes. Part of this goes back to the way that vendors, NetApp included, use an availability group as the point of aggregation when reporting performance numbers; this is reasonably fair, as adding active/active or active/passive availability generally slows things down due to the two-phase-commit nature of write caching in modular storage environments. However, the configuration of the EMC VNX VG8 Gateway/EMC VNX5700 actually involves 5 separate availability groups (1x VG8 gateway system with 4+1 redundancy, and 4x VNX5700 with 1+1 redundancy). Presenting this as one aggregated performance number without any valid point of aggregation smacks of downright dishonesty to me. If NetApp had done the same thing then, using only 4 availability groups, we could have claimed over 760,000 IOPS by combining 4 of our existing 6240 configurations, but we didn’t, because frankly doing that is, in my opinion, on the other side of the fine line where marketing finesse falls off the precipice into the shadowy realm of deceptive practice.

Which brings me back to my original question: what did we prove with our most recent submissions? Well, three things come to mind.

1. That NetApp’s Data ONTAP 8.1 Cluster-Mode solution is real, and it performs brilliantly

2. It scales linearly as you add nodes (more so than the leading competitors)

3. That scaling with 24 big nodes gives you better performance and better efficiency than scaling with hundreds of smaller nodes (at least for the SPEC benchmark)

This is a valid configuration using a single vserver as a point of aggregation across the cluster, and trust me, this is only the beginning.

As always, comments and criticism are welcome.

Regards

John

How does capacity utilisation affect performance?

September 10, 2011 7 comments

A couple of days ago I saw an email asking “what is the recommendation for maximum capacity utilization that will not cause performance degradation”. On the one hand this kind of question annoys me, because for the most part it’s borne out of some of the usual FUD which gets thrown at NetApp on a regular basis; on the other, even though correctly engineering storage for consistent performance rarely, if ever, boils down to any single metric, understanding capacity utilisation and its impact on performance is an important aspect of storage design.

Firstly, for the record, I’d like to reiterate that the performance of every storage technology I’m aware of that is based on spinning disks decreases in proportion to the amount of capacity consumed.

With that out of the way, I have to say that as usual, the answer to the question of how does capacity utilisation affect performance is, “it depends”, but for the most part, when this question is asked, it’s usually asked about high performance write intensive applications like VDI, and some kinds of online transaction processing, and email systems.

If you’re looking at that kind of workload, then you can always check out good old TR-3647, which talks specifically about write-intensive, high-performance workloads, where it says

The Data ONTAP data layout engine, WAFL®, optimizes writes to disk to improve system performance and disk bandwidth utilization. WAFL optimization uses a small amount of free or reserve space within the aggregate. For write-intensive, high-performance workloads we recommend leaving available approximately 10% of the usable space for this optimization process. This space not only ensures high-performance writes but also functions as a buffer against unexpected demands of free space for applications that burst writes to disk

I’ve seen other benchmarks using synthetic workloads where a knee in the performance curve begins to appear at between 98% and 100% of the usable capacity after the WAFL reserve is taken away. I’ve also seen performance issues when people completely fill all the available space and then hit it with lots of small random overwrites (especially misaligned small random overwrites). This is not unique to WAFL, which is why it’s generally a bad idea to fill up all the space in any data structure which is subjected to heavy random write workloads.

Having said that, for the vast majority of workloads you’ll get more IOPS per spindle out of a NetApp array at all capacity points than you will out of any similarly priced/configured box from another vendor.

Leaving the FUD aside (the complete rebuttal of which requires a fairly deep understanding of ONTAP’s performance architecture), when considering capacity and its effect on performance on a NetApp FAS array it’s worth keeping the following points in mind.

  1. For any given workload and array type, you’re only ever going to get a fairly limited number of transactions per 15K RPM disk, usually less than 250
  2. Array performance is usually determined by how many disks you can throw at the workload
  3. Most array vendors bring more spindles to the workload by using RAID-10, which uses twice the number of disks for the same capacity; NetApp uses RAID-DP, which does not automatically double the spindle count
  4. In most benchmarks (check out SPC-1), NetApp uses all but 10% of the available space (in line with TR-3647), which allows the user to use approximately 60% of the RAW capacity while still achieving the same kinds of IOPS/drive that most other vendors are only able to achieve using 30% of the RAW capacity, i.e. at the same performance per drive we offer around 10% more usable capacity than the other vendors could theoretically attain using RAID-10.

The bottom line is that, even without dedupe or thin provisioning or anything else, you can store twice as much information in a FAS array for the same level of performance as most competing solutions using RAID-10.
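As a rough sketch of the arithmetic behind points 3 and 4 above, assume a hypothetical 100-spindle system and the rule-of-thumb figures already quoted (roughly 250 IOPS per 15K drive, ~60% of RAW usable for an SPC-1-style RAID-DP config versus ~30% for a short-stroked RAID-10 config); the drive size is purely an illustrative assumption.

```python
# Illustrative only: same spindle count, same IOPS ceiling per drive,
# but very different usable capacity depending on the data layout.

DRIVES = 100            # hypothetical spindle count
DRIVE_RAW_GB = 450      # hypothetical 15K drive size
IOPS_PER_DRIVE = 250    # upper bound quoted in point 1 above

usable_fraction = {
    "RAID-DP, SPC-1 style (60% of RAW)":   0.60,
    "RAID-10, short-stroked (30% of RAW)": 0.30,
}

total_iops = DRIVES * IOPS_PER_DRIVE
for layout, fraction in usable_fraction.items():
    usable_gb = DRIVES * DRIVE_RAW_GB * fraction
    print(f"{layout}: {total_iops} IOPS, {usable_gb:,.0f} GB usable, "
          f"{total_iops / usable_gb:.2f} IOPS/GB")
```

Same IOPS, twice the usable capacity, and therefore half the IOPS/GB density, which is exactly the trade-off discussed in the next paragraph.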

While that is true, it’s worth mentioning that it does have one drawback. While the IOPS/spindle is more or less the same, the IOPS density measured in IOPS/GB on the NetApp SPC-1 results is about half that of the competing solutions (same IOPS, twice as much data = half the density). While that is actually harder to do, because you have a lower cache:data ratio, if you have an application that requires very dense IOPS/GB (like some VDI deployments, for example), then you might not be able to allocate all of that extra capacity to that workload. This, in my view, gives you three choices.

  1. Don’t use the extra capacity, just leave it as unused freespace in the aggregate which will make it easier to optimise writes
  2. Use that extra capacity for lower tier workloads such as storing snapshots or a mirror destination, or archives etc, and set those workloads to a low priority using FlexShare
  3. Put in a FlashCache card which will double the effective number of IOPS (depending on workload of course) per spindle, which is less expensive and resource consuming than doubling the number of disks

If you don’t do this, then you may run into a situation I’ve heard of in a few cases where our storage efficiencies allowed the user to put too many hot workloads on not enough spindles, and unfortunately this is probably the basis for the “anecdotal evidence” that allows the NetApp capacity/performance FUD to be perpetuated. This is inaccurate because it has less to do with the intricacies of ONTAP and WAFL, and far more to do with systems that were originally sized for a workload of X having a workload of 3X placed on them, because there was still Tier-1 disk capacity available long after all the performance had been squeezed out of the spindles by other workloads.

Keeping your storage users happy means not only managing the available capacity, but also managing the available performance. More often than not you will run out of one before you run out of the other, and running an efficient IT infrastructure means balancing workloads between these two resources. Firstly, this means you have to spend at least some time measuring and monitoring both the capacity and the performance of your environment. Furthermore, you should set your system up so it’s easy to migrate and rebalance workloads across other resource pools, or be able to easily add performance to your existing workloads non-disruptively, which can be done via technologies such as Storage DRS in vSphere 5, or ONTAP’s Data Motion and virtual storage tiering features.

When it comes to measuring your environment so you can take action before the problems arise, NetApp has a number of excellent tools to monitor the performance of your storage environment. Performance Advisor gives you visualization and customised alerts and thresholds for the detailed inbuilt performance metrics available on every FAS Array, and OnCommand Insight Balance provides deeper reporting and predictive analysis of your entire virtualised infrastructure including non-NetApp hardware.

Whether you use NetApp’s tools or someone else’s, the important thing is that you use them, and take a little time out of your day to find out which metrics are important and what you should do when thresholds or high watermarks are breached. If you’re not sure about this for your NetApp environment, feel free to ask me here, or better still open up a question in the NetApp Communities, which has a broader constituency than this blog.

While I appreciate that it’s tempting to just fall back on old practices and over-engineer Tier-1 storage so that there is little or no possibility of running out of IOPS before you run out of capacity, this is almost always incredibly wasteful and has, in my experience, resulted in storage utilisation rates of less than 20%, driving the cost/GB for “Tier-1” storage to unsustainable and economically unjustifiable heights. The time may come when storage administrators are given the luxury of doing this again, or you may be in one of those rare industries where cost is no object, but unless you’ve got that luxury, it’s time to brush up on your monitoring and storage workload rebalancing and optimisation skills. Doing more with less is what it’s all about.

As always, comments and contrary views are welcomed.

Categories: Performance, Value

Data Storage for VDI – Part 9 – Capex and SAN vs DAS

July 22, 2010 4 comments

I’d intended to write about megacaches in both the previous post and this one, but interesting things keep popping up that need to be dealt with first. This time it’s an article at InformationWeek, With VDI, Local Disk Is A Thing Of The Past. In it, Elias Khnaser outlines the same argument that I was going to make after I’d dealt with the technical details of how NetApp optimises VDI deployments.

I still plan to expand on this with posts on Megacaches, single instancing technologies, and broker integration, but Elias’ post was so well done that I thought it deserved some immediate attention.

If you haven’t done so already, check out the article before you read more here, because the only point I want to make in this uncharacteristically small post is the following:

The capital expenditure for storage in a VDI deployment based on NetApp is lower than one based on direct attached storage.

This is based on the following

Solution 1: VDI Using Local Storage – Cost

$614,400

Solution 2 : VDI Using HDS Midrange SAN – Cost

$860,800, with array costs of approx $400,000

Solution 3 : VDI Using FAS 2040 – Cost

860,000 – 400,000 + (2000 * $50) = $560,000

You save around $53,600 (about 9% overall) compared to DAS and still get the benefits of shared storage. That’s money you can spend on more advanced broker software, or possibly a trip to the Bahamas.

Now if you’re wondering where I got my figures from, I did the same sizing exercise I did in Part 7 of this series, but using 12 IOPS per user and a 33:63 R:W ratio. I then came up with a configuration and asked one of my colleagues for a street price. The figure came out to around US$50 per desktop user for an NFS deployment, which is in line with what NetApp has been saying about our costs for VDI deployments for some time now.
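For what it’s worth, here is the same back-of-the-envelope CapEx arithmetic in code form; every figure is taken from the costs quoted above, and the variable names are of course my own.

```python
# Back-of-the-envelope CapEx comparison using the figures quoted above.
USERS = 2000
DAS_TOTAL = 614_400            # Solution 1: VDI using local storage
HDS_TOTAL = 860_800            # Solution 2: VDI using HDS midrange SAN
HDS_ARRAY = 400_000            # approximate array portion of Solution 2
NETAPP_PER_DESKTOP = 50        # approximate NetApp street price per desktop (NFS)

# Solution 3: swap the HDS array out for a FAS2040 priced per desktop
netapp_total = HDS_TOTAL - HDS_ARRAY + USERS * NETAPP_PER_DESKTOP
saving_vs_das = DAS_TOTAL - netapp_total

print(f"NetApp-based solution: ${netapp_total:,}")            # $560,800
print(f"Saving vs DAS: ${saving_vs_das:,} "
      f"({saving_vs_das / DAS_TOTAL:.0%} of the DAS cost)")   # ~$53,600, ~9%
```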

Even factoring in things like professional services, additional network infrastructure, training, etc., you’d still be better off from an up-front expenditure point of view using NetApp than you would be with internal disks.

Given the additional OpEx benefits, I wonder why anyone would even consider using DAS, or for that matter another vendor’s SAN.

Data Storage for VDI – Part 8 – Misalignment

July 21, 2010 8 comments

If you follow NetApp’s best practice documentation, all of the stuff I talked about works as well as, if not better than, outlined at the end of my previous post. Having said that, it’s worth repeating that there are some workloads that are very difficult to optimize, and some configurations that don’t allow the optimization algorithms to work, the most prevalent of which is misaligned I/O.

If you follow best practice guidelines (and we all do that now, don’t we …) then you’ll be intimately familiar with NetApp’s Best Practices for File System Alignment in Virtual Environments. If, on the other hand, you’re like pretty much everyone that went to the VMware course I attended, then you may be of the opinion that it doesn’t make that much of a difference. I suspect that if I asked your opinion about whether you should go to the effort of ensuring that your guest OS partitions are aligned, your response would probably fall into one of the following categories

  1. Unnecessary
  2. Not recommended by VMware (they do recommend it, but I’ve heard people say this in the past)
  3. Something I should do when I can arrange some downtime during the Christmas holidays
  4. What you talking about Willis ?

If there is one thing I’d like you to take away from this post, it is the incredible importance of aligning your guest operating systems. After the impact of old school backups and virus scans, it’s probably the leading cause of poor performance at the storage layer. This is particularly true if you have minimized the number of spindles in your environment by using single instancing technologies such as FAS deduplication.

Of course this being my blog, I will now go into painful detail to show why it’s so important, if you’re not interested or have already ensured that everything is perfectly aligned, stop reading and wait until I post my next blog entry 🙂

Every disk reads and writes its data in fixed block sizes, usually either 512 or 520 bytes, the latter effectively storing 512 bytes of user data and 8 bytes of checksum data. Furthermore, the storage arrays I’ve worked with that get a decent number of IOPS/spindle all use some multiple of these 512 bytes of user data as the smallest chunk stored in cache, usually 4KiB or some multiple thereof. The arrays then read and write data to and from the disks in these chunks, along with the appropriate checksum information. This works well because most applications and filesystems on LUNs / VMDKs / VHDs etc. also write in 4K chunks. In a well configured environment, the only time you’ll have a read or, more importantly, a write request that is not some multiple of 4K is in NAS workloads, where overwrite requests can happen across a range of bytes rather than a range of blocks, but even then it’s a rare occurrence.

Misaligned I/O, however, causes a write from a guest to partially write to two different blocks, which is explained with pretty diagrams in Best Practices for File System Alignment in Virtual Environments; however, that document doesn’t quite stress how much of a performance impact this can have when compared to nicely aligned workloads, so I’ll spend a bit of time on it here.

When you completely overwrite a block in its entirety, an array’s job is trivially easy:

  1. Accept the block from the client and put it in one of the write cache’s block buffers
  2. Seek to the block you’re going to write to
  3. Write the block

Net result = 1 seek + 1 logical write operation (plus any RAID overheads)

However when you send an unaligned block, things get much harder for the array

  1. Accept a block’s worth of data from the client, put some of it in one of the block buffers in the array’s write cache, and put the rest of it into the adjacent block buffer. Neither of these block buffers will be completely full, however, which is bad.
  2. If you didn’t already have the blocks that are going to be partially overwritten in the read cache, then
    1. Seek to where the two blocks start
    2. read the 2 blocks from the disk to get the parts you don’t know about
    3. Merge the information you just read from disk / read cache with the blocks worth of data you received from the client
    4. Overwrite the two blocks with the data you just merged together

Net result = 1 seek + some additional CPU + double write cache consumption + 2 additional 4K reads and one additional 4K write (plus any RAID overheads) + inefficient space consumption.
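To make the cost concrete, here is a small sketch of my own (not anything out of Data ONTAP) that checks whether a partition’s starting offset is 4K-aligned and counts how many back-end 4K blocks a guest write touches:

```python
BLOCK = 4096  # back-end block size in bytes

def is_aligned(partition_offset_bytes: int) -> bool:
    """A partition is 4K-aligned if its starting offset is a multiple of 4KiB."""
    return partition_offset_bytes % BLOCK == 0

def backend_blocks_touched(write_offset: int, write_len: int) -> int:
    """Number of 4K back-end blocks a front-end write overlaps."""
    first = write_offset // BLOCK
    last = (write_offset + write_len - 1) // BLOCK
    return last - first + 1

# Classic misaligned case: old MBR partitions started at sector 63 (63 * 512 bytes)
legacy_offset = 63 * 512
print(is_aligned(legacy_offset))                          # False
# A 4K guest write on that partition straddles two back-end blocks,
# forcing the read-modify-write path described above.
print(backend_blocks_touched(legacy_offset, 4096))        # 2
# The same write on a 1MiB-aligned partition touches exactly one block.
print(backend_blocks_touched(1024 * 1024, 4096))          # 1
```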

The problem as you’ll see isn’t so much a misaligned write as such, but the partial block writes that it generates. In well configured “Block” environments (FC / iSCSI), you simply won’t ever see a partial write, however in “File” environments (CIFS/NFS) environments, partial writes are a relatively small, but expected part of many workloads. Because FAS arrays are truly unified for both block and file, Data ONTAP has some sophisticated methods of detecting partial writes, holding them in cache, combining them where possible, and committing them to disk as efficiently as possible. Even so, partial writes are really hard to optimize well.

There are many clever ways of optimizing caching algorithms to mitigate the impact of partial writes, and NetApp combines a number of these in ways that I’m not at liberty to disclose outside of NetApp. We developed these optimisations because a certain amount of bad partial-write behavior is expected from workloads targeted at a FAS controller, and much like it is with our kids at home, tolerating a certain amount of “less than wonderful” behavior without making a fuss allows the household to run harmoniously. But this tolerance has its limits, and after a point it needs to be pulled into line. While Data ONTAP can’t tell a badly behaved application to sit quietly in the corner and consider how its behavior is affecting others, it can mitigate the impact of partial writes on well-behaved applications.

Unfortunately, environments that do wholesale P2V migrations of WinXP desktops without going through an alignment exercise will almost certainly generate a large number of misaligned writes. While Data ONTAP does what it can to maintain the highest performance under those circumstances, these misaligned writes are much harder to optimise, which in turn will probably have a non-trivial impact on overall performance by multiplying the number of I/Os required to meet the workload requirements.

If you do have lots of unaligned I/O in your environment, you’re faced with one of four options.

  1. Use the tools provided by NetApp and others like VisionCore to help you bring things back into alignment
  2. Put in larger caches. Larger caches, especially megacaches such as FlashCache, mean the data needed to complete the partial write will already be in memory, or at least on a medium that allows sub-millisecond read times for the data required to complete partial writes.
  3. Put in more disks, if you distribute the load amongst more spindles, then the read latency imposed by partial writes will be reduced
  4. Live with the reduced performance and unhappy users until your next major VDI refresh

Of course the best option is to avoid misaligned I/O in the first place by following Best Practices for File System Alignment in Virtual Environments. This really is one friendly manual that is worth following regardless of whether you use NetApp storage or something else.

To summarise – misaligned I/O and partial writes are evil and they must be stopped.

Data Storage for VDI – Part 7 – 1000 heavy users on 18 spindles

July 19, 2010 12 comments

The nice thing from my point of view is that because VDI’s steady-state performance is characterized by a high percentage of random writes and high concurrency, the performance architecture of Data ONTAP has been well optimized for VDI for quite some time, in fact since before VDI was really a focus for anyone. As my dad once said to me, “Sometimes it’s better to be lucky than it is to be good” 🙂

As proof of this, I used our internal VDI sizing tools for

  • 1000 users
  • 50% Read, 50% Writes
  • 10 IOPS per user
  • 10GB single instanced (using FAS Deduplication) Operating system image
  • 0.5 GB RAM per Guest (used to factor the vSwap requirements)
  • 1 GB of Unique data per user (deliberately low to keep the focus on the number of disks required for IOPS)
  • 20ms read response times
  • WAFL filesystem 90% Full

The sizer came back needing only 24 FC disks to satisfy the IOPS requirement on our entry-level 2040 controller, without needing any form of SSD or extra accelerators.

That works out to over 400 IOPS per 15K disk, or about 40 users per 15K disk, four times better than the 10 users per 15K RAID-DP spindle predicted by Ruben’s model. For the 20% read / 80% write example, the numbers are even better, with only 18 disks on the FAS2040, which is 555 IOPS or 55 users per disk vs. the 9 predicted by Ruben’s model (about six times better than predicted). To see how this compares to other SAN arrays, check out the following table, which outlines the expected efficiencies from RAID 5, 10, and 6 for VDI workloads.

                          Read IEF   Write IEF   Overall efficiency at 30:70 R/W   Overall efficiency at 50:50 R/W
RAID-5                    100%       25%         47.5%                             62.5%
RAID-10                   100%       50%         65%                               75%
RAID-6                    100%       17%         41.9%                             58.5%
RAID-DP + WAFL 90% full   100%       200-350%    230%                              170%

The really interesting thing about these results is that as the workload becomes more dominated by write traffic, RAID-DP + WAFL gets even greater efficiencies. At a 50:50 workload the write IEF is around 240%, while at a 30:70 workload the write IEF is close to 290%. This happens because random reads inevitably cause more disk seeks, whereas writes are pretty much always sequential.
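The overall efficiency figures in the table are simply the read and write IEFs weighted by the read:write mix. Here is a quick sketch of that calculation, using the table’s own numbers (and the mix-dependent write IEF for RAID-DP mentioned above):

```python
# Overall efficiency = read_fraction * read_IEF + write_fraction * write_IEF
def overall_efficiency(read_fraction, read_ief, write_ief):
    return read_fraction * read_ief + (1.0 - read_fraction) * write_ief

# Layouts with a fixed write IEF, from the table above (read IEF is 100%).
for name, write_ief in [("RAID-5", 0.25), ("RAID-10", 0.50), ("RAID-6", 0.17)]:
    print(f"{name}: 30:70 -> {overall_efficiency(0.30, 1.0, write_ief):.1%}, "
          f"50:50 -> {overall_efficiency(0.50, 1.0, write_ief):.1%}")

# RAID-DP + WAFL: the write IEF itself varies with the mix,
# roughly 290% at 30:70 and 240% at 50:50 as noted above.
print(f"RAID-DP + WAFL: 30:70 -> {overall_efficiency(0.30, 1.0, 2.90):.1%}, "
      f"50:50 -> {overall_efficiency(0.50, 1.0, 2.40):.1%}")
```

Running this reproduces the table: 47.5%/62.5% for RAID-5, 65%/75% for RAID-10, 41.9%/58.5% for RAID-6, and roughly 230%/170% for RAID-DP + WAFL.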

Don’t get me wrong, I think Ruben did outstanding work which I’ve learned a lot from, but when it comes to sizing NetApp storage by I/O I think he was working with some inaccurate or outdated data that led him to some erroneous conclusions, which I hope I’ve been able to clarify in this blog.

In my next post, I hope to cover how megacaching techniques such as NetApp’s FlashCache can be used in VDI environments and a few specific configuration tweaks that can be used on a NetApp array to improve the performance of your VDI environment.

Data Storage for VDI – Part 6 – Data ONTAP Improving Read Performance

July 19, 2010 2 comments

WAFL, Metadata Reads and SRARW

This brings us to reads. WAFL allows us to excel at writes, but what about reads? I’ve already stated that compared to other RAID configurations RAID-DP is about 13% worse for reads, so what does WAFL do to offset that? Well, to start with, it can actually make things worse (and yes, I still work for NetApp). Why and how does this happen? Well, remember that WAFL is a fine-grained storage virtualisation layer: we map, and can remap, the physical whereabouts of every single 4K block. In order to find the block you want to read, the array needs to consult this map. Old-school traditional SAN array controllers don’t need to do this; they either use an algorithm like base+offset to find the requested block, or they map larger chunks (e.g. 250KB) and pin a much smaller map inside the array’s cache. Because the WAFL map (the metadata) is relatively large, historically only a portion of it stays in memory cache. When the active working set is very large, WAFL will probably need to do two back-end disk reads for a majority of the front-end reads: one for metadata and one for the data.

The combination of losing read spindles to dedicated parity drives, and then losing more I/O bandwidth to metadata reads, can put Data ONTAP at a disadvantage for workloads with a high percentage of random reads. But wait, there’s more! There’s one more issue which is occasionally thrown at us by our competitors. Sometimes known as “Sequential Reads After Random Writes” or SRARW, this can be a problem for WAFL (and, I’d imagine, other similar data layout engines such as ZFS that use mapping rather than algorithms to locate data). The reason for this is that turning random writes into sequential writes can mean that sequential reads get turned into random reads, and that has a fairly negative impact on sequential read performance.

Now before I go into this in detail, keep in mind that for the vast majority of VDI deployments this is not a problem. The only time people really tend to notice this is during old school bulk data copy style backups and database integrity checks. Having said that there are a number of things NetApp does to mitigate the SRARW effect.

WAFL and Temporal Locality of Reference

Firstly, another way of looking at things is that what WAFL does is exchange “spatial locality of reference” for “temporal locality of reference”. For example, when you write a file into a filesystem like NTFS, or update a database record, you will typically update the MFT or indexes at the same time. Regardless of the apparent logical layout, where the MFT or indexes are stored in different regions within the same LUN, WAFL will place all of these updates close to each other on the disk/disks. Similarly, in a VDI deployment a write of a single file to a fragmented Windows filesystem might logically be written to multiple locations on its disk, but the pieces will all be stored close together on the disk in the NetApp array. In VDI and OLTP environments this is a good thing, because in order to access a file or record, you first access the MFT or index, which then points to the data you’re after. Guess what! Because all parts of the file and its metadata are laid out close to each other, there is a very good chance that you won’t need to do a seek+settle to get the heads to the data portions, resulting in much improved disk reads. In effect, this allows a FAS array to do inline physical defragmentation of guest filesystems.

Readsets

Data ONTAP is able to combine this temporal locality of reference with a little-publicized feature called a read-set. A read-set is a record of which sets of data are regularly read together, and it is stored along with the rest of the metadata in WAFL. This provides a level of semantic knowledge about the underlying data that the readahead algorithm uses to ensure that, in most cases, the data has already been requested and read into read cache before the VDI client sends down its next read request.

Reallocation

Secondly (and this really applies more to database environments than it does to VDI, but I’ll include it here for the sake of completeness), there are techniques which completely address the SRARW issue.

1. WAR (woah woah woah, what is it good for) …

As it turns out, this kind of WAR is good for quite a few workload types, because it stands for “write after read”. This feature has been available since Data ONTAP 7.3.1, and when it is enabled for a volume, it senses when you’ve requested a bunch of data that is logically sequential, figures out whether it had to do an excessive number of random reads at the back end, and if so, finds a nice clean area to write this stuff out to, so that the next time you do the same logically sequential read, it is laid out sequentially in a physical sense. I’ve done some tests in hostile environments (a month of running 10+ hours every day of completely random reads, followed by a complete sequential integrity check of an Exchange 2007 database), and the WAR option increased the sequential scan time by about 15%. A subsequent scan of the same database took 40% less time (and it probably would have been faster if I hadn’t hit a client CPU bottleneck during the integrity check).

2. Regular reallocation scans.

These are recommended as a default best practice for database LUNs in the Data ONTAP administration guide, though it seems that nobody actually reads this friendly manual, so it still doesn’t seem to be common practice. These scans execute every night, and run a complete reallocation of only the “fragmented” blocks. Based on some experiments I did on a 3040 with 12 spindles, this works at about 100+ GB per hour, so for a 4TB database with a 2% daily change rate, a nightly reallocate would take about an hour. This might seem like an imposition, but if you’ve cut your nightly backup window by 8 hours due to cool snapshots and SnapVault/SnapMirror, then adding back an hour to optimise the performance isn’t a big ask. This also creates some nice clean free-space areas, which keeps the write performance nice and snappy. As a side effect, regular reallocations mean that any disks added to the aggregate will quickly get hot data evenly spread across them thereby improving read performance even more.
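As a quick sanity check on that sizing, the arithmetic is simply the amount of changed data divided by the reallocation rate; the figures below are the ones quoted above.

```python
# Estimated nightly reallocation window from the figures above.
db_size_gb = 4 * 1024            # 4TB database
daily_change_rate = 0.02         # 2% daily change
realloc_rate_gb_per_hour = 100   # observed on a 3040 with 12 spindles

hours = db_size_gb * daily_change_rate / realloc_rate_gb_per_hour
print(f"~{hours:.1f} hours per night")   # ~0.8 hours, i.e. about an hour
```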

It should be noted that these two techniques don’t work with deduplicated volumes. If you believe you will be running a lot of single-threaded sequential reads in your VDI environment, you should consider placing those workloads on a volume which does not have deduplication turned on, and possibly use one of the other single-instancing technologies, such as VMware View, in combination with one of the techniques described above.

As I said before, I’ve included those two points for the sake of completeness, but for VDI environments where the I/O profile is almost completely random, WAFLs default behavior of a data layout based on temporal locality of reference will give you better performance than a layout based on spatial locality of reference as used by traditional arrays.

That’s soooo random

At this stage it might be worth noting that random reads and writes aren’t truly random, they are merely “non-sequential”; there are few truly random things outside of the world of mathematics, storage benchmarks, and quantum physics. It is this that allows the fuzzy logic in Data ONTAP’s read-ahead algorithms to do their remarkable work. NetApp spent a lot of time and brainpower creating and fine-tuning these, and I’m confident that they are unsurpassed by any other storage array. This is where I’d like to extensively quote another section of Ruben’s excellent article, with some additions of my own.

The NTFS filesystem on a Windows client uses 4 kB blocks by default. Luckily, Windows tries to optimize disk requests to some extent by grouping block requests together if, from a file perspective, they are contiguous [which the readset feature in Data ONTAP is built to recognise]. That means it is important that files are defragged [except in Data ONTAP, where WAFL has already stored these logically fragmented files physically close to each other thanks to the magic of temporal locality] ….. Therefore it is best practice to disable defragging completely once the master image is complete [which might be a concern without the performance optimisations built into Data ONTAP]. The same goes for prefetching. Prefetching is a process that puts all files read more frequently in a special cache directory in Windows, so that the reading of these files becomes one contiguous reading stream, minimizing IO and maximizing throughput. But because IOs from a large number of clients makes it totally random from a storage point of view, prefetching files no longer matters and the prefetching process only adds to the IOs once again. So prefetching should also be completely disabled. [However, Data ONTAP effectively and transparently restores this performance enhancement thanks to the way readsets work with Data ONTAP’s prefetch/readahead capabilities.] If the storage is de-duplicating the disks, moving files around inside those disks will greatly disturb the effectiveness of de-duplication. That is yet another reason to disable features like prefetching and defragging. [Not to mention that, for the most part, with Data ONTAP it’s completely unnecessary.]

Aggregating Disk IOPs

Another thing that helps NetApp is the concept of aggregates, which makes it a lot easier to recruit the collective IOPS of all the spindles in an array rather than having IOPS trapped and wasted within small RAID groups; in principle, it’s similar to the closely related concept of wide striping. It also globalises the pool of free blocks, which makes the write allocator’s job much easier. This combination of readsets, hyper-efficient writes and the ability to recruit a lot of spindles to the read workload means that for most real-world workloads, NetApp is as fast as, if not faster than, equivalently configured arrays from other vendors, which was nicely shown in independently audited industry standard SPC-1 benchmarks.

What you might have heard …

For me though, one of the main proofs of the effectiveness of these techniques is that pretty much every “benchmark” run on our kit by our competitors tries to ensure that none of these features are used. I’ve seen things like artificial 100% completely random workloads to ensure that readsets can’t be used, unrealistically large working sets to ensure the maximum number of metadata reads, and really small aggregates, misaligned I/O and other non-best-practice configurations to make the write allocator’s job as hard as possible. It’s said that all is fair in love and IT marketing, but the shenanigans that some vendors get up to in order to discredit Data ONTAP’s performance architecture often go beyond the bounds of professional conduct.

Moving right along

OK, now that I have that off my chest, I can move on to the next part of my blog, Data Storage for VDI – Part 7 – 1000 heavy users on 18 spindles, where I’ll show how Data ONTAP can help reduce the storage costs for VDI to the point where you can afford to use world-class shared storage without the availability and manageability compromises involved with DAS and other forms of cut-price storage.

Data Storage for VDI – Part 5 – RAID-DP + WAFL The ultimate write accelerator

July 19, 2010 2 comments

A lot has been written about WAFL, but for the most part I still think it’s widely misunderstood, even by some folks within NetApp. The alignment between the kind of fine grained storage virtualisation you get out of WAFL and other forms of compute and network virtualisation is sometimes hard to appreciate until you’ve had a chance to really get into the guts of it.

Firstly, WAFL means we can write any block to any location, and we use that capability to turn random writes at the front end into sequential writes at the back end. When we have a brand new system, we are able to do full stripe writes to the underlying RAID groups, and the write coalescing works with perfect efficiency without needing to use much, if any, write cache. If we have a RAID group consisting of 14 data disks and 2 parity disks (the default setting), then a simple way of looking at our write efficiency starts out like this – 14 writes come in, 16 writes go to the back end, 14:16 or 87.5% efficiency, something that makes RAID-10 look a little sick in comparison.
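If you prefer to see that arithmetic written down, here’s a trivial Python sketch of the empty-system case. It’s purely my own illustration of the ratio above, nothing to do with how Data ONTAP is actually implemented:

```python
# A quick sanity check of the full-stripe write arithmetic above, assuming
# a 14 data + 2 parity RAID-DP group versus a simple mirrored (RAID-10) layout.
def full_stripe_write_efficiency(data_disks: int, parity_disks: int) -> float:
    """Front-end writes divided by back-end writes for one full stripe."""
    return data_disks / (data_disks + parity_disks)

def raid10_write_efficiency() -> float:
    """RAID-10 mirrors every write: 1 front-end write costs 2 back-end writes."""
    return 1 / 2

print(f"RAID-DP full stripe (14d + 2p): {full_stripe_write_efficiency(14, 2):.1%}")  # 87.5%
print(f"RAID-10                       : {raid10_write_efficiency():.1%}")            # 50.0%
```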

Of course, the one thing that our competitors seem almost duty bound to point out is that as WAFL’s capacity fills, the ability to do full stripe writes diminishes, which is true, but only up to a point. The following graph shows what would happen to this write efficiency advantage as WAFL fills up, assuming that the data is uniformly and randomly distributed across the entire RAID set and that we had no other way of optimizing performance.

The nice thing about this graph is that it is simple, it’s reasonably intuitive, and it shows our random write performance staying nicely above RAID-10 until we reach about 60% of the available capacity of a RAID-10 array with the same number of spindles. Now, before the likes of @HPstorageguy have a field day, I’d like to point out that this graph/model, like many other simple and intuitive things, such as the idea that the world is flat and the sun revolves around us, is wrong, or at least misleading. The main reason it is misleading is that it underestimates Data ONTAP’s ability to exploit data usage patterns that happen in the real world.

This next section is pretty deep. You don’t need to understand it, but it does demonstrate how abstracting away a lot of the detail can lead you to bad conclusions. If you’re not that interested, or are time-poor and willing to take a leap of faith and believe me when I say that WAFL is able to maintain extremely high write performance even when the array is almost full, jump down to the text under the next graphic; otherwise, feel free to read on.

Firstly, Data ONTAP does an excellent job of using allocation areas that are much emptier than the system is on average. This means that if the system is 80% full, WAFL is typically writing into allocation areas that are, perhaps, only 40% full. The RAID system also combines logically separate operations into more efficient physical operations.

Suppose, for example, that in writing data to a 32-block long region on a single disk in the RAID group, we find that there are 4 blocks already allocated that cannot be overwritten. First, we read those in; this will likely involve fewer than 4 reads, even if the data is not contiguous. We will issue some smaller number of reads (perhaps only 1) to pick up the blocks we need and the blocks in between, and then discard the blocks in between (these are called dummy reads). When we go to write the data back out, we’ll send all 28 (32-4) blocks down as a single write operation, along with a skip-mask that tells the disk which blocks to skip over. Thus we will send at most 5 operations (1 write + 4 reads) to this disk, and perhaps as few as 2. The parity reads will almost certainly combine, as almost any stripe that has an already allocated block will cause us to read parity.

So suppose we have to do a write to an area that is 25% allocated. We will write 0.75 * 14 * 32 blocks, or 336 blocks. The writes will be performed in 16 operations (1 for each data disk, 1 for each parity disk). On each parity disk we’ll issue 1 read. There are expected to be 8 blocks read from each data disk, but with dummy reads we expect substantial combining, so let’s assume we issue 4 reads per disk (which is very conservative). That gives 4 * 14 + 2 read operations, or 58 read operations. Thus we expect to write 336 blocks in 58 + 16 = 74 disk operations. This gives us a write IEF of 454%, not the 67% predicted by the graph above.

That is the good news. However, life is rarely this good. For example, not all random writes are 4K random writes; if customers start doing 8K random writes, then those 336 blocks are only 168 front-end operations, for 227% efficiency. Furthermore, there is metadata. How much metadata is very sharply dependent on the workload. In worst case situations WAFL can write about as much metadata as data, which is much higher than the real world, but if we go with that ratio then the 336 blocks become 84 front-end operations. This gives us a pretty much worst case write IEF of 113% when almost everything is going against us, which is better even than you’d get from most RAID-0 configurations, and twice as good as RAID-10.
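For anyone who wants to check my arithmetic, the worked example above reduces to a few lines of Python. This is just a back-of-the-envelope sketch using the same assumptions as the text (4 combined reads per data disk, 1 combined read per parity disk), not a model of what WAFL actually does internally:

```python
# Reproducing the worked example above: a 32-block region per disk,
# 14 data + 2 parity disks, with the region 25% already allocated.
data_disks, parity_disks = 14, 2
region_blocks = 32
allocated_fraction = 0.25

# New data blocks written into the region (the free 75%).
blocks_written = (1 - allocated_fraction) * data_disks * region_blocks   # 336

# Back-end disk operations, using the conservative assumptions above.
write_ops = data_disks + parity_disks     # one skip-mask write per disk          = 16
data_reads = 4 * data_disks               # ~4 combined reads per data disk        = 56
parity_reads = parity_disks               # reads combine to ~1 per parity disk    = 2
disk_ops = write_ops + data_reads + parity_reads                                 # 74

def write_ief(front_end_ios: float, back_end_ios: float) -> float:
    """Write I/O Efficiency Factor: front-end IOs served per back-end IO."""
    return front_end_ios / back_end_ios

print(f"4K random writes        : {write_ief(blocks_written, disk_ops):.1%}")      # ~454%
print(f"8K random writes        : {write_ief(blocks_written / 2, disk_ops):.1%}")  # ~227%
print(f"8K + worst-case metadata: {write_ief(blocks_written / 4, disk_ops):.1%}")  # ~113%
```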

Theory is all well and good, but to see how this works in practice, look at the following graph of a real world scenario. Here we have a bunch of small aggregates, each with 28 x 15K disks servicing over 4000 8K IOPS at a 53:47 read/write ratio (Exchange 2007), with aggregate space utilisation above 80%. The main thing to note on this graph is the latency: during this entire time the write latency (the purple line at the bottom) was flat at about 1ms. Read latency was about 6ms, except for a slight (1 – 2 ms) increase in read latency across one of the LUNs during a RAID reconstruct (represented by the circled points 1 and 2 on the graph).

I see this across almost every NetApp array on which I’ve had the chance to do a performance analysis. Read latencies are about the same as on a traditional SAN array, but write latency is consistently very low, even on our smallest controllers. In general, a NetApp array’s ability to service random write requests is only limited by the rate at which sequential writes can be written to the back end disks, which gives us SSD levels of random write performance from good old spinning rust. Ruben may have been gracious in assuming that we achieve the same kind of write performance from RAID-DP as you might get from a traditional RAID-10 layout, but theory, benchmarks, and real world experience say that RAID-DP + WAFL generally does a lot better than that. In most VDI deployments I’d expect to see much better than 150% write IEF.

For write intensive workloads like VDI, this is excellent news, but writes are only half (or maybe 70%) of the story, which brings me to my next post, Data Storage for VDI – Part 6 – Data ONTAP Improving Read Performance.

Data Storage for VDI – Part 4 – The impact of RAID on performance

July 19, 2010 4 comments

As I said at the end of my previous blog post

The read and write cache in traditional modular arrays are too small to make any significant difference to the read and write efficiencies of the underlying RAID configuration in VDI deployments

The good thing is that this makes calculating the Overall I/O Efficiency Factor (IEF) for traditional RAID configurations pretty straightforward. The overall IEF depends on the kind of RAID and the mixture of reads and writes, using the following formula:

Overall IEF = (Read% * read IEF) + (Write% * write IEF).

To start with RAID-5: a single front-end write IOP requires 4 back-end IOPs, giving a write IEF of 25%. If you had 28 x 15K spindles in a RAID-5 configuration, this means that for a purely random write workload you can only sustain 235 * 28 * 25% = 1645 IOPS at 20ms.

Using Ruben’s numbers, a 30:70 VDI steady state read:write workload, the Overall IEF for RAID-5 would be:

(30 * 100%) + (70 * 25%) = 47.5%.

For a 50:50 workload, the Overall IEF would be:

(50 * 100%) + (50 * 25%) = 62.5%

For RAID-10 you sacrifice half of your capacity, but instead of there being 4 back-end IOPS for every 1 front end write there are 2, for a write IEF of 50%. The write coalescing and caching tricks also add some benefit to RAID-10, but again, not enough to make any significant difference.

So how about RAID-6? With RAID-6, every front end write I/O requires 6 IOPS at the back end, for an uncached write IEF of about 17% and a cached write IEF of about 27%. Reads for non-NetApp RAID-6 implementations based on Reed-Solomon algorithms are, yet again, unaffected.

So, what about RAID-DP? Well, much as I hate to say it, even though it is a form of RAID-6, by itself it has the worst performance of all the RAID schemes (and yes, I do still work for NetApp).

Why? Because RAID-DP, like RAID-4, uses dedicated parity disks. Given that, by default, one disk in every 8 is dedicated to parity and can’t be used for data reads, both RAID-4 and RAID-DP immediately take a 13% hit on reads. In addition, just like RAID-6, every front end random write IOP can require up to 6 IOPS at the back end. This would mean that NetApp has the same write performance as RAID-6 and 13% worse read performance.

This gives the following result for overall IEF for the 30:70 read:write use case:

(30 * 87%) + (70 * 17%) = 38%   (!!)

This is exactly the kind of reasoning our competitors use when explaining our technology to others.
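If you want to play with these numbers yourself, here’s a small Python sketch of the same arithmetic. The write IEFs are the uncached figures used above, and the RAID-DP row deliberately follows the competitor-style reasoning rather than what RAID-DP + WAFL actually does in practice (more on that in the next post):

```python
# Overall IEF = (Read% * read IEF) + (Write% * write IEF), using the uncached
# write IEF figures from this post. The RAID-DP row follows the competitor-style
# maths above, not real-world RAID-DP + WAFL behaviour.
RAID_IEF = {
    #             (read IEF, write IEF)
    "RAID-5":    (1.00, 0.25),   # 4 back-end IOs per front-end write
    "RAID-10":   (1.00, 0.50),   # 2 back-end IOs per front-end write
    "RAID-6":    (1.00, 0.17),   # ~6 back-end IOs per front-end write
    "RAID-DP*":  (0.87, 0.17),   # * competitor-style maths: parity disks "lost" to reads
}

def overall_ief(read_pct: float, raid: str) -> float:
    """Blend read and write IEF by the read:write mix (read_pct out of 100)."""
    read_ief, write_ief = RAID_IEF[raid]
    return (read_pct * read_ief + (100 - read_pct) * write_ief) / 100

for raid in RAID_IEF:
    print(f"{raid:9s} 30:70 read:write -> {overall_ief(30, raid):.1%}")

# Sustainable IOPS at 20ms for 28 x 15K spindles (235 IOPS each) on a
# purely random write RAID-5 workload:
print(f"RAID-5, 28 spindles, 100% writes: {235 * 28 * 0.25:.0f} IOPS")   # 1645
```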

So why would NetApp be insane enough to make RAID-DP the default configuration? How have we succeeded so well in the marketplace? Shouldn’t there be a tidal wave of unhappy NetApp customers demanding their money back?

Well, there are a few reasons we use RAID-DP as the default configuration for all NetApp arrays. The first is that dedicated parity drives make RAID reconstructs fast, with minimal performance impact. This also makes it trivially easy to add disks to RAID groups non-disruptively. “This might be great for availability, but what about performance?” I hear you ask. Well, I’ve been told that you can mathematically prove that the RAID-DP algorithms are the most efficient possible way of doing dual parity RAID; frankly, the math is beyond me, but the CPU consumption by the RAID layer is really minimal. The real magic, however, happens because RAID-DP is always combined with WAFL.

This isn’t a good place to explain everything I know about WAFL, and others have already done it better than I probably can (cf. Kostadis’ blog), but I’ll outline the salient benefits from a performance point of view in the next post, Data Storage for VDI – Part 5 – RAID-DP + WAFL The ultimate write accelerator.