The nice thing from my point of view is that because VDI’s steady state performance is characterized by a high percentage of random writes and high concurrency, the performance architecture of Data ONTAP has been well optimized for VDI for quite some time, in fact since before VDI was really a focus for anyone. As my dad said to me once, “Sometimes it’s better to be lucky than it is to be good” 🙂
As proof of this, I used our internal VDI sizing tools with the following inputs:
- 1000 users
- 50% Read, 50% Writes
- 10 IOPS per user
- 10GB single instanced (using FAS Deduplication) Operating system image
- 0.5 GB RAM per Guest (used to factor the vSwap requirements)
- 1 GB of Unique data per user (deliberately low to keep the focus on the number of disks required for IOPS)
- 20ms read response times
- WAFL filesystem 90% Full
The sizer came back with needing only 24 FC disks to satisfy the IOPS requirement on our entry level 2040 controller without needing any form of SSD or extra accelerators.
That works out to over 400 IOPS per 15K disk, or about 40 users per 15K disk, roughly four times the 10 users per 15K RAID-DP spindle predicted by Ruben’s model. For the 20% read, 80% write example, the numbers are even better: only 18 disks on the FAS-2040, which is 555 IOPS or 55 users per disk vs. the 9 predicted by Ruben’s model (about 6.1 times, or 611% of, the prediction). To see how this compares to other SAN arrays, check out the following table which outlines the expected efficiencies from RAID 5, 10, and 6 for VDI workloads.
| RAID configuration | Read IEF | Write IEF | Overall Efficiency at 30:70 R/W | Overall Efficiency at 50:50 R/W |
| RAID-DP + WAFL 90% Full | 100% | 200-350% | 230% | 170% |
The really interesting thing about these results is that as the workload becomes more dominated by write traffic, RAID-DP+WAFL gets even greater efficiencies. At a 50:50 workload the write IEF is around 240%, however at 30:70 workload the write IEF is close to 290%. This happens because random reads inevitably cause more disk seeks, whereas writes are pretty much always sequential.
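To make the arithmetic behind these figures easy to check, here is a rough Python sketch. It assumes the table's overall efficiency numbers come from the Overall IEF formula used in this series (read share times read IEF plus write share times write IEF) and solves it for the write IEF. The function and variable names are mine, purely for illustration, and have nothing to do with the internal sizing tool.

```python
# Sizing inputs quoted in the post; all names here are illustrative only.
USERS = 1000
IOPS_PER_USER = 10


def per_disk(disks):
    """Return (front-end IOPS per disk, users per disk) for a given disk count."""
    total_iops = USERS * IOPS_PER_USER
    return total_iops / disks, USERS / disks


def implied_write_ief(overall_ief, read_fraction, read_ief=1.0):
    """Solve overall = read% * read IEF + write% * write IEF for the write IEF."""
    write_fraction = 1.0 - read_fraction
    return (overall_ief - read_fraction * read_ief) / write_fraction


for disks in (24, 18):
    iops, users = per_disk(disks)
    print(f"{disks} disks: {iops:.0f} IOPS/disk, {users:.1f} users/disk")

# Overall efficiencies taken from the RAID-DP + WAFL row of the table.
print(f"write IEF at 50:50 = {implied_write_ief(1.70, 0.5):.0%}")
print(f"write IEF at 30:70 = {implied_write_ief(2.30, 0.3):.0%}")
```

Under those assumptions the implied write IEF lands at about 240% for the 50:50 mix and a little under 290% for the 30:70 mix, which lines up with the figures quoted in the text.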
Don’t get me wrong, I think Ruben did outstanding work, and something which I’ve learned a lot from, but when it comes to sizing NetApp storage by I/O I think he was working with some inaccurate or outdated data that led him to some erroneous conclusions which I hope I’ve been able to clarify in this blog.
In my next post, I hope to cover how megacaching techniques such as NetApp’s FlashCache can be used in VDI environments and a few specific configuration tweaks that can be used on a NetApp array to improve the performance of your VDI environment.
WAFL, Metadata Reads and SRARW
This brings us to reads. WAFL allows us to excel at writes, but what about reads? I’ve already stated that compared to other RAID configurations RAID-DP is about 13% worse for reads, so what does WAFL do to offset that? Well, to start with, it can actually make things worse (and yes, I still work for NetApp). Why and how does this happen? Remember that WAFL is a fine-grained storage virtualisation layer: we map, and can remap, the physical whereabouts of every single 4K block. In order to find the block you want to read, the array needs to consult this map. Old school traditional SAN array controllers don’t need to do this; they use an algorithm like base+offset to find the requested block, or they map larger chunks (e.g. 250KB) and pin a much smaller map inside the array’s cache. Because the WAFL map (the metadata) is relatively large, historically only a portion of it stays in memory cache. When the active working set is very large, WAFL will probably need to do two back-end disk reads for a majority of the front-end reads: one for metadata and one for the data.
The combination of losing read spindles to dedicated parity drives, and then losing more IO bandwidth to metadata reads, can put Data ONTAP at a disadvantage for workloads with a high percentage of random reads. But wait, there’s more! There’s one more issue which is occasionally thrown at us by our competitors. Sometimes known as “Sequential Reads After Random Writes” or SRARW, it can be a problem for WAFL (and, I’d imagine, for other similar data layout engines such as ZFS that use mapping rather than algorithms to locate data). The reason is that turning random writes into sequential writes can mean that sequential reads get turned into random reads, and that has a fairly negative impact on sequential read performance.
Now before I go into this in detail, keep in mind that for the vast majority of VDI deployments this is not a problem. The only time people really tend to notice this is during old school bulk data copy style backups and database integrity checks. Having said that there are a number of things NetApp does to mitigate the SRARW effect.
WAFL and Temporal Locality of Reference
Firstly, another way of looking at things is that WAFL exchanges “spatial locality of reference” for “temporal locality of reference”. For example, when you write a file into a filesystem like NTFS, or update a database record, you will typically update the MFT or indexes at the same time. Regardless of the apparent logical layout, where the MFT or indexes are stored in different regions within the same LUN, WAFL will place all of these updates close to each other on the disk or disks. Similarly, in a VDI deployment a single file written to a fragmented Windows filesystem might logically be written to multiple locations on its disk, but the pieces will all be stored close together on disk on the NetApp array. In VDI and OLTP environments this is a good thing, because in order to access a file or record, you first access the MFT or index, which then points to the data you’re after. Guess what! Because all parts of the file and its metadata are laid out close to each other, there is a very good chance that you won’t need to do a seek+settle to get the heads to the data portions, resulting in much improved disk reads. In effect, this allows a FAS array to do inline physical defragmentation of guest filesystems.
Data ONTAP is able to combine this temporal locality of reference with a little publicized feature called a read-set. A read-set is a record of which sets of data are regularly read together, and it is stored along with the rest of the metadata in WAFL. This provides a level of semantic knowledge about the underlying data that the readahead algorithm uses to ensure that, in most cases, the data has already been requested and read into read cache before the VDI client sends down its next read request.
Secondly (and this really applies more to database environments than it does to VDI, but I’ll include it here for the sake of completeness), there are techniques which completely address the SRARW issue.
1. WAR (woah woah woah, what is it good for)
As it turns out, this kind of WAR is good for quite a few workload types, because it stands for “write after read”. This feature has been available since Data ONTAP 7.3.1. When enabled for a volume, it senses when you’ve requested a bunch of data that is logically sequential, figures out whether it had to do an excessive number of random reads at the back end, and if so, finds a nice clean area to write this stuff out, so that the next time you do the same logically sequential read, it is nicely laid out sequentially in a physical sense. I’ve done some tests in hostile environments (a month of running 10+ hours every day of completely random reads followed by a complete sequential integrity check of an Exchange 2007 database), and the WAR option increased the sequential scan time by about 15%. A subsequent scan of the same database took 40% less time (and it probably would have been faster if I hadn’t hit a client CPU bottleneck during the integrity check).
2. Regular reallocation scans.
These are recommended as a default best practice for database LUNs in the Data ONTAP administration guide, though it seems that nobody actually reads this friendly manual, so it still doesn’t seem to be common practice. These scans execute every night, and run a complete reallocation of only the “fragmented” blocks. Based on some experiments I did on a 3040 with 12 spindles, this works at about 100+ GB per hour, so for a 4TB database with a 2% daily change rate, a nightly reallocate would take about an hour. This might seem like an imposition, but if you’ve cut your nightly backup window by 8 hours due to cool snapshots and SnapVault/SnapMirror, then adding back an hour to optimise the performance isn’t a big ask. This also creates some nice clean free-space areas, which keeps the write performance nice and snappy. As a side effect, regular reallocations mean that any disks added to the aggregate will quickly get hot data evenly spread across them thereby improving read performance even more.
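As a back-of-the-envelope check on that nightly window, here is a small Python sketch. It assumes, as the paragraph above does, that the amount of data to reallocate is roughly the daily change, and it takes 4TB as 4000 GB to keep the numbers round. The function name and parameters are hypothetical and are not Data ONTAP commands or options.

```python
def nightly_reallocate_hours(db_size_gb, daily_change_rate, scan_rate_gb_per_hour):
    """Estimate the hours needed to reallocate one day's worth of changed blocks."""
    changed_gb = db_size_gb * daily_change_rate
    return changed_gb / scan_rate_gb_per_hour


# 4TB database, 2% daily change, ~100 GB per hour (the 3040 / 12 spindle figure above).
hours = nightly_reallocate_hours(4000, 0.02, 100)
print(f"Estimated nightly reallocate time: {hours:.1f} hours")
```

That comes out at a bit under an hour, so "about an hour" is a comfortable planning figure. A slower scan rate or a higher change rate scales the estimate linearly.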
It should be noted that these two techniques don’t work with deduplicated volumes. If you believe you will be running a lot of single threaded sequential reads in your VDI environment, you should consider placing those workloads on a volume which does not have deduplication turned on, and possibly use one of the other single instancing technologies such as VMware View in combination with one of the techniques described above.
As I said before, I’ve included those two points for the sake of completeness, but for VDI environments where the I/O profile is almost completely random, WAFL’s default behavior of a data layout based on temporal locality of reference will give you better performance than a layout based on spatial locality of reference as used by traditional arrays.
That’s soooo random
At this stage it might be worthwhile noting that random reads and writes aren’t truly random; they are merely “non sequential”. There are few truly random things outside of the world of mathematics, storage benchmarks, and quantum physics. It is this that allows the fuzzy logic in Data ONTAP’s read-ahead algorithms to do their remarkable work. NetApp spent a lot of time and brainpower on creating and fine-tuning these, and I’m confident that they are unsurpassed by any other storage array. This is where I’d like to extensively quote another section out of Ruben’s excellent article with some additions of my own.
The NTFS filesystem on a Windows client uses 4 kB blocks by default. Luckily, Windows tries to optimize disk requests to some extent by grouping block requests together if, from a file perspective, they are contiguous [which the readset feature in Data ONTAP is built to recognise]. That means it is important that files are defragged [except in Data ONTAP, where WAFL has already stored these logically fragmented files physically close to each other thanks to the magic of temporal locality] ….. Therefore it is best practice to disable defragging completely once the master image is complete [which might be a concern without the performance optimisations built into Data ONTAP]. The same goes for prefetching. Prefetching is a process that puts all files read more frequently in a special cache directory in Windows, so that the reading of these files becomes one contiguous reading stream, minimizing IO and maximizing throughput. But because IOs from a large number of clients makes it totally random from a storage point of view, prefetching files no longer matters and the prefetching process only adds to the IOs once again. So prefetching should also be completely disabled [however, Data ONTAP effectively and transparently restores this performance enhancement thanks to the way readsets work with Data ONTAP’s prefetch/readahead capabilities]. If the storage is de-duplicating the disks, moving files around inside those disks will greatly disturb the effectiveness of de-duplication. That is yet another reason to disable features like prefetching and defragging [not to mention that, for the most part, with Data ONTAP it’s completely unnecessary].
Aggregating Disk IOPs
Another thing that helps NetApp is the concept of aggregates, which makes it a lot easier to recruit the collective IOPS of all the spindles in an array rather than having IOPS trapped and wasted within small RAID groups. In principle, it’s similar to the closely related concept of wide striping. It also globalises the pool of free blocks, which makes the write allocator’s job much easier. This combination of readsets, hyper-efficient writes and the ability to recruit a lot of spindles to the read workloads means that for most real world workloads, NetApp is as fast as, if not faster than, equivalently configured arrays from other vendors, which was nicely shown in independently audited industry standard SPC-1 benchmarks.
What you might have heard …
For me though, one of the main proofs of the effectiveness of these techniques is that pretty much every “benchmark” run on our kit by our competitors tries to ensure that none of these features are used. I’ve seen things like using artificial 100% completely random workloads to ensure that readsets can’t be used, unrealistically large working sets to ensure the maximum number of metadata reads, and really small aggregates, misaligned I/O and other non best practice configurations to make the write allocator’s job as hard as possible. It’s said that all is fair in love and IT marketing, but the shenanigans that some vendors get up to in order to discredit Data ONTAP’s performance architecture often go beyond the bounds of professional conduct.
Moving right along
OK, now I have that off my chest, I can move on to the next part of my blog, Data Storage for VDI – Part 7 – 1000 heavy users on 18 spindles, where I’ll show how Data ONTAP can help reduce the storage costs for VDI to the point where you can afford to use world class shared storage without the availability and manageability compromises involved with DAS and other forms of cut price storage.
A lot has been written about WAFL, but for the most part, I still think it’s widely misunderstood, even by some folks within NetApp. The alignment between the kind of fine grained storage virtualisation you get out of WAFL and other forms of compute and network virtualisation is sometimes hard to appreciate until you’ve had a chance to really get into the guts of it.
Firstly WAFL means we can write any block to any location, and we use the capability to turn random writes at the front end into sequential writes at the back end. When we have a brand new system, we are able to do full stripe writes to the underlying RAID groups, and the write coalescing works with perfect efficiency without needing to use much if any write cache. If we have a RAID group consisting of 14 data disks and 2 parity disks (the default setting), then a simple way of looking at our write efficiency starts out like this – 14 writes come in, 16 writes go to the back end, 14:16 or 87.5% efficiency, something that makes RAID-10 look a little sick in comparison.
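The full-stripe case is simple enough to put in a couple of lines of Python. This is only a sketch of the ratio described above (front-end writes divided by back-end writes when every stripe is written in full); the function name is mine, and the RAID-10 figure of 50% is the usual two back-end writes per front-end write.

```python
def full_stripe_write_efficiency(data_disks, parity_disks):
    """Front-end writes per back-end write when whole stripes are written at once."""
    return data_disks / (data_disks + parity_disks)


RAID10_WRITE_EFFICIENCY = 1 / 2  # every front-end write is mirrored to two disks

raid_dp = full_stripe_write_efficiency(14, 2)
print(f"RAID-DP full stripe: {raid_dp:.1%}")
print(f"RAID-10:             {RAID10_WRITE_EFFICIENCY:.1%}")
```

The wider the RAID group, the closer the full-stripe ratio gets to 100%, which is why the default 14+2 layout compares so well against mirroring on a clean system.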
Of course, the one thing that our competitors seem almost duty bound to point out is that as WAFL’s capacity fills, the ability to do full stripe writes diminishes, which is true, but only up to a point. The following graph shows what would happen to this write efficiency advantage as WAFL fills up, assuming that the data is uniformly and randomly distributed across the entire RAID set and that we had no other way of optimizing performance.
The nice thing about this graph is that it is simple, it’s reasonably intuitive, and it shows our random write performance stays nicely above RAID-10 until we reach about 60% of the available capacity of a RAID-10 array with the same number of spindles. Now, before the likes of @HPstorageguy have a field day, I’d like to point out that this graph/model, like many other simple and intuitive things (such as the idea that the world is flat and that the sun revolves around us), is wrong, or at least misleading. The main reason it is misleading is that it underestimates Data ONTAP’s ability to exploit data usage patterns that happen in the real world.
This next section is pretty deep, you don’t need to understand it, but it does demonstrate how abstracting away a lot of the detail can lead you to bad conclusions. If you’re not that interested, or are time poor and you’re willing to take a leap of faith and believe me when I say that WAFL is able to maintain extremely high write performance even when the array is almost full, jump down to the text under the next graphic, otherwise feel free to read on.
Firstly, Data ONTAP does an excellent job of using allocation areas that are much emptier than the system is on average. This means that if the system is 80% full then WAFL is typically writing to free space that is, perhaps, 40% full. The RAID system also combines logically separate operations into more efficient physical operations.
Suppose, for example, that in writing data to a 32-block long region on a single disk in the RAID group, we find that there are 4 blocks already allocated that cannot be overwritten. First, we read those in; this will likely involve fewer than 4 reads, even if the data is not contiguous. We will issue some smaller number of reads (perhaps only 1) to pick up the blocks we need and the blocks in between, and then discard the blocks in between (called dummy reads). When we go to write the data back out, we’ll send all 28 (32-4) blocks down as a single write operation, along with a skip-mask that tells the disk which blocks to skip over. Thus we will send at most 5 operations (1 write + 4 reads) to this disk, and perhaps as few as 2. The parity reads will almost certainly combine, as almost any stripe that has an already allocated block will cause us to read parity.
So suppose we have to do a write to an area that is 25% allocated. We will write .75 * 14 * 32 blocks, or 336 blocks. The writes will be performed in 16 operations (1 for each data disk, 1 for each parity disk). On each parity disk we’ll issue 1 read. There are expected to be 8 blocks read from each data disk, but with dummy reads we expect substantial combining, so let’s assume we issue 4 reads per disk (which is very conservative). That makes 4 * 14 + 2 = 58 read operations. Thus we expect to write 336 blocks in 58 + 16 = 74 disk operations. This gives us a write IEF of 454%, not the 67% predicted by the graph above.
That is the good news. However, life is rarely this good. For example, not all random writes are 4K random writes. If customers start doing 8K random writes, then these 336 blocks are only 168 operations, for 227% efficiency. Furthermore, there is metadata. How much metadata depends very sharply on the workload. In worst case situations, WAFL can write about as much metadata as data. This is much higher than real-world, but if we go with that ratio, then 336 blocks becomes 84 operations. This gives us pretty much a worst case write IEF of 113% when almost everything is going against us, which is better even than you’d get from most RAID-0 configurations, and more than twice as good as RAID-10.
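The worked example above can be replayed in a few lines of Python. This is a sketch of the arithmetic only, under the same assumptions as the text: a 14+2 RAID group, 32-block regions, one combined write per disk, one read per parity disk, and a guessed number of combined reads per data disk. The function and parameter names are illustrative and do not describe how Data ONTAP is implemented.

```python
DATA_DISKS = 14
PARITY_DISKS = 2
REGION_BLOCKS = 32  # 4K blocks per disk in the region being written


def write_ief(allocated_fraction, reads_per_data_disk,
              blocks_per_front_end_write=1, metadata_per_data_block=0.0):
    """Front-end write operations serviced per back-end disk operation."""
    blocks_written = (1 - allocated_fraction) * DATA_DISKS * REGION_BLOCKS
    write_ops = DATA_DISKS + PARITY_DISKS  # one combined write per disk
    read_ops = reads_per_data_disk * DATA_DISKS + PARITY_DISKS
    back_end_ops = write_ops + read_ops
    # Only part of what is written is user data if metadata shares the space.
    data_blocks = blocks_written / (1 + metadata_per_data_block)
    front_end_ops = data_blocks / blocks_per_front_end_write
    return front_end_ops / back_end_ops


print(f"4K writes, no metadata:  {write_ief(0.25, 4):.3f}")
print(f"8K writes, no metadata:  {write_ief(0.25, 4, 2):.3f}")
print(f"8K writes, 1:1 metadata: {write_ief(0.25, 4, 2, 1.0):.3f}")
```

The three ratios correspond to the 454%, 227% and 113% figures in the text, which are rounded down. Changing `reads_per_data_disk` is a quick way to see how sensitive the result is to how well the reads combine.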
Theory is all well and good, but to see how this works in practice, look at the following graph of a real world scenario. Here we have a bunch of small aggregates, each with 28 15K disks servicing over 4000 8K IOPS at a 53:47 read/write ratio (Exchange 2007), with aggregate space utilisation above 80%. The main thing to note on this graph is the latency: during this entire time the write latency (the purple line at the bottom) was flat at about 1ms. Read latency was about 6 ms, except for a slight (1 – 2 ms) increase in read latency across one of the LUNs during a RAID reconstruct (represented by the circled points 1 and 2 on the graph).
I see this across almost every NetApp array on which I’ve had the chance to do a performance analysis. Read latencies are around about the same as a traditional SAN array, but write latency is consistently very low, even on our smallest controllers. In general a NetApp array’s ability to service random write requests is only limited by the rate at which sequential writes can be written to the back end disks, which gives us SSD levels of random write performance from good old spinning rust. Ruben may have been gracious by assuming that we were achieving the same kind of write performance from RAID-DP as you might get from a traditional RAID-10 layout, but theory, benchmarks, and real world experience say that RAID-DP + WAFL generally does a lot better than that. In most VDI deployments I’d expect to see much better than 150% Write IEF.
For write intensive workloads like VDI, this is excellent news, but writes are only half (or maybe 70%) of the story, which brings me to my next post Data Storage for VDI – Part 6 – Data ONTAP Improving Read Performance
As I said at the end of my previous blog post
The read and write cache in traditional modular arrays are too small to make any significant difference to the read and write efficiencies of the underlying RAID configuration in VDI deployments
The good thing is that this makes calculating the Overall I/O Efficiency Factor (IEF) for traditional RAID configurations pretty straightforward. The overall IEF depends on the kind of RAID and the mixture of reads and writes, according to the following formula:
Overall IEF = (Read% * read IEF) + (Write% * write IEF).
To start, with RAID-5, a single front-end write IOP requires 4 back-end IOPs, giving a write IEF of 25%. If you had 28 * 15K spindles in a RAID-5 configuration, this means you can only sustain 235 * 28 * 25% = 1645 IOPS at 20ms.
Using Ruben’s 30:70 VDI steady state read:write workload, the Overall IEF for RAID-5 would be
(30 * 100%) + (70 * 25%) = 47.5%.
For a 50:50 workload, the Overall IEF would be
(50 * 100%) + (50 * 25%) = 62.5%
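For anyone who wants to plug in their own numbers, here is the formula as a minimal Python sketch, applied to the RAID-5 case with a write IEF of 25% and the 235 IOPS per 15K spindle figure used above. The function names are mine; read and write shares are passed as fractions rather than percentages.

```python
def overall_ief(read_fraction, read_ief, write_ief):
    """Overall IEF = (Read% * read IEF) + (Write% * write IEF)."""
    write_fraction = 1.0 - read_fraction
    return read_fraction * read_ief + write_fraction * write_ief


def sustained_front_end_iops(spindles, iops_per_spindle, ief):
    """Front-end IOPS a set of spindles can sustain at a given efficiency factor."""
    return spindles * iops_per_spindle * ief


RAID5_READ_IEF = 1.00
RAID5_WRITE_IEF = 0.25  # 4 back-end I/Os for every front-end write

for read_fraction in (0.30, 0.50):
    ief = overall_ief(read_fraction, RAID5_READ_IEF, RAID5_WRITE_IEF)
    print(f"RAID-5 overall IEF at {read_fraction:.0%} reads: {ief:.1%}")

print(sustained_front_end_iops(28, 235, RAID5_WRITE_IEF), "write IOPS from 28 spindles")
```

Swapping in a different write IEF is all it takes to compare other RAID schemes on the same basis.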
For RAID-10 you sacrifice half of your capacity, but instead of there being 4 IOPS for every 1 front end write there are 2, for a write IEF of 50%. The write coalescing caching tricks also add benefit to RAID-10 but, again, not sufficiently to make any significant difference.
So how about RAID-6? With RAID-6, every front end write I/O requires 6 IOPS at the back end, for an uncached write IEF of about 17% and a cached write IEF of about 27%. Reads for non-NetApp RAID-6 based on Reed-Solomon algorithms are, yet again, unaffected.
So, what about RAID-DP? Well, much as I hate to say it, even though it is a form of RAID-6, by itself it has the worst performance of all the RAID schemes (and yes, I do still work for NetApp).
Why? Because RAID-DP, like RAID-4, uses dedicated parity disks. Given that, by default, one disk in every 8 is dedicated to parity and can’t be used for data reads, both RAID-4 and RAID-DP immediately take a 13% hit on reads. In addition, just like RAID-6, every front end random write IOP can require up to 6 IOPS at the back end. This would mean that NetApp has the same write performance as RAID-6 and 13% worse read performance.
This gives the following result for overall IEF for the 30:70 read:write use case:
(30 * 87%) + (70 * 17%) = 38% (!!)
This is exactly the kind of reasoning our competitors use when explaining our technology to others.
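To see that line of reasoning side by side for each RAID type, here is a short Python sketch of the naive model: write IEF is taken as one over the number of back-end I/Os per front-end write, and the only read penalty is the one-in-eight dedicated parity disks for RAID-DP. It deliberately ignores WAFL and caching, and the names are purely illustrative. It uses unrounded inputs (7/8 and 1/6), so the RAID-DP figure differs slightly from the rounded hand calculation.

```python
READ_FRACTION = 0.30  # the 30:70 read:write VDI steady state mix

# (read IEF, back-end I/Os per front-end write) under the naive model.
SCHEMES = {
    "RAID-5": (1.0, 4),
    "RAID-10": (1.0, 2),
    "RAID-6": (1.0, 6),
    "RAID-DP without WAFL": (7 / 8, 6),
}


def naive_overall_ief(read_ief, back_end_ios_per_write, read_fraction=READ_FRACTION):
    """Overall IEF when write IEF is simply 1 / (back-end I/Os per write)."""
    write_ief = 1 / back_end_ios_per_write
    return read_fraction * read_ief + (1 - read_fraction) * write_ief


for name, (read_ief, penalty) in SCHEMES.items():
    print(f"{name:22s} {naive_overall_ief(read_ief, penalty):.1%}")
```

On this model RAID-DP comes out at the bottom of the list, which is the whole point of the comparison: the model leaves out everything WAFL does.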
So why would NetApp be insane enough to make RAID-DP the default configuration? How have we succeeded so well in the market place ? Shouldn’t there be a tidal wave of unhappy NetApp customers demanding their money back?
Well, there are a few reasons we use RAID-DP as the default configuration for all NetApp arrays. The first is that dedicated parity drives make RAID reconstructs fast with minimal performance impact. They also make it trivially easy to add disks to RAID groups non-disruptively. “This might be great for availability, but what about performance?” I hear you ask. Well, I’ve been told that you can mathematically prove that the RAID-DP algorithms are the most efficient possible way of doing dual parity RAID. Frankly the math is beyond me, but the CPU consumption by the RAID layer is really minimal. The real magic, however, happens because RAID-DP is always combined with WAFL.
This isn’t a good place to explain everything I know about WAFL, and others have already done it better than I probably can (cf Kostadis’ Blog), but I’ll outline the salient benefits from a performance point of view in the next post, Data Storage for VDI – Part 5 – RAID-DP + WAFL The ultimate write accelerator
There’s been an interesting conversation thread about the availability implications of scale up vs scale out in server virtualisation, initially at yellowbricks, then over yonder, and most recently at Scott Lowe’s blog.
What I find interesting is that the question focussed almost entirely on the impact of server failure, and with the exception of two comments (one of them mine), so far none of them have mentioned storage, or indeed any other part of the IT infrastructure.
OK, so maybe worrying about complete data center outages is not generally part of a system engineer’s brief, and there is an assumption that the networking and storage layers are configured with sufficient levels of redundancy that they are thought of as being 100% reliable. This might account for the lack of concern over how many VMs are hosted on a particular storage array or LUN, or via a set of switches. Most of the discussions around how many virtual machines should be put within a given datastore seem to focus on performance, LUN queue depths, and the efficiency of distributed vs centralised lock managers.
From a personal perspective, I don’t think this level of trust is entirely misplaced, but while most virtualisation savvy engineers seem to be on top of the reliability characteristics of high-end servers, there doesn’t seem to be a corresponding level of expertise in evaluating the reliability characteristics of shared storage systems.
Working for NetApp gives me access to a body of research material that most people aren’t aware of. A lot of it is publicly available, but it can be a little hard to get your head around.
Part of the problem is that there have been few formal studies published analyzing the reliability of storage system components. Early work done in 1989 presented a reliability model based on formula-derived and datasheet-specified MTTFs of each component, assuming that component failures follow exponential distributions and that failures are independent.
Models based on these assumptions, and on the assumption that systems should be modeled using homogeneous Poisson processes, remain in common use today. However, research sponsored by NetApp shows that these models may severely underestimate the annual failure rates for important subsystems such as RAID and Disk Shelves/Disk Access Enclosures and their associated interconnects.
Two NetApp-sponsored studies contain sophisticated models, supported by field data, for evaluating the reliability of various storage array configurations: “A Comprehensive Study of Storage Subsystem Failure Characteristics” by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou and Arkady Kanevsky, April 2008 (http://media.netapp.com/documents/dont-blame-disks-for-every-storage-subsystem-failure.pdf), and “A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)” by Jon G. Elerath and Michael Pecht, IEEE Transactions on Computers, Vol. 58, No. 3, March 2009 (http://media.netapp.com/documents/rp-0046.pdf). These reports are a little dense, so I’ll summarise some of the key findings below.
- Physical interconnect failures make up the largest part (27-68%) of storage subsystem failures; disk failures make up the second largest part (20-55%).
- Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect.
- Spanning disks of a RAID group across multiple shelves provides a more resilient solution for storage subsystems than within a single shelf.
- State of the art disk reliability models yield estimates of Dual Drive Failures that are as much as 4,000 times greater than the commonly used Mean Time to Data Loss (MTTDL) based estimates.
- Latent defects are inevitable, and scrubbing latent defects is imperative to RAID N + 1 (RAID-4, RAID-5, RAID-1, RAID-10) reliability. As HDD capacity increases, the number of latent defects will also increase and render the MTTDL method less accurate.
- Although scrubbing is a viable method to eliminate latent defects, there is a trade-off between serving data and scrubbing. As the demand on the HDD increases, less time will be available to scrub. If scrubbing is given priority, then system response to demands for data will be reduced. A second alternative, which accepts latent defects while still increasing system reliability, is to increase redundancy to N + 2 (RAID-6). Configurations that utilize RAID-6 allow RAID scrubs to be deferred to times when their performance impact will not affect production workloads.
Another interesting thing you find once you start using these more sophisticated reliability models is that most RAID-5 RAID sets have an availability of around “three nines”, RAID-10 comes in at about “four nines”, and only RAID-6 gets close to the magical figure of “five nines” of availability. Don’t get me wrong, the array as an entire entity may have well over “five nines”, which is important because the failure of a single array can impact tens, if not hundreds, of servers, but at the individual RAID group level the availability percentages are way below that.
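To put those “nines” into more tangible terms, here is a small Python sketch that converts an availability figure into expected downtime per year. It is plain arithmetic on the definition of availability, using a 365-day year, and says nothing about how the figures in the studies were derived. The function name is mine.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(nines):
    """Expected unavailable hours per year for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. three nines = 99.9% available
    return HOURS_PER_YEAR * unavailability


for label, nines in (("three nines", 3), ("four nines", 4), ("five nines", 5)):
    hours = downtime_hours_per_year(nines)
    print(f"{label}: {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

The gap between three nines and five nines is the difference between most of a working day and a few minutes per year, which matters a great deal more once many virtual machines share one RAID group.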
In the good old days, when the failure of a single RAID group generally affected a single server, these kinds of availability percentages were probably OK. But when a LUN/RAID group is being used for a VMware datastore, where the failure of a single RAID group may impact tens, or possibly hundreds, of virtual machines, the reliability of the RAID group becomes as important as the availability of the whole array used to be.
If you’re going to put all your eggs in one basket, then you better be sure that your basket is clad in titanium with pneumatic padding. This applies not just to getting the most out of your servers, but needs to go all the way through the infrastructure stack down to the LUN and RAID groups that store all that critical data.