Data Storage for VDI – Part 5 – RAID-DP + WAFL: the ultimate write accelerator
A lot has been written about WAFL, but for the most part I still think it's widely misunderstood, even by some folks within NetApp. The alignment between the kind of fine-grained storage virtualisation you get out of WAFL and other forms of compute and network virtualisation is sometimes hard to appreciate until you've had a chance to really get into the guts of it.
Firstly, WAFL means we can write any block to any location, and we use that capability to turn random writes at the front end into sequential writes at the back end. On a brand new system we are able to do full stripe writes to the underlying RAID groups, and the write coalescing works with perfect efficiency without needing much, if any, write cache. If we have a RAID group consisting of 14 data disks and 2 parity disks (the default setting), then a simple way of looking at our write efficiency starts out like this: 14 writes come in, 16 writes go to the back end, 14:16 or 87.5% efficiency, something that makes RAID-10 look a little sick in comparison.
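As a sanity check on those numbers, here is the full-stripe arithmetic in a couple of lines; nothing here goes beyond the 14+2 layout already described:

```python
# Back-end write efficiency for full-stripe writes on a fresh RAID-DP group,
# compared with RAID-10, which mirrors every write.
data_disks = 14      # data disks in the RAID group (the default layout above)
parity_disks = 2     # RAID-DP: row parity + diagonal parity

# 14 front-end writes coalesce into one full stripe of 16 back-end writes.
raid_dp_efficiency = data_disks / (data_disks + parity_disks)
print(f"RAID-DP full stripe: {raid_dp_efficiency:.1%}")   # 87.5%

# RAID-10: every front-end write lands on two disks (primary + mirror).
raid_10_efficiency = 1 / 2
print(f"RAID-10:             {raid_10_efficiency:.1%}")   # 50.0%
```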
Of course, the one thing our competitors seem almost duty-bound to point out is that as WAFL's capacity fills, the ability to do full stripe writes diminishes, which is true, but only up to a point. The following graph shows what would happen to this write efficiency advantage as WAFL fills up, assuming that the data is uniformly and randomly distributed across the entire RAID set and that we had no other way of optimising performance.
The nice thing about this graph is that it is simple, it's reasonably intuitive, and it shows our random write performance staying nicely above RAID-10 until we reach about 60% of the available capacity of a RAID-10 array with the same number of spindles. Now before the likes of @HPstorageguy have a field day, I'd like to point out that this graph/model, like many other simple and intuitive things, such as the idea that the world is flat and the sun revolves around us, is wrong, or at least misleading. The main reason it is misleading is that it underestimates Data ONTAP's ability to exploit data usage patterns that happen in the real world.
This next section is pretty deep and you don't need to understand it, but it does demonstrate how abstracting away a lot of the detail can lead you to bad conclusions. If you're not that interested, or are time-poor and willing to take a leap of faith and believe me when I say that WAFL is able to maintain extremely high write performance even when the array is almost full, jump down to the text under the next graphic; otherwise, feel free to read on.
Firstly, Data ONTAP does an excellent job of choosing allocation areas that are much emptier than the system is on average. This means that if the system is 80% full, then WAFL is typically writing to free space that is, perhaps, only 40% full. The RAID system also combines logically separate operations into more efficient physical operations.
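To make the allocation-area idea concrete, here is a toy sketch. The area count, area size, and the clustered-delete pattern are all invented for illustration (the real selection and free-space handling in Data ONTAP is far more sophisticated); the point is simply that a writer which always targets the emptiest area writes into space that is emptier than the aggregate average:

```python
import random

random.seed(42)
NUM_AREAS = 64      # hypothetical allocation areas in the aggregate
AREA_BLOCKS = 1024  # hypothetical blocks per area

fills = []
for _ in range(NUM_AREAS):
    # Deletes and overwrites cluster unevenly: each area loses a
    # different fraction of its blocks (0-40%, averaging ~20%).
    free_rate = random.uniform(0.0, 0.4)
    freed = sum(1 for _ in range(AREA_BLOCKS) if random.random() < free_rate)
    fills.append((AREA_BLOCKS - freed) / AREA_BLOCKS)

avg_fill = sum(fills) / len(fills)
target_fill = min(fills)  # policy: write to the emptiest area

print(f"aggregate average fill:       {avg_fill:.1%}")
print(f"fill of area chosen to write: {target_fill:.1%}")
```

With the clustered frees above, the chosen write target is markedly emptier than the aggregate as a whole, which is the effect the paragraph describes.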
Suppose, for example, that in writing data to a 32-block long region on a single disk in the RAID group, we find that there are 4 blocks already allocated that cannot be overwritten. First, we read those in; this will likely involve fewer than 4 reads, even if the data is not contiguous. We will issue some smaller number of reads (perhaps only 1) to pick up the blocks we need along with the blocks in between, and then discard the blocks in between (these are called dummy reads). When we go to write the data back out, we'll send all 28 (32 − 4) blocks down as a single write operation, along with a skip-mask that tells the disk which blocks to skip over. Thus we will send at most 5 operations (1 write + 4 reads) to this disk, and perhaps as few as 2. The parity reads will almost certainly combine, as almost any stripe that has an already allocated block will cause us to read parity.

So suppose we have to do a write to an area that is 25% allocated. We will write 0.75 × 14 × 32 blocks, or 336 blocks. The writes will be performed in 16 operations (1 for each data disk, 1 for each parity disk). On each parity disk we'll issue 1 read. We expect to read 8 blocks from each data disk, but with dummy reads we expect substantial combining, so let's assume we issue 4 reads per disk (which is very conservative). That gives 4 × 14 + 2 = 58 read operations. Thus we expect to write 336 blocks in 58 + 16 = 74 disk operations. This gives us a write IEF of 454%, not the 67% predicted by the graph above.

That is the good news. However, life is rarely this good; for example, not all random writes are 4K random writes. If customers start doing 8K random writes, then these 336 blocks are only 168 operations, for 227% efficiency. Furthermore, there is metadata, and how much metadata is very sharply dependent on the workload.
In worst-case situations, WAFL can write about as much metadata as data. This is much higher than the real world, but if we go with that ratio, then those 336 blocks become 84 operations. That gives us a pretty-much-worst-case write IEF of 113% when almost everything is going against us, which is even better than you'd get from most RAID-0 configurations, and twice as good as RAID-10.
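The arithmetic in that worked example is dense enough to be worth checking, so here it is as a short script. The 4-reads-per-data-disk figure and the metadata-equals-data ratio are the assumptions stated in the text, not general constants:

```python
# Worked example: writing a 32-block region per disk across a 14+2 RAID-DP
# group where 25% of the blocks are already allocated.
data_disks, parity_disks = 14, 2
region_blocks = 32
allocated_frac = 0.25

# Useful data blocks written: one skip-mask write per disk.
blocks_written = int((1 - allocated_frac) * data_disks * region_blocks)  # 336
write_ops = data_disks + parity_disks                                    # 16

# Reads: ~4 combined (dummy) reads per data disk, plus 1 per parity disk.
read_ops = 4 * data_disks + 1 * parity_disks                             # 58
total_ops = write_ops + read_ops                                         # 74

ief_4k = blocks_written / total_ops           # 336 / 74  -> ~454%
ief_8k = (blocks_written / 2) / total_ops     # 168 8K ops -> ~227%
# Worst case: metadata roughly equal to data halves the useful 8K writes.
ief_worst = (blocks_written / 4) / total_ops  # 84 / 74   -> ~113%

print(f"4K write IEF:         {ief_4k:.0%}")     # 454%
print(f"8K write IEF:         {ief_8k:.0%}")     # 227%
print(f"worst-case write IEF: {ief_worst:.1%}")  # 113.5%
```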
Theory is all well and good, but to see how this works in practice, look at the following graph of a real-world scenario. Here we have a bunch of small aggregates, each with 28 15K disks, servicing over 4,000 8K IOPS at a 53:47 read/write ratio (Exchange 2007), with aggregate space utilisation above 80%. The main thing to note on this graph is the latency: during this entire time the write latency (the purple line at the bottom) was flat at about 1 ms. Read latency was about 6 ms, except for a slight (1–2 ms) increase in read latency across one of the LUNs during a RAID reconstruct (represented by the circled points 1 and 2 on the graph).
I see this across almost every NetApp array on which I've had the chance to do a performance analysis. Read latencies are about the same as on a traditional SAN array, but write latency is consistently very low, even on our smallest controllers. In general, a NetApp array's ability to service random write requests is limited only by the rate at which sequential writes can be written to the back-end disks, which gives us SSD levels of random write performance from good old spinning rust. Ruben may have been gracious in assuming that we achieve the same kind of write performance from RAID-DP as you might get from a traditional RAID-10 layout, but theory, benchmarks, and real-world experience say that RAID-DP + WAFL generally does a lot better than that. In most VDI deployments I'd expect to see much better than 150% write IEF.
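To see why "random writes at the sequential rate" amounts to SSD-class numbers, here is a rough back-of-the-envelope sketch. The 50 MB/s sustained sequential rate and the 200 random IOPS per 15K spindle are illustrative assumptions of mine, not measured NetApp figures:

```python
# Illustrative back-of-the-envelope: coalesced random writes vs. raw spindles.
# Per-spindle figures below are assumed for illustration, not measured values.
data_disks = 14            # data spindles in a 14+2 RAID-DP group
seq_mb_per_s = 50          # assumed sustained sequential write rate per spindle
block_kb = 4               # front-end random write size

# WAFL coalesces random front-end writes into sequential back-end streams...
coalesced_iops = data_disks * seq_mb_per_s * 1024 / block_kb

# ...versus servicing each 4K write as a random seek on some spindle.
uncoalesced_iops = data_disks * 200   # ~200 random IOPS per 15K disk (assumed)

print(f"coalesced:   ~{coalesced_iops:,.0f} 4K writes/s")  # ~179,200
print(f"uncoalesced: ~{uncoalesced_iops:,} 4K writes/s")   # 2,800
```

Even with generous allowances for parity and metadata overhead, the gap between the two figures is what makes write latency on these systems so flat.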
For write-intensive workloads like VDI this is excellent news, but writes are only half (or maybe 70%) of the story, which brings me to my next post: Data Storage for VDI – Part 6 – Data ONTAP Improving Read Performance.