Data Storage for VDI – Part 6 – Data ONTAP Improving Read Performance
WAFL, Metadata Reads and SRARW
This brings us to reads. WAFL allows us to excel at writes, but what about reads? I’ve already stated that, compared to other RAID configurations, RAID-DP is about 13% worse for reads, so what does WAFL do to offset that? Well, to start with, it can actually make things worse (and yes, I still work for NetApp). Why and how does this happen? Remember that WAFL is a fine-grained storage virtualisation layer: we map, and can remap, the physical whereabouts of every single 4K block. In order to find the block you want to read, the array needs to consult this map. Old school traditional SAN array controllers don’t need to do this; they either use an algorithm like base+offset to find the requested block, or they map larger chunks (e.g. 250KB) and pin a much smaller map inside the array’s cache. Because the WAFL map (the metadata) is relatively large, historically only a portion of it stays in memory cache. When the active working set is very large, WAFL will probably need to do two back-end disk reads for a majority of the front-end reads: one for the metadata and one for the data.
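To make the cost of those extra metadata reads concrete, here is a toy model in Python. It is only a sketch of the idea described above, not how Data ONTAP is implemented: the class name, the map contents and the choice of which map entries are cached are all made up for illustration, and every front-end read is assumed to miss the data cache.

```python
from typing import Dict, Set, Tuple


def locate_base_offset(base: int, logical_block: int) -> int:
    """Algorithmic lookup: the physical address is computed, so no map is read."""
    return base + logical_block


class MappedVolume:
    """Toy block map where only some of the map entries are held in cache."""

    def __init__(self, block_map: Dict[int, int], cached: Set[int]) -> None:
        self.block_map = block_map  # logical block -> physical block
        self.cached = cached        # logical blocks whose map entry is in memory

    def read(self, logical_block: int) -> Tuple[int, int]:
        """Return (physical block, back-end disk reads needed for this read)."""
        # One extra disk read if the map entry has to be fetched first.
        metadata_reads = 0 if logical_block in self.cached else 1
        physical = self.block_map[logical_block]
        return physical, metadata_reads + 1  # +1 for the data block itself


# Hypothetical volume: four blocks, map entries for blocks 0 and 1 are cached.
volume = MappedVolume(block_map={0: 907, 1: 12, 2: 455, 3: 78}, cached={0, 1})
total = sum(volume.read(block)[1] for block in range(4))
print(total)  # 6 back-end reads for 4 front-end reads
```

The smaller the cached share of the map relative to the working set, the closer this gets to two back-end reads per front-end read.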
The combination of losing read spindles to dedicated parity drives, and then losing more IO bandwidth to metadata reads, can put Data ONTAP at a disadvantage for workloads with a high percentage of random reads. But wait, there’s more! There’s one more issue which is occasionally thrown at us by our competitors. Sometimes known as “Sequential Reads After Random Writes”, or SRARW, it can be a problem for WAFL (and, I’d imagine, for other similar data layout engines such as ZFS that use mapping rather than algorithms to locate data). The reason is that turning random writes into sequential writes can mean that sequential reads get turned into random reads, and that has a fairly negative impact on sequential read performance.
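The SRARW effect is easy to reproduce in a few lines of Python. This is a deliberately simplified simulation, assuming a single disk, a file that starts out physically contiguous, and an allocator that always writes to the next free block; the function names and default sizes are invented for the example.

```python
import random
from typing import List


def count_seeks(physical_order: List[int]) -> int:
    """Count reads where the next physical block is not adjacent to the previous one."""
    return sum(1 for prev, cur in zip(physical_order, physical_order[1:])
               if cur != prev + 1)


def simulate_srarw(num_blocks: int = 1000, overwrites: int = 300, seed: int = 1) -> int:
    """Randomly overwrite blocks of a contiguous file, then read it sequentially."""
    rng = random.Random(seed)
    location = list(range(num_blocks))  # logical -> physical, contiguous to start
    next_free = num_blocks              # free space begins right after the file
    for _ in range(overwrites):
        logical = rng.randrange(num_blocks)
        location[logical] = next_free   # the random write lands sequentially on disk
        next_free += 1
    # A logically sequential read now visits the physical blocks in this order.
    return count_seeks(location)


print(simulate_srarw(overwrites=0))  # 0: an untouched file reads with no seeks
print(simulate_srarw())              # many seeks after the random overwrites
```

Every overwritten block typically costs one jump away from the original extent and one jump back, which is why the sequential read starts to look like random IO.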
Now, before I go into this in detail, keep in mind that for the vast majority of VDI deployments this is not a problem. The only time people really tend to notice it is during old school bulk data copy style backups and database integrity checks. Having said that, there are a number of things NetApp does to mitigate the SRARW effect.
WAFL and Temporal Locality of Reference
Firstly, another way of looking at things is that WAFL exchanges “spatial locality of reference” for “temporal locality of reference”. For example, when you write a file into a filesystem like NTFS, or update a database record, you will typically update the MFT or indexes at the same time. Even though, in the apparent logical layout, the MFT or indexes are stored in different regions of the same LUN, WAFL will place all of these updates close to each other on the disk or disks. Similarly, in a VDI deployment, a write of a single file to a fragmented Windows filesystem might logically be written to multiple locations on its disk, but the pieces will all be stored close together on disk on the NetApp array. In VDI and OLTP environments this is a good thing, because in order to access a file or record, you first access the MFT or index, which then points to the data you’re after. Guess what! Because all parts of the file and its metadata are laid out close to each other, there is a very good chance that you won’t need to do a seek+settle to get the heads to the data portions, resulting in much improved disk reads. In effect, this allows a FAS array to do inline physical defragmentation of guest filesystems.
Data ONTAP is able to combine this temporal locality of reference with a little publicized feature called a read-set. A read-set is a record of which sets of data are regularly read together, and it is stored along with the rest of the metadata in WAFL. This provides a level of semantic knowledge about the underlying data that the readahead algorithm uses to ensure that, in most cases, the data has already been requested and read into read cache before the VDI client sends down its next read request.
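The general idea of remembering which blocks are read together, and using that to prefetch, can be sketched in Python as follows. To be clear, this is my own minimal illustration and not the read-set implementation in Data ONTAP: the class, its `window` parameter and the `evict_all` helper are all hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class ReadSetPrefetcher:
    """Learn which blocks follow each other and prefetch them on the next read."""

    def __init__(self, window: int = 4) -> None:
        self.window = window                   # how many earlier reads count as "together"
        self.recent: List[int] = []            # most recently read blocks
        self.read_sets: DefaultDict[int, Set[int]] = defaultdict(set)
        self.cache: Set[int] = set()           # blocks currently in read cache

    def read(self, block: int) -> bool:
        """Serve one read; return True if the block was already in cache."""
        hit = block in self.cache
        # Learn: this block was read shortly after each of the recent blocks.
        for earlier in self.recent:
            self.read_sets[earlier].add(block)
        self.recent = (self.recent + [block])[-self.window:]
        # Cache the block and prefetch everything known to follow it.
        self.cache.add(block)
        self.cache.update(self.read_sets[block])
        return hit

    def evict_all(self) -> None:
        """Simulate the cache going cold while the learned read-sets are kept."""
        self.cache.clear()
        self.recent.clear()


prefetcher = ReadSetPrefetcher()
pattern = [10, 57, 3, 99]                      # physically scattered, read together
first_pass = [prefetcher.read(b) for b in pattern]
prefetcher.evict_all()
second_pass = [prefetcher.read(b) for b in pattern]
print(first_pass)   # [False, False, False, False]
print(second_pass)  # [False, True, True, True]
```

On the second pass only the first read misses; the rest were pulled into cache because the pattern had been seen before, even though the block numbers look random.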
Secondly (and this really applies more to database environments than it does to VDI, but I’ll include it here for the sake of completeness), there are techniques which completely address the SRARW issue.
1. WAR (woah woah woah, what is it good for?)
As it turns out, this kind of WAR is good for quite a few workload types, because it stands for “write after read”. This feature has been available since Data ONTAP 7.3.1. When enabled for a volume, it senses when you’ve requested a bunch of data that is logically sequential, figures out whether it had to do an excessive number of random reads at the back end, and if so, finds a nice clean area to write this stuff out, so that the next time you do the same logically sequential read, it is nicely sequentially laid out in a physical sense. I’ve done some tests in hostile environments (a month of running 10+ hours every day of completely random reads followed by a complete sequential integrity check of an Exchange 2007 database), and the WAR option increased the time of that sequential scan by about 15%. A subsequent scan of the same database took 40% less time (and it probably would have been faster if I hadn’t hit a client CPU bottleneck during the integrity check).
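The decision described above (sequential request, too many random back-end reads, so rewrite the data contiguously) can be sketched like this in Python. It is an illustration of the logic only: the function name, the `max_fragment_ratio` threshold and the flat logical-to-physical list are assumptions of mine, not Data ONTAP internals.

```python
from typing import List


def write_after_read(location: List[int], start: int, length: int,
                     next_free: int, max_fragment_ratio: float = 0.25) -> int:
    """After a logically sequential read, rewrite the extent if it was too scattered.

    `location` maps logical block -> physical block and is updated in place.
    Returns the new start of free space.
    """
    extent = location[start:start + length]
    # Each physical discontinuity stands for one random back-end read.
    breaks = sum(1 for a, b in zip(extent, extent[1:]) if b != a + 1)
    if length < 2 or breaks / (length - 1) <= max_fragment_ratio:
        return next_free                 # layout is good enough, leave it alone
    for i in range(length):
        location[start + i] = next_free + i  # lay the extent out contiguously
    return next_free + length


layout = [5, 90, 6, 41, 7, 13]           # hypothetical badly scattered file
free = write_after_read(layout, start=0, length=6, next_free=100)
print(layout, free)  # [100, 101, 102, 103, 104, 105] 106
```

The rewrite is extra work on the first scan, which fits with that scan getting slower and the following one getting faster in the test described above.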
2. Regular reallocation scans.
These are recommended as a default best practice for database LUNs in the Data ONTAP administration guide, though it seems that nobody actually reads this friendly manual, so it still doesn’t seem to be common practice. These scans execute every night, and reallocate only the “fragmented” blocks. Based on some experiments I did on a 3040 with 12 spindles, this runs at 100+ GB per hour, so for a 4TB database with a 2% daily change rate, a nightly reallocate would take about an hour. This might seem like an imposition, but if you’ve cut your nightly backup window by 8 hours thanks to cool snapshots and SnapVault/SnapMirror, then adding back an hour to optimise performance isn’t a big ask. This also creates some nice clean free-space areas, which keeps the write performance nice and snappy. As a side effect, regular reallocations mean that any disks added to the aggregate will quickly get hot data evenly spread across them, thereby improving read performance even more.
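The back-of-the-envelope sum behind that “about an hour” figure is easy to check in Python. The function below is just my arithmetic, assuming 1TB = 1000GB and, as a worst case, that every changed block needs reallocating.

```python
def nightly_reallocate_hours(db_size_gb: float, daily_change_rate: float,
                             realloc_gb_per_hour: float) -> float:
    """Estimate how long a nightly reallocation scan takes.

    Assumes all data changed during the day is fragmented (an upper bound).
    """
    changed_gb = db_size_gb * daily_change_rate
    return changed_gb / realloc_gb_per_hour


# Figures from the text: 4TB database, 2% daily change, roughly 100 GB per hour.
hours = nightly_reallocate_hours(4000, 0.02, 100)
print(f"{hours:.1f} hours")  # 0.8 hours
```

That is 80GB of changed data per night, so the scan finishes in under an hour at 100 GB per hour, and sooner if the real rate is higher.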
It should be noted that these two techniques don’t work with deduplicated volumes. If you believe you will be running a lot of single threaded sequential reads in your VDI environment, you should consider placing those workloads on a volume which does not have deduplication turned on, and possibly use one of the other single instancing technologies, such as VMware View, in combination with one of the techniques described above.
As I said before, I’ve included those two points for the sake of completeness, but for VDI environments where the I/O profile is almost completely random, WAFL’s default behavior of a data layout based on temporal locality of reference will give you better performance than a layout based on spatial locality of reference as used by traditional arrays.
That’s soooo random
At this stage it might be worthwhile noting that random reads and writes aren’t truly random; they are merely “non sequential”. There are few truly random things outside of the world of mathematics, storage benchmarks, and quantum physics. It is this that allows the fuzzy logic in Data ONTAP’s read-ahead algorithms to do its remarkable work. NetApp spent a lot of time and brainpower on creating and fine-tuning these algorithms, and I’m confident that they are unsurpassed by any other storage array. This is where I’d like to extensively quote another section out of Ruben’s excellent article, with some additions of my own.
The NTFS filesystem on a Windows client uses 4 kB blocks by default. Luckily, Windows tries to optimize disk requests to some extent by grouping block requests together if, from a file perspective, they are contiguous [which the readset feature in Data ONTAP is built to recognise]. That means it is important that files are defragged [except in Data ONTAP, where WAFL has already stored these logically fragmented files physically close to each other thanks to the magic of temporal locality] ….. Therefore it is best practice to disable defragging completely once the master image is complete [which might be a concern without the performance optimisations built into Data ONTAP]. The same goes for prefetching. Prefetching is a process that puts all files read more frequently in a special cache directory in Windows, so that the reading of these files becomes one contiguous reading stream, minimizing IO and maximizing throughput. But because IOs from a large number of clients makes it totally random from a storage point of view, prefetching files no longer matters and the prefetching process only adds to the IOs once again. So prefetching should also be completely disabled. [However, Data ONTAP effectively and transparently restores this performance enhancement thanks to the way readsets work with Data ONTAP’s prefetch/readahead capabilities.] If the storage is de-duplicating the disks, moving files around inside those disks will greatly disturb the effectiveness of de-duplication. That is yet another reason to disable features like prefetching and defragging. [Not to mention that, for the most part, with Data ONTAP it’s completely unnecessary.]
Aggregating Disk IOPs
Another thing that helps NetApp is the concept of aggregates, which makes it a lot easier to recruit the collective IOPs of all the spindles in an array rather than having IOPs trapped and wasted within small RAID groups. In principle, it’s similar to the closely related concept of wide striping. It also globalises the pool of free blocks, which makes the write allocator’s job much easier. This combination of readsets, hyper-efficient writes and the ability to recruit a lot of spindles to the read workloads means that for most real world workloads, NetApp is as fast as, if not faster than, equivalently configured arrays from other vendors, which was nicely shown in independently audited industry standard SPC-1 benchmarks.
What you might have heard …
For me though, one of the main proofs of the effectiveness of these techniques is that pretty much every “benchmark” run on our kit by our competitors tries to ensure that none of these features are used. I’ve seen things like using artificial 100% completely random workloads to ensure that readsets can’t be used, unrealistically large working sets to ensure the maximum number of metadata reads, and really small aggregates, misaligned I/O and other non best practice configurations to make the write allocator’s job as hard as possible. It’s said that all is fair in love and IT marketing, but the shenanigans that some vendors get up to in order to discredit Data ONTAP’s performance architecture often go beyond the bounds of professional conduct.
Moving right along
OK, now that I have that off my chest, I can move on to the next part of my blog, Data Storage for VDI – Part 7 – 1000 heavy users on 18 spindles, where I’ll show how Data ONTAP can help reduce the storage costs for VDI to the point where you can afford to use world class shared storage without the availability and manageability compromises involved with DAS and other forms of cut price storage.