Data Storage for VDI – Part 3 – Read and Write Caching
Before I talk about the efficiencies of various RAID configurations, I’d like to get the potentially contentious subject of cache efficiency out of the way. According to Ruben, each desktop will generate about 7 random write IOPS. If we work with a modest number of 28 15K spindles in a RAID-5 configuration, then these spindles could support about 235 desktops worth of write I/O. While write caching helps with performance, this much I/O will fill 1GB of write cache in less than three minutes. Once the cache is full, writes will only be as fast as your back-end I/O. Ruben addresses this pretty well when he says:
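If you want to sanity-check those figures yourself, the arithmetic is simple enough to sketch. Note that the per-spindle IOPS figure and the 4KiB average write size below are my own assumptions, not numbers from the post:

```python
# Back-of-envelope check of the figures above. Assumptions (mine, not the
# post's): ~235 IOPS per 15K spindle and a 4 KiB average write size.
SPINDLES = 28
IOPS_PER_SPINDLE = 235          # assumed achievable per 15K drive
RAID5_WRITE_PENALTY = 4         # read old data, read parity, write data, write parity
WRITE_IOPS_PER_DESKTOP = 7

backend_iops = SPINDLES * IOPS_PER_SPINDLE
frontend_write_iops = backend_iops / RAID5_WRITE_PENALTY
desktops = frontend_write_iops / WRITE_IOPS_PER_DESKTOP
print(round(desktops))          # roughly the 235 desktops quoted above

# How long does 1 GiB of write cache last at that sustained write rate?
IO_SIZE = 4 * 1024              # assumed 4 KiB average write
cache_bytes = 1024**3
seconds = cache_bytes / (desktops * WRITE_IOPS_PER_DESKTOP * IO_SIZE)
print(round(seconds))          # well under three minutes
```

Different assumptions about I/O size or per-spindle throughput shift the exact numbers, but not the conclusion: a sustained random-write workload fills a gigabyte-scale cache in minutes.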
The fact is that when the number of writes remains below a certain level, most of them are handled by cache. Therefore it is fast; much faster than for reads. This cache is, however, only a temporary solution for handling the occasional write IO. If write IOs are sustained and great in number, this cache needs to constantly flush to disk, making it practically ineffective. Since, with VDI, the large part of the IOs are write IOs, we cannot assume the cache will fix the write IO problems, and we will always need the proper number of disks to handle the write IOs.
Again, when it comes to traditional SAN arrays he’s right on the money here, but he glosses over how write cache can improve write throughput. While write cache does help with transient workload spikes, in most modular arrays handling VDI workloads, which consist of a consistently high level of random writes, the main role of the write cache is to allow multiple RAID operations to be coalesced into a single operation. This is done by writing more than one block into the same RAID stripe, which amortises the parity I/O across more than one data I/O. A good description of how write coalescing works can be found in this IBM RAID patent. How well this works in a traditional SAN array depends on two main factors:
- The amount of sequentiality in the write data stream
- The size of the active working set (e.g. if all of the blocks are being written to pretty much the same place on the same LUN write coalescing will work well, if they’re randomly placed across the LUN and across different LUNs, its effect is limited).
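To make the coalescing benefit concrete, here is a simplified model of the per-block cost of a RAID-5 read-modify-write when the cache manages to place several dirty blocks into the same stripe. This is an illustrative sketch only; real arrays have more nuanced behaviour (full-stripe writes, parity caching) that this simple formula ignores:

```python
# Simplified RAID-5 read-modify-write cost model: if the cache can place k
# dirty blocks into the same stripe, the old-parity read and new-parity
# write are shared across all k of them.
def backend_ios_per_block(k: int) -> float:
    """Average back-end I/Os per host write when k writes share a stripe."""
    # k old-data reads + k data writes + 1 old-parity read + 1 parity write
    return (2 * k + 2) / k

print(backend_ios_per_block(1))   # 4.0 -> the classic RAID-5 write penalty
print(backend_ios_per_block(4))   # 2.5 -> parity cost amortised across the stripe
```

The catch, as the two factors above suggest, is that k only rises above 1 when incoming writes happen to land near each other; for highly random, widely scattered VDI writes, most stripes receive a single dirty block and the penalty stays close to 4.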
While I don’t have hard metrics to hand for this, anecdotal evidence suggests that the working set sizes for writes in VDI deployments are about 5% of the allocated capacity. Furthermore, while there might be good sequentiality at the beginning of a VDI deployment, this would almost certainly decrease over time as free space within the virtual desktops’ filesystems fragments. Based on the research I’ve done, it would appear that write coalescing with traditional styles of write caching only seems to improve the write IEF by less than 1%, though I’d be happy to hear if anyone has solid evidence to suggest otherwise.
Because there is no RAID overhead for reads in RAID-0, 10, 5 and 6, they should be almost 100% efficient, probably more so depending on the following three factors:
- The size of the read cache
- The size of the active data set (i.e. how much of the data is being accessed in a short period of time)
- The sequentiality and/or predictability of the reads
Excluding megacaching products such as NetApp’s FlashCache, modular arrays have cache sizes between 1GiB and 32GiB, and for the most part these caches are statically assigned to either read or write. Given the heavy write workloads created by VDI environments, it’s likely that the majority of that cache will be allocated to writes, with only one or two GiB allocated to reads. With NetApp FAS, the cache is dynamically balanced between read and write depending on the current workload state. Because Data ONTAP uses write cache so efficiently, in most cases more than 70% of the cache on a FAS array is used to accelerate read operations. Depending on the model, this works out to about 2+GiB at the low end, up to 40+GiB at the current top of the range, which is about as good as it gets in the storage industry.
Estimating the size of the active data set has at times been a subject of considerable debate amongst my colleagues. This metric is an important component of NetApp’s general-purpose “custom sizer”, but outside of things like Oracle Statspack reports, it’s hard to get accurate information on this for any given workload. As a result, storage designers generally resort to conservative “rules of thumb”, also known as SWAGs, or sophisticated wild-ass guesses. The SWAGs I’ve used in the past: for OLTP databases, typically only about 10-20% of the entire database is active at any given time; for Exchange 2003 it was about 70-90%; and CIFS home directory fileshares are generally less than 10%. Based on conversations with my colleagues who have had more experience with VDI deployments than I have, 5% of the allocated storage seems to be a pretty good value to use for the active working set of VDI in its steady state.
For example, let’s say that there are 1000 virtual desktops, each with 10GB of allocated storage; that gives you about 10TB of allocated storage, 5% of which is roughly 500GB of active working set. For every GB of read cache you have in your storage array, you’ll therefore be covering about 0.2% (zero point two percent) of that working set, which means that in most cases the read cache will offload less than 1% of the total reads coming from the back end.
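The working-set arithmetic above is worth spelling out, since it generalises to other desktop counts and cache sizes. The 5% active fraction and the cache sizes below are the rules of thumb from the text, not measured values:

```python
# Working-set arithmetic for the worked example above. The 5% active
# fraction is the rule-of-thumb SWAG from the text, not a measured value.
desktops = 1000
gb_per_desktop = 10
active_fraction = 0.05

allocated_gb = desktops * gb_per_desktop          # 10,000 GB, i.e. ~10 TB
working_set_gb = allocated_gb * active_fraction   # ~500 GB of hot data

# Fraction of the working set covered by various plausible read cache sizes.
for cache_gb in (1, 2, 32):
    coverage = cache_gb / working_set_gb
    print(f"{cache_gb} GB of read cache covers {coverage:.1%} of the working set")
```

Even at the generous end (32GB of cache dedicated entirely to reads), the cache covers only a few percent of the working set, which is why the hit rates stay so low.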
One way around this is to use single-instancing techniques such as FAS Deduplication or linked clones to reduce the size of the active working set for the operating system and application executable portions of the workload. While this approach has been shown to be particularly effective during bootstorms, where the read working set is limited to a few GB of operating system and application executables, I wouldn’t expect it to help nearly as much for the steady state workload. I’d also expect that the way in which roaming profiles are set up might affect this, however this kind of conjecture rapidly moves outside of my area of expertise. I should also note at this stage that I am deliberately excluding the use of FlashCache and other forms of “mega caches”, which I’ll cover in a subsequent blog post.
Another salient point for designers putting VDI workloads on arrays from traditional SAN vendors is that for most VDI deployments, the workload is far more like a NAS workload than a SAN workload. This is because VDI workloads are characterised by high levels of concurrency, with hundreds of simultaneous overlapping read requests at different points in the data set, whereas typical SAN workloads have much lower levels of concurrency, and smaller, hotter, more sequential data. Even if the reads coming from the individual virtual desktops are sequential, because of the way they are interleaved with each other, they end up looking like a bunch of random reads from the point of view of the LUN in the array.
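A toy simulation makes this interleaving effect easy to see: each desktop reads its own region strictly in order, but the merged stream the array receives is scattered all over the LUN. The block numbering scheme here is made up purely for illustration:

```python
# Toy illustration: each desktop reads its own region sequentially, but the
# array sees the randomly interleaved merge of all the streams.
# Block numbers (desktop * 1000 + offset) are invented for this sketch.
import random

random.seed(1)
DESKTOPS = 4
BLOCKS_EACH = 5

# Each desktop reads blocks base, base+1, base+2, ... within its own region.
streams = {d: [d * 1000 + i for i in range(BLOCKS_EACH)] for d in range(DESKTOPS)}

arrival_order = []
while any(streams.values()):
    # A random "ready" desktop issues its next sequential read.
    d = random.choice([d for d, s in streams.items() if s])
    arrival_order.append(streams[d].pop(0))

print(arrival_order)  # in order per desktop, scattered as seen by the LUN
```

Sequential-read detection and prefetching in the array work on the merged `arrival_order`, not on the per-desktop streams, which is why per-LUN readahead gains so little here.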
So what is the key point I’ve been trying to make? To summarise it in one line:
“The read and write caches in traditional modular arrays are too small to make any significant difference to the read and write efficiencies of the underlying RAID configuration in VDI deployments.”
Which is almost kind-of-sort-of good from my point of view, because it makes my next entry, Data Storage for VDI – Part 4 – The impact of RAID on performance, much easier to write.