Data Storage for VDI – Part 10 – Megacaches
More recently a range of products has come to market that take advantage of the increasing affordability of non-volatile memory (particularly SLC Flash) to create caching architectures that change the rules for modular storage. In no particular order:
- PAM-II / FlashCache
- Sun 7000 Logzilla and Readzilla
- FalconStor Flash SAN accelerator (using flash modules from Violin)
- IBM EasyTier / Something to do with SVC
- EMC FAST Cache
- Atlantis Computing vScaler
- Nimble Storage
- Lots more to come …
While I’d love to go into the details of each of these and compare the features and benefits of each technology, a lack of time and detailed information makes this really hard to do. Also, as a general principle, I don’t think that it’s wise for an employee of one vendor to make a lot of assertions about another vendor’s technology. I have enough trouble keeping up with what’s happening at NetApp without trying to gain deep subject matter expertise with, for example, HP or EMC’s technology. Having said that, I do think contrasting two different approaches can be useful, so for that reason I’ve decided to deviate from that principle, and will compare as diligently as I’m able NetApp’s FlashCache and EMC FAST Cache.
I’ve included FlashCache for obvious reasons: there is already more than a Petabyte of it out there, I’ve been analyzing it for about a year now, and I have access to the engineering documentation. I chose FAST Cache because being an EMC product means the marketing engine behind it will make it widely known and the engineering will be solid. The market presence and differing approaches of both of these technologies make them a fairly good yardstick against which the other mega-cache technologies will compare themselves.
Part of the reason I took so long to write this post was that I spent a fair time trying to characterise the likely performance benefits of a FAST Cache solution, which as a competitor is a fairly dangerous exercise. I’ve tried to be even handed and fact based when doing this and have disclosed where possible the sources of my information. However, if you believe I’ve misrepresented the technology please let me know. This is not about vendor bashing, it’s about establishing what I hope is a fair basis of comparison.
Doing this in an even handed fashion was particularly hard because a lot of publicly available information is either incomplete or somewhat contradictory. I know that is an industry wide problem, but this is one area where there seems to be a lot more marketing material than engineering substance. The main sources of my information were blog posts by EMC employees and integrators as well as an official EMC technical report, the details of which, and my takeaways from them, are as follows.
How Fast is FAST for VDI ?
Chad Sakac is quoted here in relation to the speed of writing data in various RAID configurations: “(I’ve added SSD with 6000 IOps as commented by Chad Sakac).” While I respect Chad’s comments to do with EMC’s integration with VMware, I think he might be a little off here, especially given that this comment was made 6 months ago, long before FAST Cache was announced.
Mark Twomey (StorageZilla) says that EFDs have no additional benefit for writes (I assume this applies mostly to Symmetrix, which already does good write optimisation), quoted here where he says “The thing most people don’t understand about Flash is that writes aren’t really all that much faster to a good SSD than they are to a regular disk drive. And thus, predicting where writes are going isn’t an objective of FAST”
Or take Randy Loeschner, who also works for EMC and seems to know his way around a database. He says on his blog “Solid State/Enterprise Flash Drives are similar in Write Performance to 15K Fibre Channel disks, but in READ scenarios are capable of 2500 or more READ IOPs.”
I also checked the available benchmarks and found an EMC document that contained reasonably useful data, though even that seemed to contradict itself with regard to the number of IOPS you could get out of an EFD. It says that an Enterprise Flash Drive (EFD) can get 2,500 IOPS per drive, though without any details as to the latency or the I/O mix. Then further down it says that in a 50:50 read/write 8K I/O environment you can get 1057 IOPS per EFD at 12ms response time for reads and 24ms response time for writes without any additional help from the CLARiiON DRAM based write cache, or 1760 IOPS per EFD at 6ms response time for reads and 2ms response time for writes when the write cache is enabled.
I also found another informative post here at gotitsolutions.org, which shows roughly 1100, 1500, and 2000 IOPS per drive for 100% random writes, 60:40 write:read and 40:60 write:read performance respectively, without help from DRAM caching. Furthermore, from a conversation with a colleague whose opinion I respect, it appears that “FAST Cache does 64K blocks …[which means that EMC] claim 50% more speed overall.”
Based on the above information, I think it would be reasonable to assume that a 6+1 EFD RAID group configured as FAST Cache would allow for between 12,000 and 20,000 sub 5ms IOPS depending on the configuration and workload. That’s pretty good, but it’s not the “orders of magnitude” faster than spinning disk so often claimed, and nowhere near the performance of array cache.
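As a rough sanity check on that range, the sketch below simply multiplies the per-drive figures quoted above by the seven drives in a 6+1 group. This is not how the estimate in this post was derived, and it is not an EMC sizing method. It assumes every drive contributes equally and ignores RAID overhead, latency targets and the 64K block claim, and the names `PER_DRIVE_IOPS` and `aggregate_iops` are made up for illustration.

```python
# Per-drive IOPS figures as quoted in the sources above.
# They were measured at different latencies and I/O mixes, so treat the
# totals as a ballpark only.
PER_DRIVE_IOPS = {
    "50:50 8K, write cache off": 1057,
    "50:50 8K, write cache on": 1760,
    "100% random write": 1100,
    "60:40 write:read": 1500,
    "40:60 write:read": 2000,
    "headline figure": 2500,
}

DRIVES = 7  # 6+1 RAID group; assumption: all drives contribute, RAID overhead ignored


def aggregate_iops(per_drive: int, drives: int = DRIVES) -> int:
    """Naive linear scaling of a per-drive figure across the RAID group."""
    return per_drive * drives


if __name__ == "__main__":
    for label, iops in PER_DRIVE_IOPS.items():
        print(f"{label:28s} {aggregate_iops(iops):>7,d} IOPS")
```

The write-cache-enabled, read-heavy and headline figures scale to somewhere between roughly 12,000 and 17,500 IOPS, which is at least in the same region as the estimate above.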
The benefits of a write mega-cache
In our hypothetical 1000 user VDI environment, at 12 IOPS per user and a 33:67 R:W ratio, the write load equates to about 30MiB/sec of random write activity, or about 108 GiB per hour. A 6+1 RAID group of 146GB EFD drives provides about 822 GiB of usable cache space. If you split this 50:50 between read and write, this works out to about 4 hours of writes before you even begin to need to destage. The thing that differentiates a mega-cache from a standard cache is that it can absorb a sufficiently large number of changes to satisfy hours or possibly even entire business days’ worth of I/O. In addition, a cache this large is almost certainly going to improve the efficiency with which writes can go to the back end RAID group. The extent to which it does this is dependent on many different factors. In some edge cases the additional improvement is marginal, in others it could be close to the kinds of efficiencies typically seen in a NetApp FAS array. In theory a 6+1 RAID-5 disk set combined with a large write cache could approach or even exceed the write efficiency of a 6+6 RAID-10 disk set.
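The arithmetic behind those figures can be sketched as follows. The post does not state the write I/O size, so the 4 KiB used here is an assumption, as is treating writes as roughly two thirds of all I/O. The function names are hypothetical.

```python
USERS = 1000
IOPS_PER_USER = 12
WRITE_FRACTION = 0.67   # assumption: writes are roughly two thirds of the I/O
IO_SIZE_KIB = 4         # assumption: 4 KiB per write, not stated in the post
CACHE_GIB = 822         # usable capacity of the 6+1 EFD group quoted above
WRITE_SHARE = 0.5       # 50:50 split between read and write cache


def write_mib_per_sec(users, iops_per_user, write_fraction, io_size_kib):
    """Random write throughput generated by the desktop population."""
    return users * iops_per_user * write_fraction * io_size_kib / 1024


def hours_before_destage(cache_gib, write_share, mib_per_sec):
    """How long the write half of the cache can absorb writes with no destaging."""
    gib_per_hour = mib_per_sec * 3600 / 1024
    return cache_gib * write_share / gib_per_hour


if __name__ == "__main__":
    rate = write_mib_per_sec(USERS, IOPS_PER_USER, WRITE_FRACTION, IO_SIZE_KIB)
    hours = hours_before_destage(CACHE_GIB, WRITE_SHARE, rate)
    print(f"write rate: {rate:.1f} MiB/sec")
    print(f"hours of writes absorbed: {hours:.1f}")
```

With these assumptions the write rate comes out a little over 30 MiB/sec and the cache absorbs a little under 4 hours of writes, which lines up with the rounded numbers above. A larger write size would shorten that window proportionally.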
The benefits of a read mega-cache
On the read side of the equation, mega-caches in the order of 250GiB+ have the advantage that they are able to store the majority of the active working set, especially in VDI environments, where it is not unusual to see the cache offloading 80+% of the read I/O from the disks. This not only improves the latency of the I/Os served from cache but also of those that need to come from disk. The disk improvements come from reduced I/O contention, and from the ability to make read-ahead more effective, as detection of the read pattern which triggers the read-ahead functions happens while the data is being served from cache. It also allows the read-ahead algorithms to be more aggressive, as the potential risks of reading in too much data and flushing out other useful data are mitigated by the much larger read caches.
The Net-Net is that mega-caches can significantly reduce the average latency for disk I/O even in spindle constrained environments, and the ability to handle peak loads is significantly improved.
NetApp really stoked the market for mega-caches when it released the PAM-II, now called FlashCache (I’m kind of sad they changed the name, there were lots of bad PAM puns like “Flash in the PAM” that few if any will now remember). Unlike SSD/EFD based cache architectures, FlashCache connects to the storage controller via PCIe, and includes a NetApp created flash translation layer, some dedicated hardware acceleration, and a driver which is tuned to the characteristics of all of this hardware. All of this results in a cache which is capable of hundreds of thousands of sub 2ms IOPS with shorter code paths and higher levels of CPU efficiency than is seen in SSD/EFD based caches.
Another thing that helps is cache awareness of FAS Deduplication and FlexClones, which in effect multiplies the effective size of the cache by the level of deduplication within the active dataset. For example, if you are using deduplication for persistent desktop guest O/S images and seeing 95% deduplication ratios (especially for the 2GB core operating system portions of the image), your effective cache size is 20x larger. This means that even a modest FAS2040 with 4GB of RAM can have an effective read cache of 50+GiB, which comes in really handy during boot storms. For a 256GB FlashCache, using the same math, the effective cache size ends up being around 5TB ! That’s a best case situation, but the strange thing about VDI on NetApp is that best case scenarios just keep coming up over and over again, which is what prompted me to start on this series of posts in the first place.
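The "same math" is just the reciprocal of the fraction of blocks that remain unique after deduplication. Here is a minimal sketch, assuming the whole cached working set deduplicates at the stated ratio (the best case described above). The function names are hypothetical.

```python
def dedup_multiplier(dedup_ratio: float) -> float:
    """Factor by which a dedup-aware cache is effectively enlarged.

    dedup_ratio is the fraction of blocks removed by deduplication,
    e.g. 0.95 for the 95% figure above.
    """
    if not 0 <= dedup_ratio < 1:
        raise ValueError("dedup_ratio must be in the range [0, 1)")
    return 1 / (1 - dedup_ratio)


def effective_cache_gb(physical_gb: float, dedup_ratio: float) -> float:
    """Effective cache size if all cached data deduplicates at dedup_ratio."""
    return physical_gb * dedup_multiplier(dedup_ratio)


if __name__ == "__main__":
    print(f"multiplier at 95% dedup: {dedup_multiplier(0.95):.0f}x")
    print(f"256GB cache: about {effective_cache_gb(256, 0.95) / 1000:.1f}TB effective")
```

At 95% the multiplier is 20x, so 256GB of physical cache behaves like roughly 5TB. At a more conservative 50% the same cache would only double, which is why the ratio within the active dataset matters so much.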
Isn’t FlashCache just for reads ?
As good as FlashCache is, some commentators have pointed out that this cache is read only, which is correct, but they then go on to draw the incorrect conclusion that it does nothing for write performance. This might elicit a “Thank you Captain Obvious” from some, but yet again this is one of those things which, like the sun revolving around the earth, is simple, understandable, full of common sense, and also happens to be wrong.
FlashCache + Dedup/FlexClone + Realloc = High speed write cache.
If you’ve read through this entire series, you might remember the following statement:
“Thus we expect to write 336 blocks in 58+16= 74 disk operations. This gives us a write IEF of 454%”
This was on the assumption that the system was about 80% full and that the best allocation area was about 40% utilised. But what happens when the best allocation area is completely unallocated ? This question was already covered in the following blog post, though it would appear the author decided to take another job outside of NetApp (good luck at CommVault Mike 🙂) and his NetApp blog may get cleaned up at some later time, so I’ve taken the liberty of quoting an excerpt from it.
“For demonstration, I configured a single 3 disk NetApp aggregate (2 parity, 1 data) to demonstrate how much random write I/O I could get out of a single 1TB 7200 SATA drive .. The result is over 4600 random write IOPs with an average response time of 0.4ms. “
This was 1 SATA data drive (the other two were parity drives, which don’t add to write speeds) delivering 4600 random write IOPS. If you extrapolate this, 4 1TB SATA drives will give you about 3TB of usable storage and 18,400 IOPS. Woo Hoo ! 4 SATA drives from NetApp = 7 EFD drives from EMC, game over, discussion closed .. right ?
Well, yes, under ideal circumstances, but the world is not a perfect place, and neither are the datacenters which inhabit it, even with VDI. So what might stop this from working outside of the unicorn farm ?
1. Those disks won’t stay empty
True, but to be equal to the amount of write cache used in our 7 disk EFD cache (assuming a 50:50 split between read and write cache) we could fill those 4 SATA drives with 2.5TB of data and still have an equivalent write caching capability.
2. That free space won’t stay contiguous
True, but NetApp provides methods to re-arrange the free space via the reallocate -A command (sometimes called segment cleaning). This option is particularly well suited to VDI environments where large burst writes are fairly typical and where optimising access for sequential reads is not generally considered a high priority.
3. There will be competition for read I/O
True, but single instancing technology and smart caching allow the majority of those reads to be served from cache.
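The ideal-case numbers used in this section can be checked with a few lines of arithmetic. The sketch below reproduces the write IEF from the earlier quote, the linear extrapolation from one SATA data drive to four, and the free space left after loading 2.5TB onto roughly 3TB of usable capacity. Linear scaling across drives is the same optimistic assumption made in the text, and the function names are hypothetical.

```python
GIB = 2**30
TB = 10**12


def write_ief(blocks_written: int, disk_ops: int) -> float:
    """Write I/O efficiency factor: blocks written per disk operation."""
    return blocks_written / disk_ops


def extrapolated_write_iops(per_data_drive: int, data_drives: int) -> int:
    """Naive linear scaling of the single-drive result (ideal case only)."""
    return per_data_drive * data_drives


def free_space_gib(usable_tb: float, stored_tb: float) -> float:
    """Unallocated space left to absorb writes, converted from TB to GiB."""
    return (usable_tb - stored_tb) * TB / GIB


if __name__ == "__main__":
    print(f"write IEF: {write_ief(336, 58 + 16):.0%}")
    print(f"4 data drives: {extrapolated_write_iops(4600, 4):,d} IOPS")
    print(f"free space after 2.5TB: {free_space_gib(3, 2.5):.0f} GiB")
    print(f"EFD write cache share: {822 / 2:.0f} GiB")
```

The free space remaining on the SATA drives is slightly more than the 50% write share of the 822 GiB EFD cache, which is the sense in which the two are "equivalent" in point 1. It says nothing about how long that free space stays contiguous, which is what point 2 addresses.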
But what about the real world ?
I plan to cover each one of these in some blogs on detailed performance tuning for NetApp, but rather than delve even deeper into abstract theory, I’m going to pull some data and graphs from an existing 2000+ seat VDI deployment that uses FlashCache and Reallocate to manage some very bursty I/O patterns. The interesting thing about this particular implementation is that it is far from an “ideal” workload and shows what can be done with a little bit of planning and some really smart storage controllers. In addition, with a little luck and some persistence I’ll also pull up a far more modest lab environment and see exactly how much you can wring out of a NetApp controller on a tight budget.