
Data Storage for VDI – Part 10 – Megacaches


Megacaches

More recently, a range of products has come to market that takes advantage of the increasing affordability of non-volatile memory (particularly SLC flash) to create caching architectures that change the rules for modular storage. In no particular order:

  • PAM-II / FlashCache
  • Sun 7000 Logzilla and Readzilla
  • FalconStor Flash SAN accelerator (using flash modules from Violin)
  • IBM EasyTier / Something to do with SVC
  • EMC FAST Cache
  • Atlantis Computing vScaler
  • Nimble Storage
  • Lots more to come …

While I’d love to go into the details of each of these and compare the features and benefits of each technology, a lack of time and detailed information makes this really hard to do. Also, as a general principle, I don’t think it’s wise for an employee of one vendor to make a lot of assertions about another vendor’s technology. I have enough trouble keeping up with what’s happening at NetApp without trying to gain deep subject matter expertise in, for example, HP’s or EMC’s technology. Having said that, I do think contrasting two different approaches can be useful, so I’ve decided to deviate from that principle and will compare, as diligently as I’m able, NetApp’s FlashCache and EMC’s FAST Cache.

I’ve included FlashCache for obvious reasons: there is already more than a petabyte of it out there, I’ve been analyzing it for about a year now, and I have access to the engineering documentation. I chose FAST Cache because, being an EMC product, the marketing engine behind it will make it widely known and the engineering will be solid. The market presence and differing approaches of these two technologies make them a fairly good yardstick against which the other mega-cache technologies will compare themselves.

Part of the reason I took so long to write this post was that I spent a fair amount of time trying to characterise the likely performance benefits of a FAST Cache solution, which, as a competitor, is a fairly dangerous exercise. I’ve tried to be even-handed and fact-based when doing this and have disclosed where possible the sources of my information; however, if you believe I’ve misrepresented the technology please let me know. This is not about vendor bashing, it’s about establishing what I hope is a fair basis of comparison.

Doing this in an even-handed fashion was particularly hard because a lot of the publicly available information is either incomplete or somewhat contradictory. I know that is an industry-wide problem, but this is one area where there seems to be a lot more marketing material than engineering substance. The main sources of my information were blog posts by EMC employees and integrators, as well as an official EMC technical report; the details, and my takeaways from them, are as follows.

How Fast is FAST for VDI?

Chad Sakac is quoted here in relation to the speed of writing data in various RAID configurations: “(I’ve added SSD with 6000 IOps as commented by Chad Sakac).” While I respect Chad’s comments on EMC’s integration with VMware, I think he might be a little off here, especially given that this comment was made six months ago, long before FAST Cache was announced.

Mark Twomey (StorageZilla) says that EFDs have no additional benefit for writes (I assume this applies mostly to Symmetrix, which already does good write optimisation), quoted here where he says: “The thing most people don’t understand about Flash is that writes aren’t really all that much faster to a good SSD than they are to a regular disk drive. And thus, predicting where writes are going isn’t an objective of FAST”

Or Randy Loeschner, who also works for EMC and seems to know his way around a database, who says on his blog: “Solid State/Enterprise Flash Drives are similar in Write Performance to 15K Fibre Channel disks, but in READ scenarios are capable of 2500 or more READ IOPs.”

I also checked the available benchmarks and found an EMC document that contained reasonably useful data, though even that seemed to contradict itself with regard to the number of IOPS you can get out of an EFD. It says that an Enterprise Flash Drive (EFD) can deliver 2,500 IOPS per drive, though without any details as to the latency or the I/O mix. Further down it says that in a 50:50 read:write 8K workload you can get 1,057 IOPS per EFD at 12ms response time for reads and 24ms for writes without any additional help from the CLARiiON DRAM-based write cache, or 1,760 IOPS per EFD at 6ms response time for reads and 2ms for writes when the write cache is enabled.

I also found another informative post here at gotitsolutions.org, which shows roughly 1,100, 1,500, and 2,000 IOPS per drive for 100% random write, 60:40 write:read, and 40:60 write:read workloads respectively, without help from DRAM caching. Furthermore, from a conversation with a colleague whose opinion I respect, it appears that “FAST Cache does 64K blocks …[which means that EMC] claim 50% more speed overall”.

Based on the above information, I think it would be reasonable to assume that a 6+1 EFD RAID group configured as FAST Cache would allow for between 12,000 and 20,000 sub-5ms IOPS, depending on the configuration and workload. That’s pretty good, but it’s not the “orders of magnitude” faster than spinning disk so often claimed, and nowhere near the performance of array cache.
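
For anyone who wants to check my arithmetic, here is a quick back-of-envelope sketch in Python. The per-EFD figures are the ones quoted above, and the assumption that all seven drives in the 6+1 group contribute equally (ignoring parity overhead) is mine, not something from an EMC document.

```python
# Rough aggregate IOPS for a 6+1 EFD RAID group used as FAST Cache, based on
# the per-drive figures quoted above. Assumption: all seven drives service
# cache I/O equally, and parity overhead is ignored.

per_efd_iops = {
    "50:50 r:w, 8K, no DRAM write cache": 1057,   # EMC technical report
    "50:50 r:w, 8K, DRAM write cache on": 1760,   # EMC technical report
    "100% random write":                  1100,   # gotitsolutions.org
    "40:60 write:read":                   2000,   # gotitsolutions.org
    "100% read rule of thumb":            2500,   # common EMC figure
}

drives = 7  # 6+1 RAID-5 group

for workload, iops in per_efd_iops.items():
    print(f"{workload:38s} ~{iops * drives:>6,} IOPS for the group")

# This lands between roughly 7,400 and 17,500 IOPS depending on the mix;
# allowing for DRAM caching and the claimed 64K-block benefit gets you into
# the 12,000-20,000 range assumed above.
```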

The benefits of a write mega-cache

A write cache in our hypothetical 1,000-user VDI environment, at 12 IOPS per user with a 33:67 R:W mix, equates to about 30MiB/sec of random write activity, or about 108 GiB per hour. A 6+1 RAID group of 146GB EFD drives provides about 822 GiB of usable cache space. If you split this 50:50 between read and write, that works out to about 4 hours of writes before you even begin to need to destage. This is what differentiates a mega-cache from a standard cache: it can absorb a sufficiently large number of changes to satisfy hours, or possibly even entire business days’, worth of I/O. In addition, a cache this large is almost certainly going to improve the efficiency with which writes go to the back-end RAID group. The extent to which it does this depends on many different factors. In some edge cases the additional improvement is marginal; in others it could be close to the kinds of efficiencies typically seen in a NetApp FAS array. In theory, a 6+1 RAID-5 disk set combined with a large write cache could approach or even exceed the write efficiency of a 6+6 RAID-10 disk set.
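
As a sanity check on those numbers, here is a minimal sketch; the 4KiB write size is my assumption, and the rest of the figures come straight from the paragraph above.

```python
# Hours of writes a 6+1 EFD FAST Cache group could absorb for the
# hypothetical 1,000-seat VDI environment above. The 4KiB I/O size is an
# assumption; the other numbers are taken from the text.

users          = 1000
iops_per_user  = 12
write_fraction = 0.67            # 33:67 read:write mix
io_size_kib    = 4               # assumed random write size

write_iops       = users * iops_per_user * write_fraction     # ~8,000
write_mib_per_s  = write_iops * io_size_kib / 1024            # ~31 MiB/s
write_gib_per_hr = write_mib_per_s * 3600 / 1024              # ~110 GiB/hr

usable_cache_gib = 822           # 6+1 RAID group of 146GB EFDs
write_cache_gib  = usable_cache_gib / 2                       # 50:50 split

print(f"{write_mib_per_s:.0f} MiB/s, {write_gib_per_hr:.0f} GiB/hour of writes")
print(f"~{write_cache_gib / write_gib_per_hr:.1f} hours absorbed before destaging")
```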

The benefits of a read mega-cache

On the read side of the equation, mega-caches of 250GiB or more have the advantage that they can store the majority of the active working set, especially in VDI environments, where it is not unusual to see them offloading 80+% of the read I/O from the disks. This not only improves the latency of the I/Os served from cache but also of those that still need to come from disk. The disk-side improvements come from reduced I/O contention, and from more effective read-ahead, because detection of the read pattern that triggers the read-ahead functions happens while the data is being served from cache. It also allows the read-ahead algorithms to be more aggressive, as the potential risks of reading in too much data and flushing out other useful data are mitigated by the much larger read cache.

The Net-Net is that mega-caches can significantly reduce the average latency for disk I/O even in spindle constrained environments, and the ability to handle peak loads is significantly improved.
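
A trivial weighted-average model illustrates the point; the hit rate and latency figures below are illustrative assumptions only, not measurements from any particular array.

```python
# Why a large read cache lowers *average* read latency even with the same
# spindle count. All figures here are illustrative assumptions.

def avg_read_latency_ms(hit_rate: float, cache_ms: float, disk_ms: float) -> float:
    """Weighted average of cache hits and reads that still go to disk."""
    return hit_rate * cache_ms + (1 - hit_rate) * disk_ms

uncached   = avg_read_latency_ms(0.0, 0.0, 10.0)  # everything from disk
mega_cache = avg_read_latency_ms(0.8, 1.0, 8.0)   # 80% offload; the remaining
                                                  # 20% also improves a little
                                                  # (less contention, better
                                                  # read-ahead)

print(f"disk only : {uncached:.1f} ms average read latency")
print(f"mega-cache: {mega_cache:.1f} ms average read latency")
```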

FlashCache

NetApp really stoked the market for mega-caches when it released the PAM-II, now called FlashCache (I’m kind of sad they changed the name; there were lots of bad PAM puns like “Flash in the PAM” that few if any will now remember). Unlike SSD/EFD-based cache architectures, FlashCache connects to the storage controller via PCIe, includes a NetApp-designed flash translation layer and some dedicated hardware acceleration, and uses a driver tuned to the characteristics of all of this hardware. The result is a cache capable of hundreds of thousands of sub-2ms IOPS, with shorter code paths and higher CPU efficiency than is seen in SSD/EFD-based caches.

Another thing that helps is cache awareness of FAS deduplication and FlexClones, which in effect multiplies the effective size of the cache by the level of deduplication within the active dataset. For example, if you are using deduplication for persistent desktop guest OS images and seeing 95% deduplication ratios (especially for the roughly 2GB core operating system portion of the image), your effective cache size is 20x larger. This means that even a modest FAS2040 with 4GB of RAM can have an effective read cache of 50+GiB, which comes in really handy during boot storms. For a 256GB FlashCache card, using the same math, the effective cache size ends up being around 5TB! That’s a best-case situation, but the strange thing about VDI on NetApp is that best-case scenarios just keep coming up over and over again, which is what prompted me to start on this series of posts in the first place.
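
Here is the same arithmetic as a small sketch; the 95% dedup ratio and the 256GB cache size are the examples from the paragraph above, and the simple multiplier is an idealisation.

```python
# Effective read cache size when the cache is deduplication/FlexClone aware.
# A dedup ratio of 95% means one cached physical block can satisfy reads for
# roughly 20 logical copies.

def effective_cache_gib(physical_gib: float, dedup_ratio: float) -> float:
    """dedup_ratio is the fraction of logical blocks eliminated, e.g. 0.95."""
    return physical_gib / (1 - dedup_ratio)      # 0.95 -> 20x multiplier

print(effective_cache_gib(256, 0.95))   # 256GB FlashCache -> ~5,120 GiB (~5TB)

# For the FAS2040 example, only part of the 4GB of controller RAM is usable
# as read cache, which is why the text arrives at "50+ GiB" rather than the
# simple upper bound of 4 x 20 = 80 GiB.
```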

Isn’t FlashCache just for reads?

As good as FlashCache is, some commentators have quite correctly pointed out that this cache is read-only, but they then draw the incorrect conclusion that it therefore does nothing for write performance. This might elicit a “thank you, Captain Obvious” from some, but yet again this is one of those things which, like the sun revolving around the earth, is simple, understandable, full of common sense, and also happens to be wrong.

FlashCache + Dedup/FlexClone + Realloc = High-speed write cache.

If you’ve read through this entire series, you might remember the following statement:

“Thus we expect to write 336 blocks in 58+16= 74 disk operations. This gives us a write IEF of 454%”

This was on the assumption that the system was about 80% full and that the best allocation area was about 40% utilised. But what happens when the best allocation area is completely unallocated? This question was already covered in the following blog post, though it would appear the author has decided to take another job outside of NetApp (good luck at CommVault, Mike 🙂) and his NetApp blog may get cleaned up at some later time, so I’ve taken the liberty of excerpting it here.

“For demonstration, I configured a single 3 disk NetApp aggregate (2 parity, 1 data) to demonstrate how much random write I/O I could get out of a single 1TB 7200 SATA drive .. The result is over 4600 random write IOPs with an average response time of 0.4ms. “

That was one SATA data drive (the other two were parity drives, which don’t add to write speed) delivering 4,600 random write IOPS. If you extrapolate this, four 1TB SATA drives will give you about 3TB of usable storage and 18,400 IOPS. Woo hoo! Four SATA drives from NetApp = seven EFD drives from EMC, game over, discussion closed .. right?
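
For what it’s worth, the extrapolation is just this (the per-drive figure comes from the quoted test, and the usable capacity is the rough number above):

```python
# Extrapolating the quoted single-data-drive result. Parity drives add
# protection, not write throughput, so only data drives count towards IOPS;
# usable capacity is approximate and depends on aggregate overheads.

iops_per_sata_data_drive = 4600   # from the quoted 3-disk (1 data) test
data_drives              = 4
usable_tb                = 3      # rough usable capacity, per the text

print(f"{data_drives * iops_per_sata_data_drive:,} random write IOPS "
      f"across roughly {usable_tb}TB of usable space")
```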

Well, yes, under ideal circumstances, but the world is not a perfect place, and neither are the datacenters that inhabit it, even with VDI. So what might stop this from working outside of the unicorn farm?

1. Those disks won’t stay empty

True, but to match the amount of write cache available in our 7-disk EFD cache (assuming a 50:50 split between read and write), we could fill those 4 SATA drives with 2.5TB of data and still have an equivalent write caching capability.

2. That freespace won’t stay contiguous

True, but NetApp provides methods to rearrange the free space via the reallocate -A command (sometimes called segment cleaning). This option is particularly well suited to VDI environments, where large bursts of writes are fairly typical and where optimising access for sequential reads is not generally considered a high priority.

3. There will be competition for read I/O

True, but single instancing technology and smart caching allow the majority of those reads to be served from cache.

But what about the real world?

I plan to cover each of these in some future blog posts on detailed performance tuning for NetApp, but rather than delve even deeper into abstract theory, I’m going to pull some data and graphs from an existing 2000+ seat VDI deployment that uses FlashCache and reallocate to manage some very bursty I/O patterns. The interesting thing about this particular implementation is that it is far from an “ideal” workload, and it shows what can be done with a little bit of planning and some really smart storage controllers. In addition, with a little luck and some persistence, I’ll also pull up a far more modest lab environment and see exactly how much you can wring out of a NetApp controller on a tight budget.

  1. August 19, 2010 at 9:38 pm

    Disclosure – EMCer here.

    An excellent post. Now – on the SSD performance envelopes, the 2500 read IOps value (which is used commonly in many EMC docs) strikes me as very, very conservative. It’s our internal conservative assumption (orders of magnitude lower than the drive’s rated values) because we tend to be very conservative. Also, it’s not accurate to say “they don’t have a write benefit”. It’s more accurate to say “write latency is generally a cache (DRAM/NVRAM) response unless there is a forced flush condition, where write latency is the backend latency. Flash generally supports similar (though still higher) write IOPs, but the write latency for any given write is lower with flash than with a traditional magnetic media”.

    Second – in the EMC FAST Cache use case, there is a very substantial effect of the write cache – since latency is several orders of magnitude lower than a 15K RPM spindle (microseconds in Flash vs. milliseconds to disk) though several orders of magnitude higher than DRAM or NVRAM (nanoseconds vs. microseconds in Flash) , when data is cached in FAST Cache it does act as very large extension of system write and read cache.

    The argument can be made (and you make it), that the array architectures are very different (and they are), such that write caching benefits EMC’s more and would have a lesser effect in NetApp’s case (which is governed by NVRAM and the effort to journal contiguous writes) – and that seems fair to me, but it’s notable that for EMC customers, the fact that FAST Cache can massively increase write cache is a benefit for them. It will be interesting to see independent customer and other analysis of these in the wild.

    Third – the question of the interconnect (in a disk enclosure vs in a PCIe interface) is an interesting one. Trace the latencies.

    (PCIe based SSD/Flash)
    – CPU to PCIe bus – nanoseconds
    – PCIe bus to flash controller – nanoseconds
    – Flash – microseconds.

    (flash sitting in a disk enclosure)
    – CPU to bus – nanoseconds
    – PCIe bus to backend controller – nanoseconds
    – backend controller to flash controller – nanoseconds
    – Flash – microseconds.

    So – PCIe based flash in the storage controller will have several nanoseconds lower latency in an end-to-end use case that’s measured in microseconds. That’s a .00x (three orders of magnitude) rounding error.

    There are four big benefits that offset that fractional latency difference:

    1) cache doesn’t need to rewarm after a controller failure or rolling non-disruptive upgrade. This is immaterial in some use cases (VDI) because the cache warms fast, and perhaps that period of “you see the backend performance directly” is acceptable. There are other use cases (database running on SATA fronted by mega cache architecture) where there is a very pronounced “warm vs. cold cache” performance delta.

    2) the cache can be twice as effective – as it’s shared across both controllers.

    3) MOST importantly – it makes adding cache really really easy – which represents a big economic benefit. As opposed to opening up the storage controller to add initial or incremental flash, you just pop in an SSD. This means existing customers can just non-disruptively add. It means that you can start with a smaller amount, and add more later (leveraging the plummeting cost of Flash – any time you can buy more later, well – 6 months translates to a 30%-50% lower price).

    Again – a great post, thank you for this addition to the dialog!

    • August 25, 2010 at 10:58 pm

      Chad,
      thanks for the kind words, I’d agree that 2500 read IOPS from a single SSD (assuming 100% read) is very very conservative, the gotitsolutions.org blog I referenced shows about 5500 in a pure read environment. The thing I’ve been trying to emphasize in this series of posts is that the nature of the workload can dramatically change the number of IOPS per “drive” (or is device a better word these days?), and my private hope is that they will lift the level of information given by all vendors around this important technology so that we (as engineers) have more engineering data and less hyperbole. My whacky idea is that we can then compete on our merits and that customers can make informed decisions about which technology fits their requirements the best (call me a dreamer).

      Having said that, I’m still not sure that statements like

      “In the EMC FAST Cache use case, there is a very substantial effect of the write cache – since latency is several orders of magnitude lower than a 15K RPM spindle”

      are entirely accurate

      The inference seems to be that flash has latency “several orders of magnitude” lower than disk. It might be so with raw flash reads, but once you package those little NAND chips into EFDs and access them through flash translation layers via a SCSI command set, you barely get a single order of magnitude better for 100% read workloads, and not even half of that for a 60:40 write:read workload. Don’t get me wrong, they are faster, but not “several orders of magnitude”.

      I’m basing my assumptions on the gotitsolutions.org benchmarks, which were done without the additional help of DRAM-based caching, or using 64K blocks, so I’m sure there will be additional benefit there. How much is debatable, and I’m looking forward to seeing some more benchmarks now that this long-talked-about technology is actually available. I’m also kind of interested to see if the new FLARE code will disable the CLARiiON DRAM write cache on failover events, or if EMC believes its customers should take the risk of losing data in a rolling failure. Though at the risk of spreading FUD, I should point out that controller failures these days are exceedingly rare events, so it’s probably a moot point.

      In my mind, the interesting thing about the FlashCache technology is not whether the latency is measured in nanoseconds or microseconds (which is kind of silly, because even the most powerful arrays won’t acknowledge a write in under 250 microseconds in any case). What is interesting is that the performance of a single 256GiB FlashCache card can be measured in hundreds of thousands of IOPS, all generally serviced in less than 2ms. For relatively small, very hot workloads, that’s something that is very hard to match. For larger working sets, the caching algorithms share this IOPS density surprisingly well across existing spinning disk to lower the average latency for both reads and writes. This is not to say that this approach is intrinsically superior in all cases, but for many storage workloads generated by virtualised environments, especially for VDI, a modest amount of FlashCache goes a long, long way.

      I think the next year or two will bring a lot of interesting ways of combining radically different storage technologies (not just the various kinds of spinning disk, DRAM and Flash we have today). Whether it’s called “fully automated storage tiering”, or “dynamic storage tiering”, or “really cool storage mashups”, or whatever, I strongly believe the majority of the value will come from megacache-based approaches at many layers in the I/O path.

      NetApp led the way with PAM, FlexCache and now FlashCache; EMC and (eventually??) the other storage vendors are beginning to follow suit. There are times I wish NetApp pre-announced technology to the same extent that EMC has done recently with FAST (how long has it been now, a year?), because then I could talk about the stuff I know is coming down the pipe, but until then it’s going to be interesting to see how quickly everyone else scrambles to catch up to where we are today.

      In closing, I’d like to say thanks for your comments, and extend the request that if you think I’ve been unfair or misleading at any point, please let me know. While a certain amount of “point scoring” is inevitable (not to mention fun) between competing vendors on these kinds of posts, my genuine desire is to increase understanding and raise the level of debate.

      Regards
      John Martin

  2. August 20, 2010 at 2:03 am

    The 2500 IOPS number estimates extreme read/write workloads and is used as the floor number for anyone who doesn’t want to do the maths. If performance tuning isn’t a high priority you can assume 2500 and you won’t find yourself painted into a corner if your workload turns out to be more extreme than you thought.

    If performance tuning is a priority and you want a more accurate number then you have to consider IO size and level of concurrency in your workload and then run the numbers.

    Without optimisation I’ve managed to get more than 5.6K IOPS meaning Chad’s 6000 IOPs number is easily within range.

    • August 25, 2010 at 11:24 pm

      Wow,
      comments from both Mr Sakac and Mr Twomey, I am genuinely honoured, thank you. I agree that conservative rules of thumb like this are pretty good ways of talking about this stuff to people who don’t have time to do the math. My guesstimated values after a boatload of research and conjecture came out to between 1500 and 2800 IOPS depending on the workload, so personally I would have put it at closer to 2000, especially for VDI workloads. Nonetheless you are the ones with access to the EMC engineering documents, so I’ll assume your rules of thumb are fair and reasonable, though I’m curious to know what assumptions you make about the workload to come up with that figure. I’d also be interested to know if that 5.6K IOPS figure you mentioned was for a 100% read workload? It’s really close to the 5543.46 IOPS/EFD in the gotitsolutions.org 100% read test. If so, wouldn’t that make 6000 IOPS/EFD kind of an extreme edge case? Then again, as I said in my reply to Chad, now that FAST2 is finally released it will be interesting to see how well your engineers have done their jobs; there’s a lot of optimisation that can be done with some smart software and a little bit of DRAM.

      In closing, I’m also cognizant of the fact that you might believe I have taken your comments about SSDs and write effectiveness out of context; if so, please let me know.

      Regards
      John Martin

  3. Andrew Miller
    August 23, 2010 at 12:50 pm

    Absolutely fascinating series….greatly helpful in going deeper on NetApp (and even EMC a bit as well)….looking forward to future blog posts.

  4. Andrew Miller
    August 23, 2010 at 12:54 pm

    And….I can also confirm that I’ve seen some crazy high IOPs numbers from SATA disk at times (watching via Performance Advisor).

  5. August 24, 2010 at 8:57 am

    I think the point vs EMC FAST Cache is that, when doing a lot of writes, you’ll get about 1,100 random 4K writes per STEC SSD (based on 100% write benchmarks). Which is not bad vs a FC drive but for a system that has, say, 700 drives and a very heavy write workload, how does it work then?

    Too little info is available besides the fact it’s not organized in 4K blocks but rather 64K – which drops efficiency but might help with speeds.

    I can see it helping on systems with a lot of SATA…

    D

  6. Andrew Miller
    August 24, 2010 at 12:09 pm

    And….this is somewhat besides the point but I’d be curious to hear about caching in other arrays (EMC I guess mostly since it’s the conversation topic but curious to learn more in general here — maybe 3PAR too since they’re all in the news ;-). I’ve heard and read a decent amount over the years about how NetApp’s cache+filesystem+NVRAM+disk usage is a competitive differentiator (with this series of posts as some of the best technically detailed reference I’ve read) but not much about other array vendors beyond “cache is good” and “we have cache too”. 😉

    That doesn’t necessarily have to be in the comments here (not trying to hijack the post….got to get my own blog started soon)….but links to other posts/articles would be fantastic (speaking as a pre-sales and post-sales SE for a VAR that carries multiple storage vendors and likes to have deep enough technical knowledge to recommend the right solution for the right business problem….or at least present in sufficient detail to let the customer make an honest choice).

  7. August 24, 2010 at 10:41 pm

    Andrew,

    NetApp always has been extremely open regarding how our systems operate – we provide detailed docs explaining the patents (and more) to anyone that cares, and all engineers understand the inner workings well.

    Most other vendors don’t really show the inner workings – I won’t speculate on the reasons.

    For instance, I understand EMC is moving to a filesystem-based box (if they haven’t already but not telling anyone). However, you see no mention of that (maybe because they kept telling everyone how evil filesystems are). Maybe a necessary evil? 🙂

    I think it’s probably down to the difference between a car maker that produces a 2-liter engine that’s as powerful as the competitor’s 5-liter one yet burns less fuel.

    The guy making the 2-liter engine will probably have a ton more documentation explaining the coolness of the engineering that allows such marvels.

    The guy making the 5-liter engine will usually just say “bigger is better and always has been, no need for techie wizardry, ’nuff said”.

    D

  8. October 9, 2011 at 8:27 pm

    I really wish I knew what you just said … 🙂

