Data Storage for VDI – Part 7 – 1000 heavy users on 18 spindles


The nice thing from my point of view is that because VDI’s steady-state performance is characterized by a high percentage of random writes and high concurrency, the performance architecture of Data ONTAP has been well optimized for VDI for quite some time, in fact since before VDI was really a focus for anyone. As my dad said to me once, “Sometimes it’s better to be lucky than it is to be good” 🙂

As proof of this, I ran our internal VDI sizing tools with the following inputs:

  • 1000 users
  • 50% reads, 50% writes
  • 10 IOPS per desktop
  • 10 GB single-instanced operating system image (using FAS deduplication)
  • 0.5 GB RAM per guest (used to factor in the vSwap requirements)
  • 1 GB of unique data per user (deliberately low to keep the focus on the number of disks required for IOPS)
  • 20 ms read response time
  • WAFL filesystem 90% full

The sizer came back needing only 24 FC disks to satisfy the IOPS requirement on our entry-level FAS2040 controller, without any form of SSD or extra accelerators.
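
For anyone who wants to sanity-check the arithmetic, the back-of-the-envelope sketch below shows how the per-spindle figures quoted in the next paragraph fall out of the inputs above. The disk counts (24 here, and 18 for the 20:80 read/write case discussed next) come straight from the sizer; everything else is simple division, so treat it as purely illustrative.

```python
# Back-of-the-envelope check of the sizer output (illustrative only).
USERS = 1000
IOPS_PER_DESKTOP = 10                  # steady-state sizing input from the list above

total_iops = USERS * IOPS_PER_DESKTOP  # 10,000 IOPS across the whole estate

for disks, mix in [(24, "50:50 R/W"), (18, "20:80 R/W")]:
    iops_per_disk = total_iops / disks
    users_per_disk = USERS / disks
    print(f"{mix}: {disks} disks -> {iops_per_disk:.0f} IOPS per disk, "
          f"{users_per_disk:.0f} users per disk")

# Output:
# 50:50 R/W: 24 disks -> 417 IOPS per disk, 42 users per disk
# 20:80 R/W: 18 disks -> 556 IOPS per disk, 56 users per disk
```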

That works out to over 400 IOPS per 15K disk, or about 40 users per 15K disk, four times the 10 users per 15K RAID-DP spindle predicted by Ruben’s model. For the 20% read / 80% write example the numbers are even better: only 18 disks on the FAS2040, which is 555 IOPS, or about 55 users per disk, versus the 9 predicted by Ruben’s model (more than six times the prediction). To see how this compares to other SAN arrays, check out the following table, which outlines the expected efficiencies of RAID 5, RAID 10, and RAID 6 for VDI workloads.

Configuration             Read IEF  Write IEF  Overall efficiency (30:70 R/W)  Overall efficiency (50:50 R/W)
RAID-5                    100%      25%        47.5%                           62.5%
RAID-10                   100%      50%        65%                             75%
RAID-6                    100%      17%        41.9%                           58.5%
RAID-DP + WAFL 90% full   100%      200-350%   230%                            170%

The really interesting thing about these results is that as the workload becomes more dominated by write traffic, RAID-DP + WAFL gets even greater efficiencies. At a 50:50 workload the write IEF is around 240%, but at a 30:70 workload the write IEF is close to 290%. This happens because random reads inevitably cause disk seeks, whereas WAFL lays the incoming random writes out sequentially on disk, so the writes are pretty much always sequential.
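
If it helps to see the blending spelled out, the short sketch below reproduces the overall figures in the table. It assumes the overall number is a simple read/write-weighted average of the per-operation IEFs; that is my own reading of the table rather than a published sizing formula, and the RAID-DP write IEFs of roughly 240% and 290% are the ones quoted above.

```python
# Minimal sketch: blend read and write IEFs by the workload mix.
# The weighted-average assumption is my inference from the table above,
# not a published NetApp formula.

def overall_ief(read_ief, write_ief, read_pct):
    """Overall I/O efficiency (%) for a workload that is read_pct% reads."""
    r = read_pct / 100.0
    return r * read_ief + (1.0 - r) * write_ief

print(round(overall_ief(100, 25, 30), 1))    # RAID-5  at 30:70 R/W -> 47.5
print(round(overall_ief(100, 50, 30), 1))    # RAID-10 at 30:70 R/W -> 65.0
print(round(overall_ief(100, 17, 50), 1))    # RAID-6  at 50:50 R/W -> 58.5
print(round(overall_ief(100, 240, 50), 1))   # RAID-DP + WAFL at 50:50 R/W -> 170.0
print(round(overall_ief(100, 290, 30), 1))   # RAID-DP + WAFL at 30:70 R/W -> 233.0 (~230 in the table)
```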

Don’t get me wrong, I think Ruben did outstanding work, work I’ve learned a lot from, but when it comes to sizing NetApp storage by I/O I think he was working with some inaccurate or outdated data that led him to some erroneous conclusions, which I hope I’ve been able to clarify in this post.

In my next post, I hope to cover how megacaching techniques such as NetApp’s FlashCache can be used in VDI environments and a few specific configuration tweaks that can be used on a NetApp array to improve the performance of your VDI environment.

  1. July 20, 2010 at 12:45 am

    Did I read this correctly? 400 IOPS per disk? 555 IOPS per disk? If I look at your SPC-1 numbers on your 3170, you guys just cover about 270 IOPS per 15K disk, and that’s at 30 ms response time with a 40/60 read/write mix and 224 drives to get that average!!! I’m also curious how you are running at 90% full when you’ve clearly pointed out that you are using a single 10 GB instance (FAS Deduplication)?

    As my daddy used to say, “something doesn’t smell right” 🙂

    @StorageTexan

    • July 20, 2010 at 9:31 am

      Tomm (or should I call you Mr Texan?), yep, you read it correctly: an effective 400 IOPS per spindle for 50:50 and 555 for 70:30. The reason the per-spindle figures are higher than the SPC-1 numbers is that the nature of the workload is so different.

      I’m pretty sure the SPC-1 workload is about 60% random reads and 40% random writes (I’ll double check), but the biggest difference comes from the fact that SPC-1 is based on an e-mail workload with very large working sets (which means lots of extra metadata reads) and truly random I/O, with offsets and read lengths generated by a rand() function call, which, as I mentioned before, tends to eliminate a lot of the intelligence available with read-sets. The SPC-1 benchmark we submitted was done in pretty much a “worst case” scenario, where we hammered the system for a number of days with truly random writes to ensure there were few good allocation areas available.

      The 90% full figure came from an option you can set when using the sizing tool, which tells it to base its modelling on a system with a certain level of free space within the aggregate (on top of the 10% WAFL already reserves); by default it chooses 10% available free space.

      This can happen when, for example, you use aggressive space savings for the volume holding the operating system images and then create NAS volumes for home directories and user data.

      VDI is a different kind of workload, and it was this that prompted my post. As always, feel free to challenge the results if you don’t think the logic stands up; I’m a big fan of “hostile” peer review.

      • July 20, 2010 at 10:08 am

        I left this off my first post- I normally put this on all my postings:

        I work for Xiotech Corp.

        ohh, don’t call me tomm – you can call me Tommyt or I sort of like the Mr Texan maybe even just Tex 🙂 Ha!!

        So, I’m just not a big fan of “to cache” numbers. They are way too easy to inflate and difficult to validate. I’ve heard EMC claim 100,000s of IOPS with a nice disclaimer of “to cache”, which most of the time is not plausible. I also think the Storage Performance Council or someone else should come up with a benchmark for VDI. Maybe VMware should have an ESRP-type test that can be run by storage providers.

        So, I’m curious where you got your 10 IOPS per desktop number from? I’ve heard 20 to 30 for boot storms. I prefer your number!!!

        @StorageTexan or Tommyt or Mr Texan or Tex or “All around good dude”

      • July 20, 2010 at 12:00 pm

        @StorageTexan, I got the numbers from “Understanding how storage design has a big impact on your VDI (UPDATED)”, which I referenced in the first post of this series.

        I am not a fan of “to cache” numbers either; they’re often silly, and it really irks me when I see customers ask for them, as for the most part they are completely irrelevant. I spent a fair bit of time outlining the effectiveness (or lack thereof) of cache for VDI workloads, and I tried as best I could to demonstrate how a FAS array achieves these numbers at steady state by using a data layout based on temporal locality of reference rather than the more usual spatial-locality-of-reference layouts used by most (all?) traditional storage arrays. Of course you can’t do this effectively without a little bit of cache, but this is why I deliberately used one of our entry-level storage controllers for my first worked example.
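
        To make the layout distinction a little more concrete, here is a toy sketch of the idea (my own illustration, not NetApp code; the block numbers and sizes are made up): an update-in-place, spatial-locality layout sends each dirty block back to its home address, scattering the writes, while a write-anywhere, temporal-locality layout gathers the blocks dirtied together and writes them as one contiguous run, recording the new locations in a map.

        ```python
        # Toy illustration only (not NetApp code): where do 32 random writes land?
        import random

        dirty = [(random.randrange(100_000), b"data") for _ in range(32)]  # (logical block, payload)

        # Spatial-locality / update-in-place: each block goes back to its home address,
        # so the writes are scattered and each one can cost a seek.
        spatial_targets = sorted(lba for lba, _ in dirty)

        # Temporal-locality / write-anywhere: blocks dirtied together are written together
        # into the next free contiguous region as one sequential run; a block map records
        # where each logical block now lives.
        next_free = 200_000
        temporal_targets = list(range(next_free, next_free + len(dirty)))
        block_map = {lba: pba for (lba, _), pba in zip(dirty, temporal_targets)}

        print("spatial span of the writes:", spatial_targets[-1] - spatial_targets[0])
        print("temporal span of the writes:", temporal_targets[-1] - temporal_targets[0])
        print("logical-to-physical map entries:", len(block_map))
        ```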

        Benchmarks are an interesting subject, and one which I intend to write about after I’ve finished this series and the next one after it, which will be focussed on reliability engineering and minimising MTBC (mean time before cockup). As much as HP sometimes annoys me, their under-loved storage research division has some great real-world customer workload traces, and methods of replaying them, which they have in the past made available to storage researchers.

        I’d be really interested to see one of these for a VDI environment used as a performance corpus (much the same way there were standardised collections of files to evaluate the effectiveness of compression algorithms), rather than the usual mechanism of using random number generators to emulate non-sequential workload patterns. The biggest problem, however, would be getting a significant body of vendors to agree that the workload in question is representative.

        Ultimately I think it will take someone outside of the storage industry to come up with the workload and then allow customer demand to drive adoption. I think VMware or Microsoft would be good candidates for this, but only time will tell.

  2. Tim Kresler
    July 20, 2010 at 2:07 am

    I’m curious where you got your write vs read numbers? It seems to me that most desktop users (at least anecdotally) read a lot more than they write. Have you done any testing on actual desktop users and tried to identify their actual usage stats, or are these all ballpark numbers? I’m also curious about your 10 IOPS/sec number.

    Not trying to call you out on anything, just curious as to how you decided on the numbers you assumed when you started your analysis.

    Tim

    • July 20, 2010 at 7:16 am

      Tim,
      I got my numbers from “Understanding how storage design has a big impact on your VDI (UPDATED)”, which is well worth reading. Ruben’s numbers correlate with what I’ve seen and what I’ve heard from others. The first time I saw such high write percentages for a Win7 VDI deployment I said that they must have done their measurements wrong; it turns out that some of that traffic was due to the online defragmenter inside Win7, but even when that was turned off the write percentages stayed high.

      Exactly why it’s like this is something of a mystery to me at the moment; my current conjecture is that a lot of the reads are being cached before they hit the storage layer. There is some more formal data and analysis that I’m hoping will be available soon. If and when I get hold of it, I’ll publish the findings or point to where they can be found.

      Feel free to call me on anything you don’t think is right; I’ve made mistakes in the past and will do so again in the future. The nice thing about blog comments is that they improve what you write.

  3. Adriaan
    July 22, 2010 at 4:00 am

    John,
    I have to congratulate you on the quality and simplicity of this series of blogs, in contrast to most NetApp blogs, which strive to explain every detail.
    You have a knack for making things clear, and your IEF metric is very helpful. I have noticed that with WAFL being write-optimised (many IOPS at very low latency), the effect is that the wall-clock time spent on writes is reduced, leaving more time for reads. I once saw WAFL vs normal RAID represented in pie charts, which gave a very visual ah-ha! moment.

    Maybe you could do a series as part of this with the correct numbers? i.e. how the 555 IOPS is made up of x random reads taking ..ms and y sequential writes taking ..ms.

    The way I have seen it, WAFL spends so little back-end time doing writes that the limit of the array can almost be measured on the read requirement alone.

  4. July 28, 2010 at 4:16 pm

    Hi John,

    I’m reading this series with great interest, both because you’ve been of great assistance in the past and because I’ve had a lot to do with some large-scale VDI deployments.
    You have confirmed some of my current theories around how this stuff should be sized (especially around writes) and I certainly learned some stuff about WAFL that I didn’t know before…
    However, something I’ve found with VDI is that the storage must be able to cope with peak performance usage. This is generally boot or update time, and the IOPS profile will be far higher at that point than the normal 10 IOPS allowed for. I know that the NetApp PAM II cards offer a solution for read caching, but they may not offer much assistance in a situation like the one you’ve described.
    In short, I would spec a system for this situation with more disks to cover those boot and upgrade storms, even if that much storage may not be required.

    David

    • July 29, 2010 at 1:12 pm

      David,
      you make an excellent point. Boot storms and virus checkers are reasonably easy to deal with via read caching like PAM; “Patch Tuesday”, where massive updates need to be applied across a wide range of machines, can be more problematic. Part of the reason I’m taking so long with my next post on megacaches is to characterise the effectiveness of caching for peak-load situations like this, and the available information on this is more anecdotal than I’d like.

      As a sneak preview, these are the kinds of things I’m looking at:

      1. It’s probably better to avoid the problem of mass updates via patching through creative virtual desktop deployment methods, or by scheduling these known workloads for more “friendly” times (e.g. 2 AM).

      2. In many ways a WAFL aggregate with 30%+ free space can, from a write perspective, be thought of as a megacache in and of itself, especially when it has reasonably large areas of contiguous free space for the write allocator to work its magic. In cases like that, 1500+ write IOPS per spindle are achievable, though not necessarily sustainable without subsequently cleaning up the allocation areas. In this respect it’s not dissimilar to SSD-based cache architectures, though at 16 TB per aggregate it is typically an order of magnitude larger than what is seen elsewhere.

      It will probably take me until the end of the year to cover everything I’d like to write about, but in the meantime I’d appreciate anything you come up with where practice doesn’t match the theory, as it should help improve the quality of the posts.

      Regards
      John

  5. Tore
    August 19, 2011 at 8:33 pm

    When is the next post?

    • October 9, 2011 at 8:22 pm

      I was waiting on some research to be completed before I wrote much more. That research has pretty much been done, but the time to write it up has been hard to find. I do intend to continue this series as soon as I can. Thanks for the interest.

      Regards
      John
