Data Storage for VDI – Part 8 – Misalignment


If you follow NetApp’s best practice documentation, everything I talked about at the end of my previous post works as well as, if not better than, outlined there. Having said that, it’s worth repeating that there are some workloads that are very difficult to optimize, and some configurations that don’t allow the optimization algorithms to work at all; the most prevalent of these is misaligned I/O.

If you follow best practice guidelines (and we all do that now, don’t we …) then you’ll be intimately familiar with NetApp’s Best Practices for File System Alignment in Virtual Environments. If, on the other hand, you’re like pretty much everyone who went to the VMware course I attended, then you may be of the opinion that it doesn’t make that much of a difference. I suspect that if I asked your opinion on whether you should go to the effort of ensuring that your guest O/S partitions are aligned, your response would probably fall into one of the following categories:

  1. Unnecessary
  2. Not Recommended by VMware (They do, but I’ve heard people say this in the past)
  3. Something I should do when I can arrange some downtime during the Christmas holidays
  4. What you talkin’ about, Willis?

If there is one thing I’d like you to take away from this post, it is the incredible importance of aligning your guest operating systems. After the impact of old school backups and virus scans, it’s probably the leading cause of poor performance at the storage layer. This is particularly true if you have minimized the number of spindles in your environment by using single instancing technologies such as FAS deduplication.

Of course, this being my blog, I will now go into painful detail to show why it’s so important. If you’re not interested, or have already ensured that everything is perfectly aligned, stop reading and wait until I post my next blog entry 🙂

Every disk reads and writes its data in fixed sector sizes, usually either 512 or 520 bytes; the 520-byte format effectively stores 512 bytes of user data and 8 bytes of checksum data. Furthermore, the storage arrays I’ve worked with that get a decent number of IOPS per spindle all use some multiple of these 512 bytes of user data as the smallest chunk that is stored in cache, usually 4 KiB or some multiple thereof. The arrays then perform reads and writes to and from the disks using these chunks along with the appropriate checksum information. This works well because most applications and filesystems on LUNs / VMDKs / VHDs etc. also write in 4K chunks. In a well configured environment, the only time you’ll have a read, or more importantly a write, request that is not some multiple of 4K is in NAS workloads, where overwrite requests can happen across a range of bytes rather than a range of blocks, but even then it’s a rare occurrence.
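
To make that arithmetic concrete, here’s a minimal Python sketch (mine, not anything the array actually runs) that works out how many 4 KiB chunks a guest write touches given its byte offset on the LUN/VMDK. The 4 KiB chunk size is the assumption from the paragraph above, and the 63-sector offset in the second example is the classic default partition start that comes up again later.

```python
# A minimal sketch (not array code): how many 4 KiB array chunks does a guest
# write touch, given its byte offset and length on the LUN/VMDK?
CHUNK = 4096  # assumed array cache / on-disk chunk size

def chunks_touched(offset_bytes, length_bytes, chunk=CHUNK):
    first = offset_bytes // chunk
    last = (offset_bytes + length_bytes - 1) // chunk
    return last - first + 1

# A 4 KiB write starting on a 4 KiB boundary stays inside one chunk...
print(chunks_touched(8 * 4096, 4096))              # -> 1

# ...but shift every write by the classic 63-sector (31.5 KiB) partition
# offset and the same 4 KiB write straddles two chunks.
print(chunks_touched(63 * 512 + 8 * 4096, 4096))   # -> 2
```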

Misaligned I/O causes a write from a guest to partially write to two different blocks, which is explained with pretty diagrams in Best Practices for File System Alignment in Virtual Environments. However, that document doesn’t quite stress how much of a performance impact this can have compared to nicely aligned workloads, so I’ll spend a bit of time on it here.

When you completely overwrite a block in its entirety, an array’s job is trivially easy:

  1. Accept the block from the client and put it in one of the write cache’s block buffers
  2. Seek to the block you’re going to write to
  3. Write the block

Net result = 1 seek + 1 logical write operation (plus any RAID overheads)

However, when you send an unaligned block, things get much harder for the array:

  1. Accept a block’s worth of data from the client, put some of it in one of the block buffers in the array’s write cache, and put the rest of it into the adjacent block buffer. Neither of these block buffers will be completely full, however, which is bad.
  2. If you didn’t already have the blocks that are going to be overwritten in the read cache, then
    1. Seek to where the two blocks start
    2. Read the two blocks from disk to get the parts you don’t know about
    3. Merge the information you just read from disk / read cache with the block’s worth of data you received from the client
    4. Overwrite the two blocks with the data you just merged together

Net result = 1 seek + some additional CPU + double write cache consumption + 2 additional 4K reads and 1 additional 4K write (plus any RAID overheads) + inefficient space consumption.
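
If you prefer the amplification as numbers rather than prose, here’s a rough Python sketch that counts the logical I/Os under the naive read-modify-write path described above. It assumes no caching help at all, so treat it as a worst-case illustration rather than a model of any particular array (ONTAP included does considerably better).

```python
# Rough illustration only: logical disk I/Os for a single 4 KiB guest write
# under a naive read-modify-write path, with no caching help at all.
CHUNK = 4096

def io_cost(offset_bytes, length_bytes=4096, chunk=CHUNK, in_read_cache=False):
    aligned = offset_bytes % chunk == 0 and length_bytes % chunk == 0
    if aligned:
        # Full-chunk overwrite: nothing needs to be read first.
        return {"reads": 0, "writes": length_bytes // chunk}
    # Misaligned: the write straddles two chunks, neither fully overwritten,
    # so both must be read (unless already cached), merged and written back.
    return {"reads": 0 if in_read_cache else 2, "writes": 2}

print(io_cost(8 * 4096))               # {'reads': 0, 'writes': 1}
print(io_cost(63 * 512 + 8 * 4096))    # {'reads': 2, 'writes': 2}
```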

The problem, as you’ll see, isn’t so much the misaligned write itself, but the partial block writes that it generates. In well configured “Block” environments (FC / iSCSI) you simply won’t ever see a partial write, however in “File” (CIFS/NFS) environments partial writes are a relatively small, but expected, part of many workloads. Because FAS arrays are truly unified for both block and file, Data ONTAP has some sophisticated methods of detecting partial writes, holding them in cache, combining them where possible, and committing them to disk as efficiently as possible. Even so, partial writes are really hard to optimize well.

There are many clever ways of optimizing caching algorithms to mitigate the impact of partial writes, and NetApp combines a number of these in ways that I’m not at liberty to disclose outside of NetApp. We developed these optimizations because a certain amount of bad partial write behavior is expected from workloads targeted at a FAS controller, and much like it is with our kids at home, tolerating a certain amount of “less than wonderful” behavior without making a fuss allows the household to run harmoniously. But this tolerance has its limits, and after a point the behavior needs to be pulled into line. While Data ONTAP can’t tell a badly behaved application to sit quietly in the corner and consider how its behavior is affecting others, it can mitigate the impact of partial writes on well behaved applications.
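
For the curious, here’s a toy example of one generic technique in this space, write coalescing: hold partial writes in cache and merge neighbours that between them cover a whole chunk, so the chunk can be flushed as a single aligned overwrite with no read required. To be absolutely clear, this is not ONTAP’s algorithm (which I’m deliberately not describing), just an illustration of the general idea, and it ignores overlapping fragments for brevity.

```python
# Toy write-coalescing sketch: hold partial writes per chunk and flush once a
# chunk is fully covered. NOT ONTAP's algorithm, just the general idea.
CHUNK = 4096
pending = {}  # chunk number -> {offset within chunk: data}

def cache_partial_write(offset_bytes, data, chunk=CHUNK):
    frags = pending.setdefault(offset_bytes // chunk, {})
    frags[offset_bytes % chunk] = data
    covered = sum(len(d) for d in frags.values())
    if covered >= chunk:
        return "chunk fully covered: flush as one aligned write, no read needed"
    return "held in cache: flushing now would mean a read-modify-write"

print(cache_partial_write(7 * 4096, b"x" * 2048))         # first half of chunk 7
print(cache_partial_write(7 * 4096 + 2048, b"x" * 2048))  # second half completes it
```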

Unfortunately, environments that do wholesale P2V migrations of WinXP desktops without going through an alignment exercise will almost certainly generate a large number of misaligned writes. While Data ONTAP does what it can to maintain the highest performance under those circumstances, these misaligned writes are much harder to optimise, which in turn will probably have a non-trivial impact on overall performance by multiplying the number of I/Os required to meet the workload’s requirements.

If you do have lots of unaligned I/O in your environment, you’re faced with one of four options.

  1. Use the tools provided by NetApp and others like VisionCore to help you bring things back into alignment
  2. Put in larger caches. Larger caches, especially megacaches such as Flash Cache, mean the data needed to complete the partial write will already be in memory, or at least on a medium that allows sub-millisecond read times for the data required to complete partial writes.
  3. Put in more disks, if you distribute the load amongst more spindles, then the read latency imposed by partial writes will be reduced
  4. Live with the reduced performance and unhappy users until your next major VDI refresh

Of course the best option is to avoid misaligned I/O in the first place by following Best Practices for File System Alignment in Virtual Environments. This really is one friendly manual that is worth following regardless of whether you use NetApp storage or something else.
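
If you want a quick way to sanity-check your own guests, the following sketch (my own, not a NetApp tool) flags any partition whose starting offset isn’t a multiple of 4 KiB. Feed it the StartingOffset values reported inside the guest, for example from `wmic partition get Name, StartingOffset` on Windows; the sample entries are the well-known WinXP 63-sector default and the 1 MiB default used by newer Windows releases.

```python
# Quick alignment sanity check: flag any partition whose starting offset is
# not a multiple of 4 KiB.
def check_alignment(offsets_bytes, chunk=4096):
    for name, offset in offsets_bytes.items():
        status = "aligned" if offset % chunk == 0 else "MISALIGNED"
        print("%s: offset %d bytes -> %s" % (name, offset, status))

check_alignment({
    "WinXP default (63 sectors)": 63 * 512,      # 32256  -> MISALIGNED
    "Vista/Win7 default (1 MiB)": 1024 * 1024,   # 1048576 -> aligned
})
```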

To summarise – misaligned I/O and partial writes are evil and they must be stopped.

  1. July 21, 2010 at 1:00 pm

    Some vendors will tell you that disk alignment is necessary because each track has 64 sectors blah blah blah. I love that story, could you tell me another fairy tale please? You see, the last time there were tracks with a fixed 64 sectors per track, disk drives held a maximum of 8.4GB.

    The truth is that disk alignment today has everything to do with cache architecture. This article has gone a long way toward explaining how that works for NetApp. How about other vendors? Does anyone care to chime in?

    JohnFul

    • Sebastian Goetze
      July 23, 2010 at 5:36 am

      @JohnFul:
      It’s because of the historic legacy of the first *63*-sector track containing the MBR, which we can still find in many partitioning schemes. Then the whole partition and its blocks will all be ‘1 sector off’.

      Read the ‘Best Practices for File System Alignment…’ *please*.
      It’s the same problem for every storage vendor not writing 512 Byte blocks, IAW all I know. The cache usually caches blocks, usually 4K blocks, usually just like they were/will be written to disk. Therefore it’s more of a file/disk/4K block alignment problem than a cache architecture problem…

      Sebastian

  2. danieladeniji
    April 3, 2013 at 10:16 am

    RJ:

    Thanks for posting this beautiful blog entry.

    I was brought to this page via your response in http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html

    Your understanding is truly astounding and worth re-reading (many times over).

    Daniel

    • April 3, 2013 at 11:21 am

      Thanks Daniel,
      it’s worth noting that since I wrote this almost three years ago, a lot of work has been done to alleviate the partial write pain for NetApp customers. ONTAP 8.1.1 in particular has completely revamped partial write code, plus some warning mechanisms (which still need some work to avoid false positives when writing log files). This is one of those things I wish I had the time and clarity around non-disclosure to write about, but right now my head is full of software defined networking and the implications this has for storage and datacenter architecture over the next few years.

      Once again, thanks for the comment.

      Regards
      John

      • danieladeniji
        April 3, 2013 at 12:58 pm

        Thanks for the good and positive feedback.

        My original blog post was just trying to ensure that I had followed best practice per MS Windows Luns that I provisioned. And, so I documented the NetApp Filer side of the validation @ http://danieladeniji.wordpress.com/2012/12/13/netapp-lun-aligning/

        This morning someone pointed out that I was wrong in my understanding of what the “histogram percentage” actually means when one issues “lun alignment show”.

        At the back of my mind I know that Microsoft is a bit pointed when it says:

        http://technet.microsoft.com/en-us/library/dd758814(v=sql.100).aspx

        “Other vendors claim that partition alignment is not a required optimization. For example, one vendor states that partition alignment is not necessary, adding that it “neither enhances nor detracts from [SAN] sequential performance”. The claims are intriguing, but corroborating data is lacking or in dispute. The statement explicitly cites sequential I/Os, but it fails to address random I/Os, optimal performance of which is important for OLTP databases and Analysis Services databases”

        And, so I went back and did a bit more review. And, that work is documented http://danieladeniji.wordpress.com/2013/04/02/technical-microsoft-sql-server-datafiles-log-file-write-patterns/.

        The good thing about writing is that it forces us to read dozens of blog entries as we often find that vendors reference documentation is a bit dry and does not take into account “episoderial” (working) knowledge.

        People reach new understandings all the time and I liked your closing argument. And so I took it to wrap things up.

        I know this is not a religious blog, but I must leave you with these words:

        Leviticus 23:22

        “‘When you reap the harvest of your land, do not reap to the very edges of your field or gather the gleanings of your harvest. Leave them for the poor and for the foreigner residing among you. I am the LORD your God.'”

        Thanks for sharing your deep knowledge with us in needs.

        – Daniel

      • April 3, 2013 at 1:09 pm

        I’d never thought of that passage in reference to gifting knowledge to the community; it’s given me something to think about, for which I thank you. For the most part I write to gain clarity: the act of condensing your experience for the sake of teaching others helps me keep focus on the important details, and it’s a double bonus when someone else gets something out of it (I don’t actually expect anyone to read my blog 🙂 )

