Logical Shift: October 2011

While not unique to virtualization, it generally doesn't cause much of a problem until you consolidate a whole bunch of poorly setup partitions onto one array that things tend to go from all right to a really bad day. The fundamental problem is a mismatch between where the OS places data and where the storage array ultimately keeps it. Both work in logical chunks of data and both present a virtual view of this to the higher layers. There are two specific cases that I'd like to address, one that you can fix, and one that you can only manage.

Storage Alignment
Lets start with a simple illustration of the problem.

Most legacy operating systems (Linux included) like to include a 63 sector offset at the beginning of a drive. This is a real problem as now every read and write overlaps the block boundaries of a physical array. It doesn't matter if you are using VMFS or NFS to host a data store, it's the same problem. Yes an NFS repository will always be aligned, but remember this is a virtual representation to the OS, which happily messes everything up by offsetting its first partition.

Alignment is bad enough when we read. The storage array will pull two blocks when one is read, but it is of particular importance when we write data. Most arrays use some sort of parity to manage redundancy, and if you need to deal with two blocks for every one write request, the system overhead can be enormous. It's also important to keep in mind that every storage vendor has this issue. Even a raw, single drive can benefit from aligned partitions, especially when we consider most new drives ship with a 4KB sector size called Advanced Format.

The impact to each vendor will be slightly different. For example, EMC uses a 64KB block size, so not every write will be unaligned. NetApp uses a 4KB block, which means every write will be unaligned but they handle writes quite a bit differently as the block doesn't have to go back to the same place it came from. Pick your poison.

As you can see, when the OS blocks are aligned, everything through the stack can run at its optimal rate, where one OS request translates to one storage request.

Correctable Block Alignment
Fortunately most modern operating systems have recognized this problem and there is little to do. For example Windows 2008 now uses a 2048 cylinder offset (1MB) as do most current Linux distributions. For Linux, it is easy enough to check.

# fdisk -lu /dev/sdb

Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x77cbefef

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048  3907029167  1953513560   83  Linux

As you can see from this example, my starting cylinder is 2048. After that everything will align, including subsequent partitions. The -u option tells fdisk to display cylinders. I generally recommend this option when creating the partition as well although this seems to be the default for fdisk 2.19. You can also check your file system block size to better visualize how this relates through the stack. The following shows that my ext4 partition is using a 4KB block size:

# dumpe2fs /dev/sdb1 | grep -i 'block size'
dumpe2fs 1.41.14 (22-Dec-2010)
Block size:               4096

Uncorrectable Block Alignment
VMware has come out with a nifty utility I first saw in VDI (now called View) and later placed into their cloud offering called a linked clone. Basically it allows you to create a copy of a machine using very little disk space quickly because it reads common data from the original source and writes data to a new location. Sounds a lot like snap shots doesn't it?

Well the problem with this approach is that every block written requires a little header to tell VMware where this new block belongs in the grand scheme of things. This is similar to our 63 cylinder offset but now for every block, nice. It's a good idea to start your master image off with an aligned file system as it will help with reading data but doesn't amount to much when you write. And does a windows desktop ever like to write. Just to exist (no workload), our testing has shown windows does 1-2 write IOs per second. Linux isn't completely off the hook, but generally does 1/3 to 1/2 of that and isn't that common with linked clones yet as it isn't supported in VDI but will get kicked around in vCloud.

Managing Bad Block Alignment

When using linked clones, there are a few steps you can take to minimize the impact:

Refresh your images as often as you can. This will keep the journal file to a minimum and corresponding system overhead. If you can't refresh images, you probably shouldn't be using linked clones.
Don't turn on extra features for those volumes like array based snapshots or de-duplication. The resources needed to track changes for both of these features can cause significant overhead. Use linked clones or de-dupe, not both.
Monitor your progress on a periodic basis. I do this 3 times a day so we can track changes over time. If you can't measure it, you can't fix it.

In the future VAAI is promised to save us by both VMware and every storage vendor that can spell. It's intent is to perform the same linked clone api call but let the storage array figure out the best method of managing the problem. I've yet to see it work in practice, it's still "in the next release", but I have hope.

Logical Shift

Sunday, October 30, 2011

Misaligned I/Os

Wednesday, October 26, 2011

Firefox hang with FilerView