Storage Alignment
Lets start with a simple illustration of the problem. Most legacy operating systems (Linux included) like to include a 63 sector offset at the beginning of a drive. This is a real problem as now every read and write overlaps the block boundaries of a physical array. It doesn't matter if you are using VMFS or NFS to host a data store, it's the same problem. Yes an NFS repository will always be aligned, but remember this is a virtual representation to the OS, which happily messes everything up by offsetting its first partition.
Alignment is bad enough when we read. The storage array will pull two blocks when one is read, but it is of particular importance when we write data. Most arrays use some sort of parity to manage redundancy, and if you need to deal with two blocks for every one write request, the system overhead can be enormous. It's also important to keep in mind that every storage vendor has this issue. Even a raw, single drive can benefit from aligned partitions, especially when we consider most new drives ship with a 4KB sector size called Advanced Format.
The impact to each vendor will be slightly different. For example, EMC uses a 64KB block size, so not every write will be unaligned. NetApp uses a 4KB block, which means every write will be unaligned but they handle writes quite a bit differently as the block doesn't have to go back to the same place it came from. Pick your poison.
As you can see, when the OS blocks are aligned, everything through the stack can run at its optimal rate, where one OS request translates to one storage request.
Correctable Block Alignment
Fortunately most modern operating systems have recognized this problem and there is little to do. For example Windows 2008 now uses a 2048 cylinder offset (1MB) as do most current Linux distributions. For Linux, it is easy enough to check.
# fdisk -lu /dev/sdb
Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x77cbefef
Device Boot Start End Blocks Id System
/dev/sdb1 2048 3907029167 1953513560 83 Linux
As you can see from this example, my starting cylinder is 2048. After that everything will align, including subsequent partitions. The -u option tells fdisk to display cylinders. I generally recommend this option when creating the partition as well although this seems to be the default for fdisk 2.19. You can also check your file system block size to better visualize how this relates through the stack. The following shows that my ext4 partition is using a 4KB block size:# dumpe2fs /dev/sdb1 | grep -i 'block size'
dumpe2fs 1.41.14 (22-Dec-2010)
Block size: 4096
Uncorrectable Block Alignment
VMware has come out with a nifty utility I first saw in VDI (now called View) and later placed into their cloud offering called a linked clone. Basically it allows you to create a copy of a machine using very little disk space quickly because it reads common data from the original source and writes data to a new location. Sounds a lot like snap shots doesn't it?
Well the problem with this approach is that every block written requires a little header to tell VMware where this new block belongs in the grand scheme of things. This is similar to our 63 cylinder offset but now for every block, nice. It's a good idea to start your master image off with an aligned file system as it will help with reading data but doesn't amount to much when you write. And does a windows desktop ever like to write. Just to exist (no workload), our testing has shown windows does 1-2 write IOs per second. Linux isn't completely off the hook, but generally does 1/3 to 1/2 of that and isn't that common with linked clones yet as it isn't supported in VDI but will get kicked around in vCloud.
Managing Bad Block Alignment
When using linked clones, there are a few steps you can take to minimize the impact:
- Refresh your images as often as you can. This will keep the journal file to a minimum and corresponding system overhead. If you can't refresh images, you probably shouldn't be using linked clones.
- Don't turn on extra features for those volumes like array based snapshots or de-duplication. The resources needed to track changes for both of these features can cause significant overhead. Use linked clones or de-dupe, not both.
- Monitor your progress on a periodic basis. I do this 3 times a day so we can track changes over time. If you can't measure it, you can't fix it.