
Sunday, November 13, 2011

Misaligned I/O reporting

In the last post I showed a script for logging misaligned I/Os on a NetApp storage array. However, it is nice to produce some graphs of the data, so here is my quick parse script to crank out a CSV file for a given filer. If you recall, the output from the collection script was as follows:
22:10:01 UTC - Zeroing stats from filer np00003
22:10:02 UTC - sleeping for 300 seconds
22:15:02 UTC - Capturing data from filer np00003
22:15:03 UTC - Collected Values:
22:15:03 UTC - interval = 300.399491 seconds, pw.over_limit = 249 and WAFL_WRITE = 540822
22:15:03 UTC - Percentage Misaligned Writes = .0460%
22:15:03 UTC - Successfully Completed!
My logs are kept under ./NetApp_Logs/, one file per run, named NetApp-align-{filer}-{yyyymmdd}-{hhmmss}. This little script walks through all of the logs for a given filer and outputs a CSV as shown here:
2011-10-10,22:15:03,.0460%
2011-10-11,10:15:03,.0670%
2011-10-11,22:15:03,.1500%
2011-10-12,20:15:03,.4500%
I have found this format works well with spreadsheet programs, which parse out both the date and time appropriately, saving me a lot of time. And here is the script:
$ cat na_parse_alignment
if [ "$#" == 0 ]
then
        echo "Usage: `basename $0` "
        echo "e.g."
        echo "`basename $0` {filer}"
        exit 0
fi
FILER=$1
grep -H Percent NetApp_Logs/NetApp-align-${FILER}-*  | sed -e 's/\(2011\)\([0-9].\)\([0-9].\)\(-[0-9].....:\)/ \1-\2-\3 /g' | awk '{print $2" "$3" "$NF}' | tr " " , 
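To produce a file your spreadsheet can open, just redirect the output; the filer name here is the one from the logs above:
$ ./na_parse_alignment np00003 > np00003-alignment.csv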

Sunday, November 6, 2011

Tracking Misaligned I/Os

For those who don't know about misaligned I/Os, I provided a brief introduction in my last post. In this post I'll show you how to track and quantify how many of your I/Os exhibit the problem. We currently run our VMware infrastructure from a NetApp array, so this script is tailored to that environment.

The only method I've found to track them is the pw.over_limit counter. Unfortunately, it is only available on the command line with advanced privileges, as it isn't exported via SNMP. You can manually obtain the data with the following:
# priv set advanced
# wafl_susp -w
# priv set
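Of the lengthy report this prints, only a couple of lines matter for our purposes. On my filer they look something like the following (your values will differ):
WAFL statistics over 577455.189585 second(s) ...
pw.over_limit = 29300
New messages, restarts, suspends, and waffinity completions (by message-type):
WAFL_WRITE           = 10568010   122597   122597        0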
You'll quickly notice a problem: pw.over_limit is an absolute number, like 29300. So what? Is that good or bad? That depends on how many I/Os you are generating in the first place. Since I love writing little programs, here is the output of my script:
$ cat NetApp-align-filer1-20111010-221001
22:10:01 UTC - Zeroing stats from filer filer1
22:10:02 UTC - sleeping for 300 seconds
22:15:02 UTC - Capturing data from filer filer1
22:15:03 UTC - Collected Values:
22:15:03 UTC - interval = 300.399491 seconds, pw.over_limit = 249 and WAFL_WRITE = 540822
22:15:03 UTC - Percentage Misaligned Writes = .0460%
22:15:03 UTC - Successfully Completed! 
As you can see, there are a few misaligned writes going on, but overall the filer is in pretty good health, with misaligned writes making up just under 0.05% of the total. When should you panic? That depends on way too many variables to list here, but write latency is a key indicator to pay attention to. Your application mix (and users) will scream when you've crossed the line.
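The percentage is simply the ratio of the two counters; you can double-check the figure above with bc, which is the same calculation the script performs:
$ echo "scale=4;100*249/540822" | bc -l
.0460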

The code I use is listed below. Read the comments in the parse_wafl function to figure out what it's doing.
#!/bin/bash
# Script Information
# na_alignment_check Version: 1.0
# Last Modified: Sept 18/2010
# Created By: Michael England

# to run this script you will need an account on the filer with administrative privileges as it has to be run with 'priv set advanced'
# ssh prep work
# ssh-keygen -t dsa -b 1024
# cat id_dsa.pub >> {filer}/etc/sshd/{user}/.ssh/authorized_keys
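# you can confirm key-based login works before relying on the script, e.g.
#   ssh -l {user} {filer} version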

# default to local user
FILER_USER=$(whoami)

######
### function log
### logs activities to the screen, a file, or both
######
function log {
 LOG_TYPE="$1"
 LOG_MSG="$2"
 TIME=`date +'%H:%M:%S %Z'`
 # specify the log file only once
 if [ -z "$LOG_FILE" ]
 then
  LOG_FILE="/work/Scripts/NetApp_Logs/NetApp-align-$FILER-`date +%Y%m%d-%H%M%S`"
 fi
 if [ $LOG_TYPE == "error" ]
 then
  echo -e "$TIME - **ERROR** - $LOG_MSG"
  echo -e "$TIME - **ERROR** - $LOG_MSG" >> $LOG_FILE
 elif [ $LOG_TYPE == "debug" ]
 then
  if [ $DEBUG == "on" ]
  then
   echo -e "DEBUG - $LOG_MSG"
  fi
 else
  echo -e "$TIME - $LOG_MSG"
  echo -e "$TIME - $LOG_MSG" >> "$LOG_FILE"
 fi
}

######
### function check_ssh
### check the return code of an ssh command
######
function check_ssh {
 CHECK=$1
 ERROR_DATA="$2"
 if [ "$CHECK" != 0 ]
 then
  log error "ssh failed to filer"
  log error "return is $ERROR_DATA"
  exit 1
 fi
}

######
### function capture_wafl_susp
### ssh to the filer specified and collect the output from wafl_susp
######

function capture_wafl_susp {
 log notify "Capturing data from filer $FILER"
 SSH_RET=`ssh $FILER -l $FILER_USER "priv set advanced;wafl_susp -w;priv set" 2>&1`
 RETVAL=$?
 check_ssh $RETVAL "$SSH_RET"

 parse_wafl "$SSH_RET"
}

######
### parse_wafl
### capture values for pw.over_limit and WAFL_WRITE and the overall scan interval
### e.g.
### WAFL statistics over 577455.189585 second(s) ...
### pw.over_limit = 29300
### New messages, restarts, suspends, and waffinity completions (by message-type):
### WAFL_WRITE           = 10568010   122597   122597        0
### 
### There are many other WAFL_WRITE lines so we need to find the New messages line first then the WAFL_WRITE in that section
######
function parse_wafl {
 # the full wafl_susp output arrives as our single (quoted) argument
 SSH_RET="$1"
 oldIFS=$IFS
 IFS=$'\n'
 NEW_FLAG=0
 # with IFS set to newline only, this iterates over the output line by line
 for line in $SSH_RET
 do
  if [[ $line =~ second ]]
  then
   STATS_INTERVAL=`echo $line | awk '{print $4}'`
  fi

  if [[ $line =~ pw\.over_limit ]]
  then
   OVER_LIMIT=`echo $line | awk '{print $3}'`
  fi

  if [[ $line =~ New\ messages.* ]]
  then
   NEW_FLAG=1
  fi
  if [[ $NEW_FLAG == 1 ]] && [[ $line =~ WAFL_WRITE ]]
  then
   WAFL_WRITE=`echo $line | awk '{print $3}'`
   NEW_FLAG=0
  fi
 done
 IFS=$oldIFS
 if [[ -n $OVER_LIMIT ]] && [[ $WAFL_WRITE -gt 0 ]]
 then 
  # by multiplying by 100 first we don't lose any precision
  MISALIGNED_PERCENT=`echo "scale=4;100*$OVER_LIMIT/$WAFL_WRITE" | bc -l`
 else
  log error "Error collecting values, pw.over_limit = $OVER_LIMIT and WAFL_WRITE = $WAFL_WRITE"
  # no point reporting a bogus percentage if the counters didn't parse
  exit 1
 fi
 
 log notify "Collected Values:"
 log notify "interval = $STATS_INTERVAL seconds, pw.over_limit = $OVER_LIMIT and WAFL_WRITE = $WAFL_WRITE"
 log notify "Percentage Misaligned Writes = ${MISALIGNED_PERCENT}%"
}

######
### function zero_values
### zeroes out existing wafl stats on the filer
######
function zero_values {
 log notify "Zeroing stats from filer $FILER"
 SSH_RET=`ssh $FILER -l $FILER_USER "priv set advanced;wafl_susp -z;priv set" 2>&1`
 RETVAL=$?
 check_ssh $RETVAL "$SSH_RET"
}

######
### function usage
### simple user information for running the script
######
function usage {
 echo -e ""
 echo "Usage:"
 echo "`basename $0` -filer {filer_name} [-username {user_name}] [-poll_interval ]"
 echo -e "\t-filer {file_name} is a fully qualified domain name or IP of a filer to poll"
 echo -e "\t-username {user_name} is a user to attach to the filer, if omitted will use current user"
 echo -e "\t-poll_interval {x} will zero out the filer stats and sleep for {x} seconds then return.  If omitted will read stats since last zeroed"
 echo -e ""
 exit 0
}

# parse command line options
if [ $# == 0 ]
then
 usage
fi
until [ -z "$1" ]
do
 case "$1" in
 -filer)
  shift
  FILER="$1"
  ;;
 -username)
  shift
  FILER_USER="$1"
  ;;
 -poll_interval)
  shift
  POLL_INTERVAL="$1"
  ;;
 *)
  usage
  ;;
 esac
 shift
done
# do the work

if [[ -n $POLL_INTERVAL ]]
then
 zero_values
 log notify "sleeping for $POLL_INTERVAL seconds"
 sleep $POLL_INTERVAL
fi
capture_wafl_susp

log notify "Successfully Completed!"

Sunday, October 30, 2011

Misaligned I/Os

While misaligned I/O is not unique to virtualization, it generally doesn't cause much of a problem until you consolidate a whole bunch of poorly set up partitions onto one array; that's when things tend to go from all right to a really bad day. The fundamental problem is a mismatch between where the OS places data and where the storage array ultimately keeps it. Both work in logical chunks of data, and both present a virtual view of this to the higher layers. There are two specific cases I'd like to address: one that you can fix, and one that you can only manage.

Storage Alignment
Let's start with a simple illustration of the problem.
Most legacy operating systems (Linux included) like to place the first partition at a 63-sector offset from the beginning of a drive. This is a real problem, as now every read and write overlaps the block boundaries of the physical array. It doesn't matter whether you are using VMFS or NFS to host a datastore; it's the same problem. Yes, an NFS repository will always be aligned, but remember, this is a virtual representation to the OS, which happily messes everything up by offsetting its first partition.
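The arithmetic makes the mismatch easy to see; here is a quick sanity check in the shell, assuming 512-byte sectors and a 4KB array block:
$ echo $((63 * 512))          # the partition starts 32256 bytes (31.5KB) into the disk
32256
$ echo $((63 * 512 % 4096))   # remainder within a 4KB array block; non-zero means misaligned
3584
Each 4KB file system block therefore begins 3584 bytes into one array block and spills over into the next.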

Misalignment is bad enough when we read; the storage array will pull two blocks when only one is requested. But it is of particular importance when we write data. Most arrays use some sort of parity to manage redundancy, and if the array needs to deal with two blocks for every single write request, the system overhead can be enormous. It's also important to keep in mind that every storage vendor has this issue. Even a raw, single drive can benefit from aligned partitions, especially when we consider that most new drives ship with a 4KB sector size called Advanced Format.

The impact will be slightly different for each vendor. For example, EMC uses a 64KB block size, so not every write will be unaligned. NetApp uses a 4KB block, which means every write will be unaligned, but they also handle writes quite a bit differently, as a block doesn't have to go back to the same place it came from. Pick your poison.
When the OS blocks are aligned, on the other hand, everything through the stack can run at its optimal rate: one OS request translates to one storage request.

Correctable Block Alignment
Fortunately, most modern operating systems have recognized this problem, so there is often little to do. For example, Windows Server 2008 now places the first partition at a 2048-sector offset (1MB), as do most current Linux distributions. For Linux, it is easy enough to check:
# fdisk -lu /dev/sdb

Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x77cbefef

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048  3907029167  1953513560   83  Linux
As you can see from this example, my partition starts at sector 2048, which is exactly 1MB into the disk. After that everything will align, including subsequent partitions. The -u option tells fdisk to display sectors rather than cylinders; I generally recommend it when creating the partition as well, although this seems to be the default for fdisk 2.19. You can also check your file system block size to better visualize how this relates through the stack. The following shows that my ext4 partition is using a 4KB block size:
# dumpe2fs /dev/sdb1 | grep -i 'block size'
dumpe2fs 1.41.14 (22-Dec-2010)
Block size:               4096
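To make the check mechanical: with 512-byte sectors, a partition is 4KB-aligned when its starting sector is divisible by 8. A one-liner along these lines confirms it (the device name is just this example's, and it assumes no boot flag is set, as in the listing above, so the start sector is the second field):
$ START=$(fdisk -lu /dev/sdb | awk '/^\/dev\/sdb1 / {print $2}')
$ echo $((START % 8))   # 0 means the partition start is 4KB-aligned
0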

Uncorrectable Block Alignment
VMware has come out with a nifty feature, the linked clone, which I first saw in VDI (now called View) and which later made it into their cloud offering. Basically, it allows you to quickly create a copy of a machine using very little disk space, because it reads common data from the original source and writes new data to a new location. Sounds a lot like snapshots, doesn't it?

Well, the problem with this approach is that every block written requires a little header to tell VMware where the new block belongs in the grand scheme of things. This is similar to our 63-sector offset, but now for every block; nice. It's a good idea to start your master image off with an aligned file system, as that helps with reads, but it doesn't amount to much when you write. And does a Windows desktop ever like to write: just to exist (no workload), our testing has shown Windows issues 1-2 write I/Os per second. Linux isn't completely off the hook, but generally does 1/3 to 1/2 of that; it also isn't that common with linked clones yet, as they aren't supported for Linux in VDI, though they will get kicked around in vCloud.

Managing Bad Block Alignment

When using linked clones, there are a few steps you can take to minimize the impact:
  • Refresh your images as often as you can. This keeps the journal file, and the corresponding system overhead, to a minimum. If you can't refresh images, you probably shouldn't be using linked clones.
  • Don't turn on extra array features, like array-based snapshots or de-duplication, for those volumes. The resources needed to track changes for both of these features can cause significant overhead. Use linked clones or de-dupe, not both.
  • Monitor your progress on a periodic basis. I do this 3 times a day so we can track changes over time. If you can't measure it, you can't fix it.
In the future, VAAI is promised to save us, by both VMware and every storage vendor that can spell. Its intent is to hand the same linked-clone operation to the storage array via an API call and let the array figure out the best method of managing the problem. I've yet to see it work in practice, it's still "in the next release", but I have hope.