For those that don't know about misaligned I/Os, I provided a brief introduction to them in my last
post. In this post I'll show you how to track and quantify how many of your I/Os exhibit the problem. We currently run our VMware infrastructure from a NetApp array, so this script is tailored to that environment.
The only method I've found to track is by using the
pw.over_limit counter. Unfortunately it is only available with advanced privileges on the command line as it isn't exported via SNMP. You can manually obtain the data with the following:
# priv set advanced
# wafl_susp -w
# priv set
This will produce lots of data, most of which you can ignore. You'll quickly notice a problem; pw.over_limit is an absolute number, like 29300. So what? Is that good, is it bad? That depends on how many I/Os you are generating in the first place. Since I love writing little programs, here is the output of my script:
$ cat NetApp-align-filer1-20111010-221001
22:10:01 UTC - Zeroing stats from filer filer1
22:10:02 UTC - sleeping for 300 seconds
22:15:02 UTC - Capturing data from filer filer1
22:15:03 UTC - Collected Values:
22:15:03 UTC - interval = 300.399491 seconds, pw.over_limit = 249 and WAFL_WRITE = 540822
22:15:03 UTC - Percentage Misaligned Writes = .0460%
22:15:03 UTC - Successfully Completed!
As you can see, there are a few misaligned writes going on, but overall in pretty good health with just under 0.05% of the total. When should you panic? That depends on way to many variables to list here but your write latency is a key indicator to pay attention to. Your application mix (and users) will scream when you've crossed the line.
The code I use is listed below. Read the comments in the parse_wafl function to figure out what it's doing.
#!/bin/bash
# Script Information
# na_alignment_check Version: 1.0
# Last Modified: Sept 18/2010
# Created By: Michael England
# to run this script you will need an account on the filer with administrative privileges as it has to be run with 'priv set advanced'
# ssh prep work
# ssh-keygen -t dsa -b 1024
# cat id_dsa.pub >> {filer}/etc/sshd/{user}/.ssh/authorized_keys
# default to local user
FILER_USER=$(whoami)
######
### function log
### logs activities to the screen, a file, or both
######
function log {
LOG_TYPE="$1"
LOG_MSG="$2"
TIME=`date +'%H:%M:%S %Z'`
# specify the log file only once
if [ -z $LOG_FILE ]
then
LOG_FILE="/work/Scripts/NetApp_Logs/NetApp-align-$FILER-`date +%Y%m%d-%H%M%S`"
fi
if [ $LOG_TYPE == "error" ]
then
echo -e "$TIME - **ERROR** - $LOG_MSG"
echo -e "$TIME - **ERROR** - $LOG_MSG" >> $LOG_FILE
elif [ $LOG_TYPE == "debug" ]
then
if [ $DEBUG == "on" ]
then
echo -e "DEBUG - $LOG_MSG"
fi
else
echo -e "$TIME - $LOG_MSG"
echo -e "$TIME - $LOG_MSG" >> "$LOG_FILE"
fi
}
######
### check_ssh
### check the return code of an ssh command
######
function check_ssh {
CHECK=$1
ERROR_DATA="$2"
if [ $CHECK != 0 ]
then
log error "ssh failed to filer"
log error "return is $ERROR_DATA"
exit 1
fi
}
######
### capture_wafl_susp
### ssh to the filer specified and collect the output from wafl_susp
######
function capture_wafl_susp {
log notify "Capturing data from filer $FILER"
SSH_RET=`ssh $FILER -l $FILER_USER "priv set advanced;wafl_susp -w;priv set" 2>&1`
RETVAL=$?
check_ssh $RETVAL "$SSH_RET"
parse_wafl $SSH_RET
}
######
### parse_wafl
### capture values for pw.over_limit and WAFL_WRITE and the overall scan interval
### e.g.
### WAFL statistics over 577455.189585 second(s) ...
### pw.over_limit = 29300
### New messages, restarts, suspends, and waffinity completions (by message-type):
### WAFL_WRITE = 10568010 122597 122597 0
###
### There are many other WAFL_WRITE lines so we need to find the New messages line first then the WAFL_WRITE in that section
######
function parse_wafl {
oldIFS=$IFS
IFS=$'\n'
NEW_FLAG=0
for line in echo $SSH_RET
do
if [[ $line =~ second* ]]
then
STATS_INTERVAL=`echo $line | awk '{print $4}'`
fi
if [[ $line =~ pw.over_limit* ]]
then
OVER_LIMIT=`echo $line | awk '{print $3}'`
fi
if [[ $line =~ New\ messages.* ]]
then
NEW_FLAG=1
fi
if [[ $NEW_FLAG == 1 ]] && [[ $line =~ WAFL_WRITE* ]]
then
WAFL_WRITE=`echo $line | awk '{print $3}'`
NEW_FLAG=0
fi
done
IFS=$oldIFS
if [[ -n $OVER_LIMIT ]] && [[ $WAFL_WRITE -gt 0 ]]
then
# by multiplying by 100 first we don't loose any precision
MISALIGNED_PERCENT=`echo "scale=4;100*$OVER_LIMIT/$WAFL_WRITE" | bc -l`
else
log error "Error collecting values, pw.over_limit = $OVER_LIMIT and WAFL_WRITE = $WAFL_WRITE"
fi
log notify "Collected Values:"
log notify "interval = $STATS_INTERVAL seconds, pw.over_limit = $OVER_LIMIT and WAFL_WRITE = $WAFL_WRITE"
log notify "Percentage Misaligned Writes = ${MISALIGNED_PERCENT}%"
}
######
### function zero_values
### zeroes out existing wafl stats on the filer
######
function zero_values {
log notify "Zeroing stats from filer $FILER"
SSH_RET=`ssh $FILER -l $FILER_USER "priv set advanced;wafl_susp -z;priv set" 2>&1`
RETVAL=$?
check_ssh $RETVAL "$SSH_RET"
}
######
### function usage
### simple user information for running the script
######
function usage {
echo -e ""
echo "Usage:"
echo "`basename $0` -filer {filer_name} [-username {user_name}] [-poll_interval ]"
echo -e "\t-filer {file_name} is a fully qualified domain name or IP of a filer to poll"
echo -e "\t-username {user_name} is a user to attach to the filer, if omitted will use current user"
echo -e "\t-poll_interval {x} will zero out the filer stats and sleep for {x} seconds then return. If omitted will read stats since last zeroed"
echo -e ""
exit 0
}
# parse command line options
if [ $# == 0 ]
then
usage
fi
until [ -z "$1" ]
do
case "$1" in
-filer)
shift
FILER="$1"
;;
-username)
shift
FILER_USER="$1"
;;
-poll_interval)
shift
POLL_INTERVAL="$1"
;;
*)
usage
;;
esac
shift
done
# do the work
if [[ -n $POLL_INTERVAL ]]
then
zero_values
log notify "sleeping for $POLL_INTERVAL seconds"
sleep $POLL_INTERVAL
fi
capture_wafl_susp
log notify "Successfully Completed!"