Sunday, November 13, 2011

Misaligned I/O reporting

In the last post I showed a script for logging misaligned I/Os on a NetApp storage array. However, it is nice to produce some graphs of the data, so here is my quick parse script to crank out a CSV file for a given filer. If you recall, the output from the collection script was as follows:
22:10:01 UTC - Zeroing stats from filer np00003
22:10:02 UTC - sleeping for 300 seconds
22:15:02 UTC - Capturing data from filer np00003
22:15:03 UTC - Collected Values:
22:15:03 UTC - interval = 300.399491 seconds, pw.over_limit = 249 and WAFL_WRITE = 540822
22:15:03 UTC - Percentage Misaligned Writes = .0460%
22:15:03 UTC - Successfully Completed!
My logs are kept under ./NetApp_Logs/, with one file per run named NetApp-align-{filer}-{yyyymmdd}-{hhmmss}. This little script walks through all of the logs for a given filer and outputs a CSV as shown here:
2011-10-10,22:15:03,.0460%
2011-10-11,10:15:03,.0670%
2011-10-11,22:15:03,.1500%
2011-10-12,20:15:03,.4500%
I have found this format works well with spreadsheet programs, as they parse out both the date and the time correctly, which saves me a lot of time. And here is the script:
$ cat na_parse_alignment
if [ "$#" == 0 ]
then
        echo "Usage: `basename $0` "
        echo "e.g."
        echo "`basename $0` {filer}"
        exit 0
fi
FILER=$1
grep -H Percent NetApp_Logs/NetApp-align-${FILER}-*  | sed -e 's/\(2011\)\([0-9].\)\([0-9].\)\(-[0-9].....:\)/ \1-\2-\3 /g' | awk '{print $2" "$3" "$NF}' | tr " " , 
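For completeness, here's how I typically run it; the filer name and output file below are just examples, and the resulting CSV opens straight in a spreadsheet:
$ ./na_parse_alignment np00003 > np00003-alignment.csv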

Sunday, November 6, 2011

Tracking Misaligned I/Os

For those who don't know about misaligned I/Os, I provided a brief introduction to them in my last post. In this post I'll show you how to track and quantify how many of your I/Os exhibit the problem. We currently run our VMware infrastructure from a NetApp array, so this script is tailored to that environment.

The only method I've found to track them is the pw.over_limit counter. Unfortunately, it is only available on the command line with advanced privileges, as it isn't exported via SNMP. You can obtain the data manually with the following:
# priv set advanced
# wafl_susp -w
# priv set
This will produce lots of data, most of which you can ignore. You'll quickly notice a problem: pw.over_limit is an absolute number, like 29300. So what? Is that good or bad? That depends on how many I/Os you are generating in the first place. Since I love writing little programs, here is the output of my script:
$ cat NetApp-align-filer1-20111010-221001
22:10:01 UTC - Zeroing stats from filer filer1
22:10:02 UTC - sleeping for 300 seconds
22:15:02 UTC - Capturing data from filer filer1
22:15:03 UTC - Collected Values:
22:15:03 UTC - interval = 300.399491 seconds, pw.over_limit = 249 and WAFL_WRITE = 540822
22:15:03 UTC - Percentage Misaligned Writes = .0460%
22:15:03 UTC - Successfully Completed! 
As you can see, there are a few misaligned writes going on, but overall the filer is in pretty good health, with misaligned writes at just under 0.05% of the total. When should you panic? That depends on way too many variables to list here, but your write latency is a key indicator to pay attention to. Your application mix (and users) will scream when you've crossed the line.
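If you want to check the math by hand, the percentage is simply pw.over_limit divided by WAFL_WRITE, multiplied by 100; this is the same calculation the script hands to bc. Using the numbers from the run above:
$ echo "scale=4; 100*249/540822" | bc -l
.0460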

The code I use is listed below. Read the comments in the parse_wafl function to figure out what it's doing.
#!/bin/bash
# Script Information
# na_alignment_check Version: 1.0
# Last Modified: Sept 18/2010
# Created By: Michael England

# to run this script you will need an account on the filer with administrative privileges as it has to be run with 'priv set advanced'
# ssh prep work
# ssh-keygen -t dsa -b 1024
# cat id_dsa.pub >> {filer}/etc/sshd/{user}/.ssh/authorized_keys
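# a quick way to confirm key-based login works before relying on this script, e.g.:
# ssh {filer} -l {user} version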

# default to local user
FILER_USER=$(whoami)

######
### function log
### logs activities to the screen, a file, or both
######
function log {
 LOG_TYPE="$1"
 LOG_MSG="$2"
 TIME=`date +'%H:%M:%S %Z'`
 # specify the log file only once
 if [ -z "$LOG_FILE" ]
 then
  LOG_FILE="/work/Scripts/NetApp_Logs/NetApp-align-$FILER-`date +%Y%m%d-%H%M%S`"
 fi
 if [ "$LOG_TYPE" == "error" ]
 then
  echo -e "$TIME - **ERROR** - $LOG_MSG"
  echo -e "$TIME - **ERROR** - $LOG_MSG" >> "$LOG_FILE"
 elif [ "$LOG_TYPE" == "debug" ]
 then
  if [ "$DEBUG" == "on" ]
  then
   echo -e "DEBUG - $LOG_MSG"
  fi
 else
  echo -e "$TIME - $LOG_MSG"
  echo -e "$TIME - $LOG_MSG" >> "$LOG_FILE"
 fi
}

######
### check_ssh
### check the return code of an ssh command
######
function check_ssh {
 CHECK=$1
 ERROR_DATA="$2"
 if [ $CHECK != 0 ]
 then
  log error "ssh failed to filer"
  log error "return is $ERROR_DATA"
  exit 1
 fi
}

######
### capture_wafl_susp
### ssh to the filer specified and collect the output from wafl_susp
######

function capture_wafl_susp {
 log notify "Capturing data from filer $FILER"
 SSH_RET=`ssh $FILER -l $FILER_USER "priv set advanced;wafl_susp -w;priv set" 2>&1`
 RETVAL=$?
 check_ssh $RETVAL "$SSH_RET"

 parse_wafl "$SSH_RET"
}

######
### parse_wafl
### capture values for pw.over_limit and WAFL_WRITE and the overall scan interval
### e.g.
### WAFL statistics over 577455.189585 second(s) ...
### pw.over_limit = 29300
### New messages, restarts, suspends, and waffinity completions (by message-type):
### WAFL_WRITE           = 10568010   122597   122597        0
### 
### There are many other WAFL_WRITE lines so we need to find the New messages line first then the WAFL_WRITE in that section
######
function parse_wafl {
 oldIFS=$IFS
 IFS=$'\n'
 NEW_FLAG=0
 for line in $SSH_RET
 do
  if [[ $line =~ second* ]]
  then
   STATS_INTERVAL=`echo $line | awk '{print $4}'`
  fi

  if [[ $line =~ pw.over_limit* ]]
  then
   OVER_LIMIT=`echo $line | awk '{print $3}'`
  fi

  if [[ $line =~ New\ messages.* ]]
  then
   NEW_FLAG=1
  fi
  if [[ $NEW_FLAG == 1 ]] && [[ $line =~ WAFL_WRITE* ]]
  then
   WAFL_WRITE=`echo $line | awk '{print $3}'`
   NEW_FLAG=0
  fi
 done
 IFS=$oldIFS
 if [[ -n $OVER_LIMIT ]] && [[ $WAFL_WRITE -gt 0 ]]
 then 
  # by multiplying by 100 first we don't lose any precision
  MISALIGNED_PERCENT=`echo "scale=4;100*$OVER_LIMIT/$WAFL_WRITE" | bc -l`
 else
  log error "Error collecting values, pw.over_limit = $OVER_LIMIT and WAFL_WRITE = $WAFL_WRITE"
 fi
 
 log notify "Collected Values:"
 log notify "interval = $STATS_INTERVAL seconds, pw.over_limit = $OVER_LIMIT and WAFL_WRITE = $WAFL_WRITE"
 log notify "Percentage Misaligned Writes = ${MISALIGNED_PERCENT}%"
}

######
### function zero_values
### zeroes out existing wafl stats on the filer
######
function zero_values {
 log notify "Zeroing stats from filer $FILER"
 SSH_RET=`ssh $FILER -l $FILER_USER "priv set advanced;wafl_susp -z;priv set" 2>&1`
 RETVAL=$?
 check_ssh $RETVAL "$SSH_RET"
}

######
### function usage
### simple user information for running the script
######
function usage {
 echo -e ""
 echo "Usage:"
 echo "`basename $0` -filer {filer_name} [-username {user_name}] [-poll_interval ]"
 echo -e "\t-filer {file_name} is a fully qualified domain name or IP of a filer to poll"
 echo -e "\t-username {user_name} is a user to attach to the filer, if omitted will use current user"
 echo -e "\t-poll_interval {x} will zero out the filer stats and sleep for {x} seconds then return.  If omitted will read stats since last zeroed"
 echo -e ""
 exit 0
}

# parse command line options
if [ $# == 0 ]
then
 usage
fi
until [ -z "$1" ]
do
 case "$1" in
 -filer)
  shift
  FILER="$1"
  ;;
 -username)
  shift
  FILER_USER="$1"
  ;;
 -poll_interval)
  shift
  POLL_INTERVAL="$1"
  ;;
 *)
  usage
  ;;
 esac
 shift
done
# do the work

if [[ -n $POLL_INTERVAL ]]
then
 zero_values
 log notify "sleeping for $POLL_INTERVAL seconds"
 sleep $POLL_INTERVAL
fi
capture_wafl_susp

log notify "Successfully Completed!"