DU Managers,
I got at least 17 messages asking me to share the wealth. Since the
messages keep coming, I decided to ignore the bandwidth and send
this long summary from the individuals who provided useful help and
scripts to monitor the hsz40.
Thanks, and again sorry for the bandwidth...
Ronny
The Walt Disney Company
Disney Studios, Burbank California.
--
Tell the boss what you really think of him...and the truth shall set you
free. --Railway Clerk
------------------------------------------------------------------------
***** Bob.Capps_at_pscmail.ps.net wrote ******
Ronny,
A simple crontab with captured output is really all you need.
00,30 * * * * /usr/bin/hszterm -f /dev/rrza24a "show this_controller full" >/tmp/hszterm.out
According to the manpage, if you pass a command string to hszterm,
that is all that it executes. The above crontab entry gave me the
following:
# cat /tmp/hszterm.out
Copyright Digital Equipment Corporation 1993, 1995. All rights reserved.
HSZ40 Firmware version V25Z-1, Hardware version B02
Last fail code: 018800A0
Press " ?" at any time for help.
HSZ>
Controller:
HSZ40 ZG54002333 Firmware V25Z-1, Hardware B02
Configured for dual-redundancy with ZG54302696
In dual-redundant configuration
SCSI address 7
Time: NOT SET
Host port:
SCSI target(s) (0, 2, 4, 5), Preferred target(s) (0, 2, 4, 5)
Cache:
32 megabyte write cache, version 2
Cache is GOOD
Battery is GOOD
No unflushed data in cache
CACHE_FLUSH_TIMER = 65535 (seconds)
CACHE_POLICY = A
Licensing information:
RAID (RAID Option) is ENABLED, license key is VALID
WBCA (Writeback Cache Option) is ENABLED, license key is VALID
MIRR (Disk Mirroring Option) is ENABLED, license key is VALID
Extended information:
Terminal speed 9600 baud, eight bit, no parity, 1 stop bit
Operation control: 00000004 Security state code: 6566
HSZ>
#
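Just stick this in some form of notification script with filters to
get the info you want:
# Count the "Cache is"/"Battery is" status lines that do not report GOOD.
RESCD=`egrep 'Cache is|Battery is' /tmp/hszterm.out | grep -v 'GOOD' |
wc -l | sed 's/ //g'`
if [ "$RESCD" != "0" ]; then
# Notify sysadmin
# ...
: # no-op so the branch is valid shell until a notifier is added
fi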
Bob
Perot Systems
bob.capps_at_ps.net
p.s. As you can see, my policy is set to 'A', but after reading your
notice at the bottom of your message, I think that I want it set to 'B'.
Thanks for the tip!
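Bob's filter idea could be wrapped in a small function along these lines. This is only a sketch: the name hsz_not_good is made up for illustration, the match strings assume the status lines read "Cache is ..." and "Battery is ..." as in the sample output above, and the mail recipient in the commented usage is a placeholder.

```shell
#!/bin/sh
# Sketch only: hsz_not_good is a hypothetical helper, not part of hszterm.
# It prints every "Cache is ..." / "Battery is ..." status line found in a
# captured "show this_controller full" listing that does not say GOOD.
hsz_not_good() {
    # $1 = file holding the captured hszterm output
    grep -E 'Cache is|Battery is' "$1" | grep -v 'GOOD'
}

# Possible use from cron (capture file name taken from the crontab entry,
# recipient "root" is an assumption):
# if [ -n "$(hsz_not_good /tmp/hszterm.out)" ]; then
#     hsz_not_good /tmp/hszterm.out | mailx -s "HSZ40 cache/battery alert" root
# fi
```

Matching on "Cache is" rather than "Cache" keeps the section header and the WBCA license line out of the result.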
**********************************************************************
Date: Monday, 24 March 1997 7:25am ET
To: Sendout
From: Stephen.Strobel_at_STC001
Subject: hsz40 question!
In-Reply-To: The letter of Friday, 21 March 1997 6:54pm ET
Ronny,
Most of the battery problems are "not necessarily" battery problems,
though they might be. HSOF versions prior to 2.7-2 caused batteries
to be reported bad when in reality they were OK. 2.7-2 (extra
patches) or 3.0-3 fixes this problem.
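A quick way to flag controllers running firmware older than 2.7-2 might look like the sketch below. It rests on an assumption that is not stated in this thread: that a banner string such as V25Z-1 encodes HSOF major 2, minor 5, patch 1. The name hsof_at_least is hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: "V25Z-1" means HSOF 2.5-1 (major, minor, patch).
# hsof_at_least is a hypothetical helper name.
# Returns 0 if the firmware is at or above the required level, 1 if it is
# older, and 2 if the string is not in the expected format.
hsof_at_least() {
    # $1 = firmware string such as V25Z-1; $2 $3 $4 = required major minor patch
    ver=$(printf '%s\n' "$1" |
        sed -n 's/^V\([0-9]\)\([0-9]\)[A-Z]-\([0-9][0-9]*\)$/\1 \2 \3/p')
    [ -n "$ver" ] || return 2
    # Re-pack the arguments as: major minor patch req_major req_minor req_patch
    set -- $ver "$2" "$3" "$4"
    have=$(( $1 * 10000 + $2 * 100 + $3 ))
    want=$(( $4 * 10000 + $5 * 100 + $6 ))
    [ "$have" -ge "$want" ]
}

# Possible use:
# hsof_at_least V25Z-1 2 7 2 || echo "firmware older than 2.7-2, battery status may be misreported"
```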
I have another question for you. I'm pushing DEC very, very hard to
get them to move the "Dual Pathing" issue up on the development plans.
Since this is a business-critical system, I assume that you have
dual-redundant controllers. Dual Pathing would provide two SCSI buses
to each controller pair. If a controller, cable, KZPSA, DWLPA or hose
went bad, the devices on that controller would fail over to the other
controller, and the devices at the OS level would also fail over. I
view this as a must for business-critical systems. If you feel the
same way, I encourage you to contact your DEC sales rep and let him
know. I'm pushing this all the way to Palmer.
Now, to answer your questions, yes you can script. Here is one I am
using:
#!/bin/ksh
#
sysname=$(hostname)
print "host is" $sysname > /usr/users/root/hsz40/hsz40_$(date +"%h-%d-%y").txt
#
# Run "show failedset" against every device named in hsz40_list.
function hsz40_check
{
while read -r HSZ
do
hszterm -f $HSZ "show failedset"
done < /usr/users/root/hsz40/hsz40_list
}
hsz40_check >> /usr/users/root/hsz40/hsz40_$(date +"%h-%d-%y").txt
#while read -r line
#do case $(date +"%a") in
# Mon|Tue|Wed|Thu|Fri) cat /usr/users/root/hsz40/hsz40_$(date \
#+"%h-%d-%y").txt | mailx -s "$(hostname)_$(date +"%h-%d-%y")_hsz40" \
#$line%stc001_at_nodea.steel
I'm currently not using the mail feature. I produce a summary report
where I grep for errors and such.
Hope this helps. Call if you wish.
Steve Strobel
616 248-7497
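The commented-out mail step in Steve's script stops partway through, so here is one way the weekday test and the mailx call could be finished. This is a sketch, not Steve's code: is_weekday and mail_report are hypothetical names, and the recipient is left as a parameter.

```shell
#!/bin/sh
# Sketch only: is_weekday and mail_report are hypothetical helper names.
is_weekday() {
    # $1 = day abbreviation as printed by "date +%a" in the C locale
    case "$1" in
        Mon|Tue|Wed|Thu|Fri) return 0 ;;
        *) return 1 ;;
    esac
}

mail_report() {
    # $1 = report file, $2 = recipient (placeholder, supply your own)
    if is_weekday "$(LC_ALL=C date +%a)"; then
        mailx -s "$(hostname)_$(date +%h-%d-%y)_hsz40" "$2" < "$1"
    fi
}

# Possible use, once the daily report has been written:
# mail_report /usr/users/root/hsz40/hsz40_$(date +"%h-%d-%y").txt "$recipient"
```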
**********************************************************************
***** Jeff.Beck_at_orcas.iasl.ca.boeing.com wrote ******
> Q: Is there a way to run hszterm (non-interactive/via crontab entry)
> and dump the output of "SHOW THIS_CONTROLLER FULL".
> It would be interesting to hear how other managers resolve this
> without using PolyCenter Console Manager type of software.
Ronny, here's the cron script I use, which is along the lines of what
you want to do, except I get paged via Console Manager. Jeff
#!/bin/ksh
####################################################
#                                                  #
# Boeing ASL NFS File Server                       #
#                                                  #
# Name: raid_check                                 #
#                                                  #
# This script polls the HSZ40 Failedsets and       #
# alarms to the syslog if a disk fails.            #
# The message is picked up by the Console          #
# Manager and a Sys-Administrator is notified.     #
#                                                  #
# Created: 23-Apr-1996 Ben Johnson                 #
#                                                  #
####################################################
# Check the controller on SCSI bus 2.
LUMP=`hszterm -b2 -t5 -l0 "show failed" | grep DISK | cut -c 44-54`
FDRV=`expr substr "$LUMP" 3 7`
# echo "$FDRV"
if [ `expr "$FDRV" : "DISK"` != 0 ] ; then
logger -p 2 "RAID_check: HSZ #1 SCSI #2 ${FDRV%' '} has FAILED"
# echo "`hostname -s`: RAID_check: HSZ #1 SCSI #2 ${FDRV%' '} has FAILED"
fi
# Check the controller on SCSI bus 3.
LUMP=`hszterm -b3 -t5 -l0 "show failed" | grep DISK | cut -c 44-54`
FDRV=`expr substr "$LUMP" 3 7`
# echo "$FDRV"
if [ `expr "$FDRV" : "DISK"` != 0 ] ; then
logger -p 2 "RAID_check: HSZ #1 SCSI #3 ${FDRV%' '} has FAILED"
# echo "`hostname -s`: RAID_check: HSZ #1 SCSI #3 ${FDRV%' '} has FAILED"
fi
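Since Jeff's script repeats the same block for each bus, the check could be folded into a loop. The sketch below swaps the fixed-column cut for a pattern match, which is a different technique from the original; it assumes failed members appear as whitespace-separated names of the form DISK followed by digits. The name failed_disks is hypothetical.

```shell
#!/bin/sh
# Sketch only: failed_disks is a hypothetical helper name.
# Assumption: failed members show up in "show failed" output as
# whitespace-separated tokens of the form DISK<digits>.
failed_disks() {
    # Reads "show failed" output on stdin, prints one DISKnnn name per line.
    tr -s ' \t' '\n\n' | grep -E '^DISK[0-9]+$'
}

# Possible use, with the bus/target/lun flags from the script above:
# for bus in 2 3; do
#     for d in $(hszterm -b$bus -t5 -l0 "show failed" | failed_disks); do
#         logger -p 2 "RAID_check: HSZ #1 SCSI #$bus $d has FAILED"
#     done
# done
```

Unlike the fixed-column version, this reports every failed disk on a bus, not just the first one, at the cost of depending on the naming assumption.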
**********************************************************************
> Q: Is there a way to run hszterm (non-interactive/via crontab entry)
> and dump the output of "SHOW THIS_CONTROLLER FULL".
Yes, with hszterm; you'll need to install:
SWACLI11A installed HSZ40 Array Controller Utility (Alpha)
> It would be interesting to hear how other managers resolve this
> without using PolyCenter Console Manager type of software.
We use PolyCenter Console Manager to retain console logs of our 7
HSZs and our three primary systems... it's been invaluable for
troubleshooting.
Besides that, we have a nightly script that polls the HSZs for changes
and problems. I'll attach the script. It and some other tools can
be obtained via anonymous ftp from
raven.alaska.edu:/pub/sois/UA_DUtools.tar.Z (the script may invoke a
ua* program for massaging data).
Battery problems are (allegedly) effectively resolved by using the
newer batteries... I have more information buried someplace if you need
it... off the top of my head, use the EDI ones only (scrap the
Hyundai). Also, if you run dual-redundant (v3.0 only, v2.7 doesn't cut
it) you are protected... it will (allegedly) fail over on low battery
in v3.0.
kurt
#!/bin/ksh
#Copyright (c) 1996-1997 by University of Alaska Computer Network
#
#950120 hszterm.ksh gather hsz configuration, report changes
#
#970119 sxkac change alert address to sdsys (alias to systems folks)
#960922 sxkac poll consoles separately; show raidset full
#960730 sxkac 1r on spike and 3n on nugget
#960511 sxkac modified reporting for hsz v2.7; deleted older history
###############################################################################
# ALERT="sxkac "
ALERT="sdsys " # mail addresses for reporting
sanity="java" # sanity node for configuration copies
if (test -z "$UA_Profile") then # has our profile executed?
. ./.profile # nope, do it now (must be an rsh)
fi
cd $HOME/config # stick it in our config directory
hostname=$(uname -n)
hostname=${hostname%%.*}
mv $hostname/hsz_*.* old
ER_LOG="$hostname/hsz_term.errors"
###############################################################################
function check # check if sts ok
{
echo "
Check: $1"
eval $1 # execute command
sts=$? # capture status
if ((0 == $sts))
then
return; # command ok? return...
fi
echo "Error($sts): $1"
echo "Error($sts): $1" >> $ER_LOG
return
}
#------------------------------------------------------------------------------
function err_chk # check if sts ok
{
echo "Error($sts): $1"
echo "Error($sts): $1" >> $ER_LOG
return
}
#------------------------------------------------------------------------------
function get_hsz # get hsz information
{
sudo hszterm -f /dev/${1} \
"show devices full" > $hostname/hsz_${2}.devi
sts=$?
if ((0 != $sts)) then err_chk "hsz(${2}) show devices"; fi
sudo hszterm -f /dev/${1} \
"show units full" > $hostname/hsz_${2}.unit
sts=$?
if ((0 != $sts)) then err_chk "hsz(${2}) show units "; fi
sudo hszterm -f /dev/${1} \
"show raid full" > $hostname/hsz_${2}.raid
sts=$?
if ((0 != $sts)) then err_chk "hsz(${2}) show raid "; fi
sudo hszterm -f /dev/${1} \
"show mirror full" >> $hostname/hsz_${2}.raid
sts=$?
if ((0 != $sts)) then err_chk "hsz(${2}) show mirror"; fi
sudo hszterm -f /dev/${1} \
"show stripe full" >> $hostname/hsz_${2}.raid
sts=$?
if ((0 != $sts)) then err_chk "hsz(${2}) show stripe"; fi
sudo hszterm -f /dev/${1} \
"show this full" > $hostname/hsz_${2}.this
sts=$?
if ((0 != $sts)) then err_chk "hsz(${2}) show this "; fi
if [ -z "$3" ]; then
touch $hostname/hsz_${2}.othr
else
sudo hszterm -f /dev/${3} \
"show this full" > $hostname/hsz_${2}.othr
sts=$?
if ((0 != $sts)) then err_chk "hsz(${2}) show other"; fi
fi
sudo hszterm -f /dev/${1} \
"run fmu" "show last most" > $hostname/hsz_${2}.errs
sts=$?
if ((0 != $sts)) then err_chk "hsz(${2}) run fmu ..."; fi
}
#=================================
function diffchk
{
# check two files for differences, based on: diffchk $(ls file.*)
# exit 0 identical
# exit 1 differences (if $OUT write $OUT.out; if $OLD mv file)
#
if ((2 != $#)) then
echo "Error($#): Incorrect argument count: $1 $2 $3 ..."
return 2
fi
if (test ! -z "$UAKDF") then # UAKDF requested?
DIFF="uakdf $UAKDF $2 $1"
else
DIFF="diff $2 $1"
fi
$DIFF
sts=$?
if ((0 == $sts)) then
echo "Identical, $2 deleted and $1 retained."
rm $2
return 0
fi
if (test ! -z "$OUT") then $DIFF > $OUT.out ; fi
if (test ! -z "$OLD") then mv $1 $OLD ; fi
return 1
}
###############################################################################
# hszterm.ksh
case "$hostname" in # so where are we?
glacier )
get_hsz rrz60c 1f rrz58c # SW-1 Front HSZ40
get_hsz rrz28c 2f rrz26c # SW-2 Front HSZ40
check " rsh spike job/hszterm.ksh"
check " rcp -p spike:config/spike/hsz_* $hostname"
check " rsh nugget job/hszterm.ksh"
check " rcp -p nugget:config/nugget/hsz_* $hostname"
check " rcp -p $hostname/hsz_* ${sanity}:config/$hostname"
;;
spike )
get_hsz rrz17c 1r rrzd20c # SW-1 Rear HSZ40
if [ -r $ER_LOG ]; then exit 1
else exit 0
fi
;;
nugget )
get_hsz rrz17c 3n # SW-3 n/a HSZ40
if [ -r $ER_LOG ]; then exit 1
else exit 0
fi
;;
* )
echo "$hostname is not configured in this procedure." \
>> $ER_LOG
;;
esac
#------------------------------------------------------------------------------
if [[ -r $ER_LOG ]]; then
echo "
Sending mail to: $ALERT " >> $ER_LOG
cat $ER_LOG \
| mailx -s "$hostname hszterm failed" $ALERT
exit 0 # always exit successfully
fi
#------------------------------------------------------------------------------
cd $HOME/config/$hostname # change to our config directory
stamp=hsz_$(date +%y%m%d)
if [ -e $hostname/*$stamp ]; then # we've already run once today...
stamp=hsz_$(date +%y%m%d%H%M%S)
fi
rm -f ../out/*hsz*.out ../out/*hsz*.msg
OLD=$HOME/config/old
unset UAKDF
touch ../out/$stamp.msg
echo "
______________________________________________________________________________
Report hsz show device / unit
"
grep " disk " hsz_*.devi > x.0
uakce -m75,84,22 x.0 -o x.1
grep " D" hsz_*.unit > x.0
uakce -m58,67,22 x.0 -o x.2
sort -k1.5,1.32 -o hsz.$stamp x.1 x.2
rm x.*
OUT=../out/hsz_sum_$stamp
UAKDF="-c5,32,18 -v"
diffchk $(ls hsz.*)
if ((1 == $?)) then
echo "
HSZ changes:
=== =======" >> ../out/$stamp.msg
cat $OUT.out >> ../out/$stamp.msg
fi
echo "
______________________________________________________________________________
Report hsz fmu show last
"
grep -ve 'HSZ>
for help.
Copyright' hsz_*.errs > hszerr.$stamp
unset UAKDF
unset OUT
diffchk $(ls hszerr.*)
if ((1 == $?)) then
echo "
HSZ errors:
=== ======" >> ../out/$stamp.msg
cat $(ls hszerr.*) >> ../out/$stamp.msg
touch ../out/hszerr_$stamp.out
fi
echo "
______________________________________________________________________________
Report hsz show this & show other
"
grep -ve 'Time:
flushed data in cache' hsz_*.this hsz_*.othr > hszthis.$stamp
unset UAKDF
OUT=../out/hszthis_$stamp
diffchk $(ls hszthis.*)
if ((1 == $?)) then
echo "
HSZ show_this:
=== =========" >> ../out/$stamp.msg
cat $OUT.out >> ../out/$stamp.msg
fi
echo "
______________________________________________________________________________
"
ls ../out/*hsz*.out
if ((0 == $?))
then
echo "Sending mail to: $ALERT"
cat ../out/$stamp.msg \
| mailx -s "$hostname hsz40 config changes" $ALERT
cat ../out/$stamp.msg
else
echo "There were NO configuration changes found."
fi
###############################################################################
exit 0 # always exit successfully
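For sites without the UA_DUtools kit, the core idea of diffchk (compare the previous snapshot with the new one and keep the differences only when something changed) can be sketched with plain diff. This is not Kurt's function: snapshot_changed is a hypothetical name, and it uses diff in place of uakdf.

```shell
#!/bin/sh
# Sketch only: snapshot_changed is a hypothetical name; it uses plain diff
# rather than the uakdf tool from the UA_DUtools kit.
# Returns 0 when the snapshots differ (diff output left in $3),
# returns 1 when they are identical (no output file is kept).
snapshot_changed() {
    # $1 = previous snapshot, $2 = new snapshot, $3 = file to receive the diff
    if diff "$1" "$2" > "$3"; then
        rm -f "$3"
        return 1
    fi
    return 0
}

# Possible use (file names and recipient are placeholders):
# if snapshot_changed hsz.old hsz.new /tmp/hsz_changes.out; then
#     mailx -s "$(hostname) hsz40 config changes" "$ALERT" < /tmp/hsz_changes.out
# fi
```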
__________________________________________________________________
Kurt Carlson, University of Alaska, (907)474-6266 sxkac_at_alaska.edu
Received on Tue Mar 25 1997 - 23:46:11 NZST