SUMMARY:hsz40 question

From: Ronny Eliahu <ronny_eliahu_at_corp.disney.com>
Date: Tue, 25 Mar 97 14:23:59 PST

     DU Managers,
     
     I got at least 17 messages asking me to share the wealth. Since the
     messages keep coming, I decided to ignore the bandwidth and send
     this long summary from the individuals who provided useful help and
     scripts to monitor the hsz40.
     
     Thanks and again sorry for the bandwidth...
     
     Ronny
     The Walt Disney Company
     Disney Studios, Burbank California.
     --
     
     Tell the boss what you really think of him...and the truth shall set you
     free. --Railway Clerk
     ------------------------------------------------------------------------
     
     ***** Bob.Capps_at_pscmail.ps.net wrote ******
     
     Ronny,
     
     A simple crontab with captured output is really all you need.
     
     00,30 * * * * /usr/bin/hszterm -f /dev/rrza24a "show this_controller full" >/tmp/hszterm.out
     
     According to the manpage, if you pass a command string to hszterm,
     that is all that it executes. The above crontab entry gave me the
     following:
     
     # cat /tmp/hszterm.out
     
     
     Copyright Digital Equipment Corporation 1993, 1995. All rights reserved.
     HSZ40 Firmware version V25Z-1, Hardware version B02
     
     Last fail code: 018800A0
     
     Press " ?" at any time for help.
     
     
     HSZ>
     Controller:
     HSZ40 ZG54002333 Firmware V25Z-1, Hardware B02
     Configured for dual-redundancy with ZG54302696
     In dual-redundant configuration
     SCSI address 7
     Time: NOT SET
     Host port:
     SCSI target(s) (0, 2, 4, 5), Preferred target(s) (0, 2, 4, 5)
     Cache:
     32 megabyte write cache, version 2
     Cache is GOOD
     Battery is GOOD
     No unflushed data in cache
     CACHE_FLUSH_TIMER = 65535 (seconds)
     CACHE_POLICY = A
     Licensing information:
     RAID (RAID Option) is ENABLED, license key is VALID
     WBCA (Writeback Cache Option) is ENABLED, license key is VALID
     MIRR (Disk Mirroring Option) is ENABLED, license key is VALID
     Extended information:
     Terminal speed 9600 baud, eight bit, no parity, 1 stop bit
     Operation control: 00000004 Security state code: 6566
     HSZ>
     
     
     #
     
     Just stick this in some form of notification script with filters to
     get the info you want:
     
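     # Count cache/battery status lines that are not GOOD. egrep is needed
     # for the '|' alternation; matching "Cache is"/"Battery is" skips the
     # "Cache:" heading and the "Writeback Cache Option" license line.
     RESCD=`egrep 'Cache is|Battery is' /tmp/hszterm.out | grep -v 'GOOD' | wc -l | sed 's/ //g'`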
     if [ "$RESCD" != "0" ]; then
     # Notify sysadmin
     # ...
     fi
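
     As a slightly fuller sketch of the filter idea above: the function
     below prints any "Cache is" or "Battery is" status line from a
     captured hszterm output file that does not say GOOD. The helper name
     hsz_bad_status and the mail recipient in the commented example are
     made up for illustration, and the status wording is assumed to match
     the sample output shown earlier.

```shell
#!/bin/sh
# Sketch only: hsz_bad_status is a hypothetical helper name.
# Prints cache/battery status lines that do not contain GOOD.
hsz_bad_status() {
    # $1 = file holding captured "show this_controller full" output
    # Matching "<word> is " skips the "Cache:" heading and the
    # "Writeback Cache Option" license line.
    awk '/(Cache|Battery) is / && !/GOOD/' "$1"
}

# Example use from cron (commented out so sourcing has no side effects;
# the recipient "root" is an assumption):
# BAD=$(hsz_bad_status /tmp/hszterm.out)
# if [ -n "$BAD" ]; then
#     echo "$BAD" | mailx -s "hsz40 status on `hostname`" root
# fi
```

     Printing the offending lines instead of only counting them means the
     notification can say which component is unhappy.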
     
     Bob
     
     Perot Systems
     bob.capps_at_ps.net
     
     p.s. As you can see, my policy is set to 'A' but after reading your
     notice at the bottom of your message, I think that I want it set to
     'B'. Thanks for the tip!
     
     **********************************************************************
     Date: Monday, 24 March 1997 7:25am ET To: Sendout
     From: Stephen.Strobel_at_STC001
     Subject: hsz40 question!
     In-Reply-To: The letter of Friday, 21 March 1997 6:54pm ET
     
     
     Ronny,
     
     Most of the battery problems are not necessarily battery problems,
     though they might be. HSOF versions prior to 2.7-2 caused batteries
     to be reported bad when in reality they were OK. 2.7-2 (extra
     patches) or 3.0-3 fixes this problem.
     
     I have another question for you. I'm pushing DEC very, very hard to
     get them to move the "Dual Pathing" issue up on the development
     plans. Because this is a business-critical system, I assume that you
     have dual-redundant controllers. Dual Pathing would provide two SCSI
     busses to each controller pair. If a controller, cable, KZPSA, DWLPA
     or hose went bad, the devices on the controller would fail over to
     the other controller and the devices at the OS level would also fail
     over. I view this as a must for business-critical systems. If you
     feel the same way, I encourage you to contact your DEC sales rep and
     let him know. I'm pushing this all the way to Palmer.
     
     Now, to answer your questions, yes you can script. Here is one I am
     using:
     
     #!/bin/ksh
     #
     sysname=$(hostname)
     print "host is" $sysname > /usr/users/root/hsz40/hsz40_$(date +"%h-%d-%y").txt
     #
     function hsz40_check
     {
     while read -r HSZ
     do
     hszterm -f $HSZ "show failedset"
     done < /usr/users/root/hsz40/hsz40_list
     }
     hsz40_check >> /usr/users/root/hsz40/hsz40_$(date +"%h-%d-%y").txt
     #while read -r line
     #do case $(date +"%a") in
     # Mon|Tue|Wed|Thu|Fri) cat /usr/users/root/hsz40/hsz40_$(date +"%h-%d-%y").txt \
     #   | mailx -s "$(hostname)_$(date +"%h-%d-%y")_hsz40" \
     #   $line%stc001_at_nodea.steel
     
     I'm currently not using the mail feature. I produce a summary report
     where I grep for errors and such.
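
     One way to get a report that is easy to grep is to label each
     controller's output with its device name and to flag a failed
     hszterm run explicitly. The sketch below assumes the same kind of
     list file as the script above (one raw device path per line);
     poll_failedsets and the HSZTERM override variable are illustrative
     names, not part of hszterm.

```shell
#!/bin/sh
# Sketch only. HSZTERM and poll_failedsets are illustrative names.
# HSZTERM lets you point at a different binary (or a stub for testing).
HSZTERM=${HSZTERM:-hszterm}

poll_failedsets() {
    # $1 = list file with one HSZ raw device path per line
    while read -r dev; do
        [ -n "$dev" ] || continue          # skip blank lines
        echo "=== $dev ==="
        # </dev/null keeps the command from reading the rest of the list
        "$HSZTERM" -f "$dev" "show failedset" </dev/null \
            || echo "ERROR: hszterm failed on $dev"
    done < "$1"
}

# Example (not run here):
# poll_failedsets /usr/users/root/hsz40/hsz40_list > report.txt
# grep -c '^ERROR' report.txt
```

     The "=== device ===" markers make it possible to tell which
     controller a given line of the report came from.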
     
     Hope this helps. Call if you wish.
     
     Steve Strobel
     616 248-7497
     **********************************************************************
     
     ***** Jeff.Beck_at_orcas.iasl.ca.boeing.com wrote ******
     
> Q: Is there a way to run hszterm (non-interactive/via crontab entry)
> and dump the output of "SHOW THIS_CONTROLLER FULL".
> It would be interesting to hear how other managers resolve this
> without using PolyCenter Console Manager type of software.
     
     Ronny, here's the cron script I use which is along the lines of what
     you want to do, except I get paged via Console Manager.

     Jeff
     
     
     #!/bin/ksh
     ####################################################
     #                                                  #
     #  Boeing ASL NFS File Server                      #
     #                                                  #
     #  Name: raid_check                                #
     #                                                  #
     #  This script polls the HSZ40 Failedsets and      #
     #  alarms to the syslog if a disk fails.           #
     #  The message is picked up by the Console         #
     #  Manager and a Sys-Administrator is notified.    #
     #                                                  #
     #  Created: 23-Apr-1996 Ben Johnson                #
     #                                                  #
     ####################################################
     
     LUMP=`hszterm -b2 -t5 -l0 "show failed" | grep DISK | cut -c 44-54`
     FDRV=`expr substr "$LUMP" 3 7`
     # echo "$FDRV"
     if [ `expr "$FDRV" : "DISK"` != 0 ] ; then
     logger -p 2 "RAID_check: HSZ #1 SCSI #2 ${FDRV%' '} has FAILED"
     # echo "`hostname -s`: RAID_check: HSZ #1 SCSI #3 ${FDRV%' '} has FAILED"
     fi
     
     LUMP=`hszterm -b3 -t5 -l0 "show failed" | grep DISK | cut -c 44-54`
     FDRV=`expr substr "$LUMP" 3 7`
     # echo "$FDRV"
     if [ `expr "$FDRV" : "DISK"` != 0 ] ; then
     logger -p 2 "RAID_check: HSZ #1 SCSI #3 ${FDRV%' '} has FAILED"
     # echo "`hostname -s`: RAID_check: HSZ #1 SCSI #3 ${FDRV%' '} has FAILED"
     fi
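
     The fixed column positions in the script above depend on the exact
     layout of the controller's output. A looser alternative, sketched
     here, is to pull out every token that looks like a disk name. This
     assumes failed members show up as whitespace-separated names of the
     form DISKnnn; failed_disks is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: failed_disks is a hypothetical helper. It assumes failed
# members appear as tokens like DISK250 in the "show failed" output.
failed_disks() {
    # Reads hszterm output on stdin, prints one DISKnnn name per line.
    # tr puts every whitespace-separated token on its own line.
    tr -s ' \t' '\n\n' | grep -E '^DISK[0-9]+$' | sort -u
}

# Example (not run here):
# hszterm -b2 -t5 -l0 "show failed" | failed_disks
```

     Because it prints one name per line, this also copes with more than
     one failed disk on the same controller.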
     
     **********************************************************************
     
> Q: Is there a way to run hszterm (non-interactive/via crontab entry)
> and dump the output of "SHOW THIS_CONTROLLER FULL".
     
     Yes, with hszterm. You'll need to install this subset:
     SWACLI11A installed HSZ40 Array Controller Utility (Alpha)
     
> It would be interesting to hear how other managers resolve this
> without using PolyCenter Console Manager type of software.
     
     We use PolyCenter Console Manager to retain console logs of our 7
     hsz's and our three primary systems... it's been invaluable for
     troubleshooting.
     
     Besides that, we have a nightly script poll the hsz's for changes and
     problems. I'll attach the script. It and some other tools can be
     obtained via anonymous ftp from
     raven.alaska.edu:/pub/sois/UA_DUtools.tar.Z (the script may invoke a
     ua* program for massaging data).
     
     Battery problems are (allegedly) effectively resolved by using the
     newer batteries. I have more information buried someplace if you
     need it; off the top of my head, use the EDI ones only (scrap the
     Hyundai). Also, if you run dual-redundant (v3.0 only, v2.7 doesn't
     cut it) you are protected: it will (allegedly) fail over on low
     battery in v3.0.

     kurt
     
     #!/bin/ksh
     #Copyright (c) 1996-1997 by University of Alaska Computer Network
     #
     #950120 hszterm.ksh gather hsz configuration, report changes
     #
     #970119 sxkac change alert address to sdsys (alias to systems folks)
     #960922 sxkac poll consoles separately; show raidset full
     #960730 sxkac 1r on spike and 3n on nugget
     #960511 sxkac modified reporting for hsz v2.7; deleted older history
     ###############################################################################
     # ALERT="sxkac "
     ALERT="sdsys "                  # mail addresses for reporting
     sanity="java"                   # sanity node for configuration copies
     if (test -z "$UA_Profile") then # has our profile executed?
       . ./.profile                  # nope, do it now (must be an rsh)
     fi
     cd $HOME/config                 # stick it in our config directory
     hostname=$(uname -n)
     hostname=${hostname%%.*}
     mv $hostname/hsz_*.* old
     ER_LOG="$hostname/hsz_term.errors"
     ###############################################################################
     function check                  # check if sts ok
     {
       echo "
     Check: $1"
       eval $1                       # execute command
       sts=$?                        # capture status

       if ((0 == $sts))              # command ok? return...
       then
         return;
       fi

       echo "Error($sts): $1"
       echo "Error($sts): $1" >> $ER_LOG
       return
     }
     #------------------------------------------------------------------------------
     function err_chk                # check if sts ok
     {
       echo "Error($sts): $1"
       echo "Error($sts): $1" >> $ER_LOG
       return
     }
     #------------------------------------------------------------------------------
     function get_hsz                # get hsz information
     {
       sudo hszterm -f /dev/${1} \
         "show devices full" > $hostname/hsz_${2}.devi
       sts=$?
       if ((0 != $sts)) then err_chk "hsz(${2}) show devices"; fi

       sudo hszterm -f /dev/${1} \
         "show units full" > $hostname/hsz_${2}.unit
       sts=$?
       if ((0 != $sts)) then err_chk "hsz(${2}) show units "; fi

       sudo hszterm -f /dev/${1} \
         "show raid full" > $hostname/hsz_${2}.raid
       sts=$?
       if ((0 != $sts)) then err_chk "hsz(${2}) show raid "; fi

       sudo hszterm -f /dev/${1} \
         "show mirror full" >> $hostname/hsz_${2}.raid
       sts=$?
       if ((0 != $sts)) then err_chk "hsz(${2}) show mirror"; fi

       sudo hszterm -f /dev/${1} \
         "show stripe full" >> $hostname/hsz_${2}.raid
       sts=$?
       if ((0 != $sts)) then err_chk "hsz(${2}) show stripe"; fi

       sudo hszterm -f /dev/${1} \
         "show this full" > $hostname/hsz_${2}.this
       sts=$?
       if ((0 != $sts)) then err_chk "hsz(${2}) show this "; fi

       if [ -z "$3" ]; then
         touch $hostname/hsz_${2}.othr
       else
         sudo hszterm -f /dev/${3} \
           "show this full" > $hostname/hsz_${2}.othr
         sts=$?
         if ((0 != $sts)) then err_chk "hsz(${2}) show other"; fi
       fi

       sudo hszterm -f /dev/${1} \
         "run fmu" "show last most" > $hostname/hsz_${2}.errs
       sts=$?
       if ((0 != $sts)) then err_chk "hsz(${2}) run fmu ..."; fi
     }
     #=================================
     function diffchk
     {
     # check two files for differences, based on: diffchk $(ls file.*)
     # exit 0 identical
     # exit 1 differences (if $OUT write $OUT.out; if $OLD mv file)
     #
       if ((2 != $#)) then
         echo "Error($#): Incorrect argument count: $1 $2 $3 ..."
         return 2
       fi
       if (test ! -z "$UAKDF") then  # UAKDF requested?
         DIFF="uakdf $UAKDF $2 $1"
       else
         DIFF="diff $2 $1"
       fi
       $DIFF
       sts=$?
       if ((0 == $sts)) then
         echo "Identical, $2 deleted and $1 retained."
         rm $2
         return 0
       fi
       if (test ! -z "$OUT") then $DIFF > $OUT.out ; fi
       if (test ! -z "$OLD") then mv $1 $OLD ; fi
       return 1
     }
     ###############################################################################
     # hszterm.ksh

     case "$hostname" in             # so where are we?

     glacier )
     get_hsz rrz60c 1f rrz58c # SW-1 Front HSZ40
     get_hsz rrz28c 2f rrz26c # SW-2 Front HSZ40
     
     check " rsh spike job/hszterm.ksh"
     check " rcp -p spike:config/spike/hsz_* $hostname"
     
     check " rsh nugget job/hszterm.ksh"
     check " rcp -p nugget:config/nugget/hsz_* $hostname"
     
     check " rcp -p $hostname/hsz_* ${sanity}:config/$hostname"
     ;;
     spike )
     get_hsz rrz17c 1r rrzd20c # SW-1 Rear HSZ40
     
     if [ -r $ER_LOG ]; then exit 1
     else exit 0
     fi
     ;;
     nugget )
     get_hsz rrz17c 3n # SW-3 n/a HSZ40
     
     if [ -r $ER_LOG ]; then exit 1
     else exit 0
     fi
     ;;
     * )
     echo "$hostname is not configured in this procedure." \
>> $ER_LOG
     ;;
     esac
     #------------------------------------------------------------------------------
     if [[ -r $ER_LOG ]]; then
     echo "
     Sending mail to: $ALERT " >> $ER_LOG
     
     cat $ER_LOG \
     | mailx -s "$hostname hszterm failed" $ALERT
     exit 0 # always exit successfully
     fi
     #------------------------------------------------------------------------------
     cd $HOME/config/$hostname       # change to our config directory
     
     stamp=hsz_$(date +%y%m%d)
     if [ -e $hostname/*$stamp ]; then # we've already run once today...
     stamp=hsz_$(date +%y%m%d%H%M%S)
     fi
     
     rm -f ../out/*hsz*.out ../out/*hsz*.msg
     OLD=$HOME/config/old
     unset UAKDF
     touch ../out/$stamp.msg
     
     echo "
     ______________________________________________________________________________
     Report hsz show device / unit
     "
     grep " disk " hsz_*.devi > x.0
     uakce -m75,84,22 x.0 -o x.1
     grep " D" hsz_*.unit > x.0
     uakce -m58,67,22 x.0 -o x.2
     sort -k1.5,1.32 -o hsz.$stamp x.1 x.2
     rm x.*
     OUT=../out/hsz_sum_$stamp
     UAKDF="-c5,32,18 -v"
     diffchk $(ls hsz.*)
     if ((1 == $?)) then
     echo "
     HSZ changes:
     === =======" >> ../out/$stamp.msg
     cat $OUT.out >> ../out/$stamp.msg
     fi
     echo "
     ______________________________________________________________________________
     Report hsz fmu show last
     "
     grep -ve 'HSZ>
     for help.
     Copyright' hsz_*.errs > hszerr.$stamp
     unset UAKDF
     unset OUT
     
     diffchk $(ls hszerr.*)
     if ((1 == $?)) then
     echo "
     HSZ errors:
     === ======" >> ../out/$stamp.msg
     cat $(ls hszerr.*) >> ../out/$stamp.msg
     touch ../out/hszerr_$stamp.out
     fi
     echo "
     ______________________________________________________________________________
     Report hsz show this & show other
     "
     grep -ve 'Time:
     flushed data in cache' hsz_*.this hsz_*.othr > hszthis.$stamp
     
     unset UAKDF
     OUT=../out/hszthis_$stamp
     
     diffchk $(ls hszthis.*)
     if ((1 == $?)) then
     echo "
     HSZ show_this:
     === =========" >> ../out/$stamp.msg
     cat $OUT.out >> ../out/$stamp.msg
     fi
     
     echo "
     ______________________________________________________________________________
     "
     ls ../out/*hsz*.out
     if ((0 == $?))
     then
     echo "Sending mail to: $ALERT"
     cat ../out/$stamp.msg \
     | mailx -s "$hostname hsz40 config changes" $ALERT
     cat ../out/$stamp.msg
     else
     echo "There were NO configuration changes found."
     fi
     ###############################################################################
     exit 0 # always exit successfully
     
     __________________________________________________________________
     Kurt Carlson, University of Alaska, (907)474-6266 sxkac_at_alaska.edu
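
     The core change-detection idea in the script above (drop the lines
     that differ on every run, then diff against the previous capture)
     can be sketched in a few lines without the ua* helpers. This is a
     simplified illustration, not the UA tools' own method;
     snapshot_changed is a hypothetical name, plain diff stands in for
     uakdf, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Sketch only: snapshot_changed is a hypothetical helper, not part of
# the UA tools. Compares two saved "show this full" captures after
# dropping lines that change on every run (clock, cache flush state).
snapshot_changed() {
    # $1 = previous capture, $2 = new capture
    # Prints the diff and returns 1 if they differ; returns 0 if identical.
    _old=$(mktemp) && _new=$(mktemp) || return 2
    awk '!/Time:/ && !/flushed data in cache/' "$1" > "$_old"
    awk '!/Time:/ && !/flushed data in cache/' "$2" > "$_new"
    _rc=0
    diff "$_old" "$_new" || _rc=$?
    rm -f "$_old" "$_new"
    return $_rc
}

# Example (not run here; file names and recipient are assumptions):
# if ! snapshot_changed old/hsz_1f.this hsz_1f.this > changes.txt; then
#     mailx -s "hsz40 config changes" root < changes.txt
# fi
```

     Filtering before the diff keeps the nightly mail quiet unless
     something other than the clock or the cache flush state changed.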
     
Received on Tue Mar 25 1997 - 23:46:11 NZST
