Since several users requested it, here is an updated summary containing 
the complete procedure:
a) Problem:
A power failure lasted longer than the connected UPS could sustain (30 minutes). 
After 20 minutes, the UPS software initiated a shutdown; unfortunately, 
the shutdown was not completed and at that very moment the routine 
mirroring of the main and backup raidsets was running ... We ended up 
with a server that came up without a problem but was unable to find the 
external raidshelf (SW300 containing 2x HSZ40 dual redundant and 3 
raidsets with 6 disks and 6 hot spares, each raidset one unit: main 
D100, mirror data D200, mirror web D300). The HSZ lights were showing 
operative condition: channel LEDs off, reset light blinking.
b) Diagnosis:
Tried to mount the main unit manually - failed. Checked /etc/fdmns - 
domains missing. Checked the /dev/rrz17 to /dev/rrz19 device files - ok.
Tried to connect to the HSZ using hszterm -f /dev/rrz17g - failed.
Connected a notebook to the serial port of the HSZ40.
SHOW THIS revealed
This controller has an invalid cache module
Controllers misconfigured.  Type SHOW THIS_CONTROLLER
Power Supply failure cleared.
Invalid cache -- CLI command set reduced.  Type SHOW THIS_CONTROLLER. 
Please - see user guide to determine corrective action
SHOW OTHER showed ok.
The user guide (order nr. EK-HSFAM-SV.D01, Rev. Firmware 2.5) suggests:
CLEAR_ERRORS INVALID_CACHE controller
Tried this, but in vain. Desperation. UARRRRGH!
Switching to offsite mirror, posting call for assistance to 
tru64-unix-managers_at_ornl.gov, hopping around madly, lighting a candle, 
praying.
Answer by Phil Baldwin showed that the syntax suggested by the user 
guide was simply wrong. The correct syntax was:
CLEAR_ERRORS controller INVALID_CACHE [destroy_unflushed_data] or 
[nodestroy_unflushed_data]
Applying this - ok.
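For anyone scripting this, the corrected argument order can be pinned down in a small helper so that the old order from the user guide cannot sneak back in. This is only a sketch: the function name is made up, OTHER_CONTROLLER is assumed as the counterpart of THIS_CONTROLLER, and the default policy is my choice, not a recommendation from the guide. Verify against the CLI help of your firmware revision.

```python
# Sketch only: builds the command string in the corrected order
#   CLEAR_ERRORS controller INVALID_CACHE data-policy
# The helper name and the default policy are assumptions for illustration.

CONTROLLERS = ("THIS_CONTROLLER", "OTHER_CONTROLLER")  # OTHER_CONTROLLER assumed
POLICIES = ("DESTROY_UNFLUSHED_DATA", "NODESTROY_UNFLUSHED_DATA")


def clear_invalid_cache(controller, policy="NODESTROY_UNFLUSHED_DATA"):
    """Return the CLI line; raise ValueError on an unknown argument."""
    controller = controller.upper()
    policy = policy.upper()
    if controller not in CONTROLLERS:
        raise ValueError("unknown controller: " + controller)
    if policy not in POLICIES:
        raise ValueError("unknown policy: " + policy)
    # Controller first, then INVALID_CACHE, then the data policy.
    return f"CLEAR_ERRORS {controller} INVALID_CACHE {policy}"


if __name__ == "__main__":
    print(clear_invalid_cache("this_controller"))
```

The helper only builds the string; it does not talk to the controller, so the line still has to be typed or pasted at the serial console.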
Connected the notebook to the defective controller (!!!), issued SET THIS 
NOFAILOVER and afterwards SET FAILOVER COPY=OTHER. (Dangerous - 
don't confuse the controllers here - COPY= names the SOURCE of the 
configuration !!!)
Ok, controllers back online.
Show raid full: ok.
Show units full:
    LUN                                      Uses
--------------------------------------------------------------
   D100                                       R1
         Switches:
           RUN                    NOWRITE_PROTECT        READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 41879900 blocks
   D200                                       R2
         Switches:
           RUN                    NOWRITE_PROTECT        READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 20539825 blocks
   D300                                       R3
         Switches:
           RUN                    NOWRITE_PROTECT        READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 20539825 blocks
Cache battery charge is low
OK, have to bring the units to operative state again.
Solution:
CLEAR_ERRORS LOST_DATA unit-number
brought them back to operative state.
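With more than a handful of units it may be easier to capture the SHOW UNITS FULL output from the serial session and let a script list the affected units. A minimal sketch, assuming the output layout shown above; the function names are made up, the sample is an abbreviated excerpt (D200 is shown without the lost-data line purely for contrast), and the command form follows the one given in this post.

```python
import re

# Abbreviated sample in the layout of the SHOW UNITS FULL output above.
# D200 is deliberately shown WITHOUT the lost-data line (illustration only).
SAMPLE = """\
    LUN                                      Uses
--------------------------------------------------------------
   D100                                       R1
         State:
           INOPERATIVE
           Unit has lost data
   D200                                       R2
         State:
           PREFERRED_PATH = THIS_CONTROLLER
"""

# A unit header line looks like "   D100      R1": unit name, then container.
UNIT_LINE = re.compile(r"^\s*(D\d+)\s+\S+\s*$")


def units_with_lost_data(show_units_text):
    """Return unit names whose block contains 'Unit has lost data'."""
    units = []
    current = None
    for line in show_units_text.splitlines():
        match = UNIT_LINE.match(line)
        if match:
            current = match.group(1)
        elif current and "Unit has lost data" in line:
            if current not in units:
                units.append(current)
    return units


def clear_commands(units):
    """One CLEAR_ERRORS LOST_DATA line per unit, in the form used above."""
    return [f"CLEAR_ERRORS LOST_DATA {unit}" for unit in units]


if __name__ == "__main__":
    for command in clear_commands(units_with_lost_data(SAMPLE)):
        print(command)
```

The script only prints the command lines; check each one against the actual unit state before typing it at the HSZ prompt.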
All data and all sets ok. No further problems.
Have to figure out the problem with the powerfail shutdown script anyway 
- I guess the system should come back up in stable condition after the 
shutdown initiated by xpowerchute.
Thanks to all who replied and helped!
CW
Received on Mon Dec 20 2004 - 07:22:00 NZDT