Since several users requested it, here is an updated summary containing the
complete procedure:
a) Problem:
Power failure longer than the connected UPS could stand - 30 minutes.
After 20 minutes, the UPS software initiated a shutdown; unfortunately,
the shutdown was not completed and at that very moment the routine
mirroring of the main and backup raidsets was running ... We ended up
with a server that came up without a problem but was unable to find the
external raidshelf (SW300 containing 2x HSZ40 dual redundant and 3
raidsets with 6 disks and 6 hot spares, each raidset one unit: main
D100, mirror data D200, mirror web D300). HSZ lights were showing
operative condition: channel LEDs off, reset light blinking.
b) Diagnosis:
Tried to mount the main unit manually - failed. Checked /etc/fdmns -
domains missing. Checked the /dev/rrz17 - rrz19 device files - ok.
Tried to connect to the HSZ using hszterm -f /dev/rrz17g - failed.
Connected a notebook to the serial port of the HSZ40.
SHOW THIS revealed:
This controller has an invalid cache module
Controllers misconfigured. Type SHOW THIS_CONTROLLER
Power Supply failure cleared.
Invalid cache -- CLI command set reduced. Type SHOW THIS_CONTROLLER.
Please - see user guide to determine corrective action
SHOW OTHER showed ok.
The user guide (order no. EK-HSFAM-SV.D01, Rev. Firmware 2.5) suggests:
CLEAR_ERRORS INVALID_CACHE controller
Tried this, but in vain. Desperation. UARRRRGH!
Switching to offsite mirror, posting call for assistance to
tru64-unix-managers_at_ornl.gov, hopping around madly, lighting a candle,
praying.
Answer by Phil Baldwin showed that the syntax suggested by the user
guide was simply wrong. The correct syntax was:
CLEAR_ERRORS controller INVALID_CACHE [destroy_unflushed_data] or
[nodestroy_unflushed_data]
Applying this - ok.
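The corrected argument order can be captured in a small helper, sketched
here in Python. It only builds the command string to type at the HSZ
console; it does not talk to the controller. The function name and the
accepted controller keywords (THIS_CONTROLLER, OTHER_CONTROLLER) are
assumptions for illustration, not something taken from the user guide.

```python
# Sketch only: builds the CLI string in the corrected argument order.
# build_clear_invalid_cache is a hypothetical helper name.

VALID_CONTROLLERS = {"THIS_CONTROLLER", "OTHER_CONTROLLER"}  # assumed keywords
VALID_POLICIES = {"DESTROY_UNFLUSHED_DATA", "NODESTROY_UNFLUSHED_DATA"}


def build_clear_invalid_cache(controller, policy):
    """Return 'CLEAR_ERRORS <controller> INVALID_CACHE <policy>'."""
    controller = controller.upper()
    policy = policy.upper()
    if controller not in VALID_CONTROLLERS:
        raise ValueError("unknown controller keyword: " + controller)
    if policy not in VALID_POLICIES:
        raise ValueError("unknown data policy: " + policy)
    # Controller comes first, then INVALID_CACHE, then the data policy.
    return "CLEAR_ERRORS {} INVALID_CACHE {}".format(controller, policy)


if __name__ == "__main__":
    print(build_clear_invalid_cache("this_controller",
                                    "nodestroy_unflushed_data"))
```

Forcing the data policy to be spelled out makes the choice between
destroying and keeping unflushed cache data a deliberate one.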
Connected the notebook to the defective controller (!!!), issued SET THIS
NOFAILOVER and afterwards SET FAILOVER COPY=OTHER. (Dangerous -
don't confuse the controllers here - COPY=[SOURCE] !!!)
Ok, controllers back online.
Show raid full: ok.
Show units full:
LUN Uses
--------------------------------------------------------------
D100 R1
Switches:
RUN NOWRITE_PROTECT READ_CACHE
WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
State:
INOPERATIVE
Unit has lost data
PREFERRED_PATH = THIS_CONTROLLER
WRITE_PROTECT - DATA SAFETY
Size: 41879900 blocks
D200 R2
Switches:
RUN NOWRITE_PROTECT READ_CACHE
WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
State:
INOPERATIVE
Unit has lost data
PREFERRED_PATH = THIS_CONTROLLER
WRITE_PROTECT - DATA SAFETY
Size: 20539825 blocks
D300 R3
Switches:
RUN NOWRITE_PROTECT READ_CACHE
WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
State:
INOPERATIVE
Unit has lost data
PREFERRED_PATH = THIS_CONTROLLER
WRITE_PROTECT - DATA SAFETY
Size: 20539825 blocks
Cache battery charge is low
OK, have to bring the units to operative state again.
Solution:
CLEAR_ERRORS LOST_DATA unit-number
brought them back to operative state.
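With more than a handful of units, picking the affected ones out of the
SHOW UNITS FULL text by eye gets tedious. The following Python sketch
assumes the console output has been captured to a string (for example from
a terminal log), finds the units reported as INOPERATIVE with "Unit has
lost data", and prints one CLEAR_ERRORS line per unit in the argument order
used above. The function names are hypothetical, and the ONLINE state in
the sample text is invented for contrast; nothing here sends commands to
the controller.

```python
import re

# Matches a unit header line such as "D100 R1" (unit name, then what it uses).
UNIT_LINE = re.compile(r"^(D\d+)\s+(\S+)\s*$")


def find_lost_data_units(show_units_text):
    """Return names of units that are INOPERATIVE and report lost data."""
    units = {}
    current = None
    for raw in show_units_text.splitlines():
        line = raw.strip()
        match = UNIT_LINE.match(line)
        if match:
            # Start collecting the lines that belong to this unit.
            current = match.group(1)
            units[current] = set()
        elif current is not None and line:
            units[current].add(line)
    return [
        name for name, lines in units.items()
        if "INOPERATIVE" in lines and "Unit has lost data" in lines
    ]


def clear_lost_data_commands(unit_names):
    """Build one CLEAR_ERRORS line per unit, argument order as in the post."""
    return ["CLEAR_ERRORS LOST_DATA " + name for name in unit_names]


# Shortened sample; the ONLINE state for D200 is an invented example.
SAMPLE = """\
D100 R1
State:
INOPERATIVE
Unit has lost data
D200 R2
State:
ONLINE
"""

if __name__ == "__main__":
    for cmd in clear_lost_data_commands(find_lost_data_units(SAMPLE)):
        print(cmd)
```

The generated lines are meant to be reviewed and typed (or pasted) at the
HSZ console by hand, one unit at a time.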
All data and all sets ok. No further problems.
I still have to figure out the problem with the power-fail shutdown script
- I guess the system should come back up in a stable condition after a
shutdown initiated by xpowerchute.
Thanks to all who replied and helped!
CW
Received on Mon Dec 20 2004 - 07:22:00 NZDT