SUMMARY / UPDATE : HSZ 40 / DEC Alpha Cluster / Problem after power failure from Christian Wessely on 2004-12-20 (tru64-unix-managers)

From: Christian Wessely <christian.wessely_at_uni-graz.at>
Date: Mon, 20 Dec 2004 08:19:43 +0100

Since several users requested it, here is an updated summary containing
the complete procedure:

a) Problem:
Power failure longer than the connected UPS could stand - 30 minutes.
After 20 minutes, the UPS software initiated a shutdown; unfortunately,
the shutdown was not completed and at that very moment the routine
mirroring of the main and backup raidsets was running ... We ended up
with a server that came up without a problem but was unable to find the
external raidshelf (SW300 containing 2x HSZ40 dual redundant and 3
raidsets with 6 disks and 6 hot spares, each raidset one unit: main
D100, mirror data D200, mirror web D300). HSZ lights were showing
operative condition: channel leds off, reset light blinking.

b) Diagnosis:
Tried to mount the main unit manually - failed. Checked /etc/fdmns -
domains missing. Checked the /dev/rrz17 - rrz19 device files - ok.
Tried to connect to the HSZ using hszterm -f /dev/rrz17g - failed.

Connected a notebook to the serial port of the HSZ40.
SHOW THIS revealed:
This controller has an invalid cache module
Controllers misconfigured. Type SHOW THIS_CONTROLLER
Power Supply failure cleared.
Invalid cache -- CLI command set reduced. Type SHOW THIS_CONTROLLER.
Please - see user guide to determine corrective action

SHOW OTHER showed ok.

The user guide (order no. EK-HSFAM-SV.D01, Rev. Firmware 2.5) suggests:

CLEAR_ERRORS INVALID_CACHE controller

Tried this, but in vain. Desperation. UARRRRGH!
Switching to offsite mirror, posting call for assistance to
tru64-unix-managers_at_ornl.gov, hopping around madly, lighting a candle,
praying.
An answer from Phil Baldwin showed that the syntax suggested by the user
guide was simply wrong. The correct syntax was:

CLEAR_ERRORS controller INVALID_CACHE [destroy_unflushed_data]
or
CLEAR_ERRORS controller INVALID_CACHE [nodestroy_unflushed_data]

Applied this - ok.
With the notebook connected to the defective controller (!!!), issued
SET THIS NOFAILOVER and afterwards SET FAILOVER COPY=OTHER. (Dangerous -
don't confuse the controllers here: COPY= names the SOURCE controller,
i.e. the one whose configuration gets copied!!!)

OK, controllers back online.
SHOW RAID FULL: ok.
SHOW UNITS FULL:
    LUN Uses
--------------------------------------------------------------
   D100 R1
         Switches:
           RUN NOWRITE_PROTECT READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 41879900 blocks
   D200 R2
         Switches:
           RUN NOWRITE_PROTECT READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 20539825 blocks
   D300 R3
         Switches:
           RUN NOWRITE_PROTECT READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 20539825 blocks
Cache battery charge is low
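
For anyone who captures such a listing from the serial console and wants
to check it by script, here is a minimal Python sketch that picks out the
units reported as INOPERATIVE. It assumes the layout shown above (a unit
line such as "D100 R1", followed by "Switches:", "State:" and "Size:"
sections); the function name and the parsing rules are my own and not
part of any HSZ tool, and other firmware revisions may print the listing
differently.

```python
import re

# First two units copied from the listing above; D300 is left out for brevity.
SAMPLE = """\
    LUN Uses
--------------------------------------------------------------
   D100 R1
         Switches:
           RUN NOWRITE_PROTECT READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 41879900 blocks
   D200 R2
         Switches:
           RUN NOWRITE_PROTECT READ_CACHE
           WRITEBACK_CACHE
           MAXIMUM_CACHED_TRANSFER_SIZE = 32
         State:
           INOPERATIVE
           Unit has lost data
           PREFERRED_PATH = THIS_CONTROLLER
           WRITE_PROTECT - DATA SAFETY
         Size: 20539825 blocks
Cache battery charge is low
"""


def inoperative_units(listing: str) -> dict[str, list[str]]:
    """Return {unit: state lines} for every unit whose State section
    contains INOPERATIVE (hypothetical helper, layout assumed as above)."""
    states: dict[str, list[str]] = {}
    current = None      # unit whose block is being read
    in_state = False    # True while inside a "State:" section
    for raw in listing.splitlines():
        line = raw.strip()
        if re.match(r"^D\d+\s+\S+$", line):
            # Unit header such as "D100 R1" starts a new block.
            current = line.split()[0]
            states[current] = []
            in_state = False
        elif line == "State:":
            in_state = True
        elif line == "Switches:" or line.startswith("Size:"):
            in_state = False
        elif in_state and current is not None:
            states[current].append(line)
    return {u: s for u, s in states.items() if "INOPERATIVE" in s}


result = inoperative_units(SAMPLE)
for unit, state in result.items():
    print(unit, "->", "; ".join(state))
```

The sketch only reads text; clearing the LOST_DATA condition still has to
be done by hand at the controller prompt.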

OK, have to bring the units to operative state again.
Solution:
CLEAR_ERRORS LOST_DATA unit-number

brought them back to operative state.
All data and all sets ok. No further problems.

Have to figure out the problem with the powerfail shutdown script anyway
- I guess the system should come back up in stable condition after the
shutdown initiated by xpowerchute.

Thanks to all who replied and helped!
CW
Received on Mon Dec 20 2004 - 07:22:00 NZDT
