Hello Alpha Managers,
What have I done wrong??
We have an AlphaServer 1000 running V3.2C with a KZPSA 3-channel RAID controller,
configured with the following RAID groups:
1) RAID 5 - 7 x RZ29VW
2) RAID 5 - 3 x RZ29VW
3) RAID 0 - 2 x RZ28VA
The RAID 0 set is the system disk. All filesystems are UFS and Write Through.
We are about to have a disk re-organisation and I have 2 RZ28VW that I
thought I would use to replace the RZ28VA disks. I tried to carry out
a replacement with the system up as follows (I should add that there
was heavy I/O on the 2 RAID 5 sets and on the system's other 2 SCSI
busses throughout this, but little activity on the RAID 0):
Added a new RZ28VW to the StorageWorks shelf
Used the online GUI utility to define it as a hot spare
Marked one of the RZ28VA as FAILED
Removed the "Failed" drive from the shelf expecting it to be rebuilt
onto the RZ28VW
However got a mail that disk had failed and was NOT being rebuilt on Hot spare -
no reason given ( Fault management is enabled )
Decided not to risk things further - would put things back as they were
until I could shut the machine down.
Tried to unmark the RZ28VW as the hot spare - entire system froze.
At the console ( A VT220) saw a stream of XCR I/O error messages and
NO disk activity.
Tried reset button - caused system to panic - but it was unable to write a dump
anywhere. On booting, it hung at "waiting for dra0.0.0.13.0 to poll".
Power cycled system - ditto
Power cycled system and disks - booted OK.
Stupidly again tried to unmark Hot Spare - went round the above loop again!!
Oh well, leave this hot spare alone, just put the other RZ28VA back in
and let it rebuild. So put the disk in but no rebuild started (possibly
did not wait long enough?). Fired up the online utility and manually
started the rebuild. Disk activity showed the rebuild was now going.
5 mins later the system froze in the same way as before. However, both system
disks now had their failed lights on and the reboot hung after
CPU 0 booting
Cycling power did not help until I cycled the disks' power while the
server was hung in this state. Urggh :-(
Now able to get to >>> prompt
Fired up standalone SWXCRMGR from the VT220 - but found that although
the cursor keys worked in the main menu they did not on the small
confirmation YES/NO menu, so I could not use the utility.
Tried SRLMGR but it was the same.
Went and found a keyboard and monitor and used the system console to run SWXCRMGR,
marked the good system disk as OPTIMAL and was able to reboot the system.
So what went wrong? How can I avoid it happening again? How do I safely rebuild
my still "Failed" disk?
Is it really only safe to do a rebuild on an idle system, or on one that is
down, using the standalone utility?
Any Help Welcome. I will open a call with DEC tomorrow.
Thanks
Tim.
Tim Janes | e-mail : janes_at_signal.dra.hmg.gb
Defence Research Agency | tel : +44 1684 894100
Malvern Worcs | fax : +44 1684 895103
Gt Britain | #include <std/disclaim.h>
Received on Wed Sep 11 1996 - 00:33:43 NZST