advfs problem after power interruption

From: Charles Vachon <cvachon_at_mrn.gouv.qc.ca>
Date: Wed, 22 Apr 1998 15:16:59 -0400

Hello DU-admins,

First, the environment: AlphaServer 2100 5/300 model 600, DU 4.0B
(unpatched, I believe), 2 PCI SWXCR RAID controllers.

The problem:

Yesterday, an operator accidentally cut the power to half of our BA350
shelves while the system was running. Seeing what he had just done, he
plugged the master power cord straight back in... Not a good idea, but
that's what he did!

Following this, on each of the two RAID controllers, all the disks on one
channel were marked FAILED (among them the disks holding the root
partition), so the system would not even boot.

I fired up the SWXCRMGR utility from the ARC console and managed to get the
failed disks back to OPTIMAL by restoring the configuration to the
controllers from a file on diskette. Yes, it was the exact same
configuration; nothing was mixed up.

While booting the system back up, we got a panic while the AdvFS partitions
were being mounted, which rebooted the server. I booted into single-user
mode and began mounting the AdvFS filesets manually, one by one. I quickly
found the problematic fileset: as soon as I tried to mount it, the system
panicked again with the following message:

ADVFS EXCEPTION
module=ftx_recovery.c, line=600
ftx_recovery_pass: bad log sts
N1=0
panic (cpu 0): ftx_recovery_pass: bad log sts
N1=0
syncing disks... done
DUMP: No primary swap, no explicit dumpdev.
      Nowhere to put header, giving up.
halted CPU 1
halted CPU 2

halted CPU 0
halt code=5
HALT: instruction executed
PC=fffffc0000505300

cpu 0 booting
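
For the record, the mounting-by-hand went roughly like this. The data
domain, fileset and mount point names below are stand-ins for our real
ones, and only the last mount triggered the panic:

    mount -u /                                    # remount root read-write
    mount -t advfs usr_domain#usr /usr            # standard filesets first
    showfsets data_dom                            # list the filesets in a data domain
    mount -t advfs data_dom#arcinfo /mnt/arcinfo  # this one panicked the machine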

Since this fileset was contained on a RAID 5 logical group, I thought
running a parity check on the group could correct errors and get things
back to normal. So I ran the check on the logical group. Errors were found
and successfully corrected, but trying to mount the offending fileset still
rebooted the server.

Whatever AdvFS command I tried on the bad file_domain#file_set ended with
the same ADVFS EXCEPTION: mount -t advfs, showfdmn, showfsets, verify.
Only advscan could be run without rebooting the server; it located the
AdvFS partition on the faulty logical group and behaved normally. I also
tried writing a new disklabel to the logical drive
(disklabel -wr /dev/rre19c SWXCR), to no avail.
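
To make the failing sequence concrete, this is roughly what I typed
(gis_dom#arcinfo stands in for the real domain#fileset, re19 is the RAID 5
logical drive, and the advscan argument form is from memory):

    mount -t advfs gis_dom#arcinfo /mnt/arcinfo   # panic
    showfdmn gis_dom                              # panic
    showfsets gis_dom                             # panic
    verify gis_dom                                # panic
    advscan re19                                  # ran fine, found the AdvFS partition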

Seeing no quick solution to this problem, I resorted to erasing the faulty
file domain and re-creating it, followed by a restore from the previous
night's backup (8 GB of ARC/Info data files, and half a day's worth of work
lost).
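
For completeness, the rebuild amounted to something like the following. The
names are again stand-ins, and I am quoting the vrestore invocation from
memory (we restore from a vdump tape; adjust for your own backup tool):

    rmfdmn gis_dom                                # discard the corrupt file domain
    mkfdmn /dev/re19c gis_dom                     # recreate it on the RAID 5 logical drive
    mkfset gis_dom arcinfo                        # recreate the fileset
    mount -t advfs gis_dom#arcinfo /mnt/arcinfo
    vrestore -xf /dev/nrmt0h -D /mnt/arcinfo      # pull back the previous night's vdump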

This unpleasant experience raises questions:

- Could I have done something else to repair the corrupt file domain?

- Was setting failed drives back to optimal status a good idea? What
alternative did I have?

- Is it normal that a problem with an AdvFS fileset unrelated to the OS
itself (e.g. one containing neither the root partition, nor /usr, nor a
swap partition) can bring down the whole system?

Thanks in advance for any comments, suggestions, shared experiences.



--
Charles Vachon -- System Administrator
Fonds de la réforme cadastrale du Québec
Ministère des Ressources Naturelles du Québec
cvachon_at_mrn.gouv.qc.ca
Received on Wed Apr 22 1998 - 21:40:56 NZST
