The following people sent me insights and information on this one (the
original post is at the end of this message). Thanks for their replies:
alan_at_nabeth.cxo.dec.com
Ronald D. Bowman <rdbowma_at_tsi.clemson.edu>
Fliguer, Miguel <M_Fliguer_at_miniphone.com.ar>
Alan had the most complete answer. Here it is:
re: Repair
Not likely. There is a salvage utility for AdvFS that may
be able to recover most data. Beyond that the repair
utilities aren't up to what fsck can do on UFS. There
are probably still cases when the only thing that can be
done is recover from a backup.
re: setting devices optimal.
Probably a bad idea, but the controller and software should
have come with documentation which should describe how safe
it is. RAID-5 can survive the loss of one member per array.
It can't survive more. If the power loss came during the middle
of an I/O or before any cache was flushed to disk, then the
array is inconsistent. The problem with such inconsistencies
is that you can't tell whether it is the data that is wrong
or the parity. If the parity is blindly repaired without
knowing, then all you've done is make the parity consistent
with potentially bad data.
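
To illustrate the parity point above, here is a minimal sketch in Python.
It assumes a toy stripe of three data blocks plus one XOR parity block; the
block contents and the helper name xor_blocks are made up for illustration,
and a real controller such as the SWXCR works on much larger stripes and
may well behave differently in its details.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A toy stripe: three data blocks plus one parity block (hypothetical values).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Losing ONE member is survivable: XOR the surviving blocks with the parity.
rebuilt = xor_blocks([data[0], data[2], parity])

# Interrupted write: new contents for block 1 only partly reached the disk,
# and the parity block was never updated.
torn = [data[0], b"XXBB", data[2]]
stripe_is_consistent = xor_blocks(torn) == parity

# The mismatch does not say WHICH block is wrong. A blind "repair" simply
# recomputes the parity from whatever the data disks hold now.
repaired_parity = xor_blocks(torn)
now_consistent = xor_blocks(torn) == repaired_parity

print("rebuilt lost member correctly:", rebuilt == data[1])
print("consistent after torn write:  ", stripe_is_consistent)
print("consistent after blind repair:", now_consistent)
print("block 1 still damaged:        ", torn[1] != data[1])
```

After the blind repair the stripe passes a parity check again, yet block 1
still holds the half-written contents, which is why a successful parity
repair does not guarantee that the file system on top is intact.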
re: Panic the system.
Early versions of Digital UNIX panicked the whole system
when there was a file system inconsistency. For the most
part this was fixed in V4.0 with only the particular domain
being panicked. This shouldn't take down the whole system,
but there may be panics that were missed or ones that have
damaged AdvFS enough to warrant taking everything down.
**********************
Miguel shares his experience on ADVFS crashes:
There are rumors about a new utility in 4.0D called 'salvage'
that could be useful to recover severely damaged advfs
domains. I haven't been able to try it, since I don't have 4.0D yet
(and I don't know if it will work on 4.0B).
FWIW, last week we had a similar occurrence of a power
outage on half the disks of an 8400 (they were on a separate
storage cabinet on a separate HSZ50...). The system panicked.
This was under 3.2G, so I expected BIG trouble. To my surprise,
after rebooting everything went fine, even our Ingres database,
(whose data files were on the affected arrays). I guess
we were lucky ...
May your power be stable from now on...;-)
Regards,
*********************
The bottom line is: while ADVFS offers faster crash recovery than UFS,
there are crash conditions that may render ADVFS file systems unusable,
with few (if any) tools to help recover data. Keep those disk arrays
POWERED UP while the system is running!
***********ORIGINAL POST FOLLOWS*******************
Hello DU-admins,
First, the environment: AlphaServer 2100 5/300 model 600, DU 4.0b
(unpatched, I believe), 2 PCI SWXCR RAID controllers.
The problem:
Yesterday, an operator accidentally cut the power off to half of our
BA0-350's, while the system was working. Seeing what he had just done, he
plugged the master switch power cord back in... Not a good idea, but that's
what he did!
Following this, on each of the two RAID controllers, all the disks on one
channel were marked failed (among them the disks holding the root
partition), so the system would not even boot.
I fired up the SWXCRMGR utility from the ARC console, and managed to get the
failed disks OPTIMAL again, by restoring the configuration to the
controller from a file on diskette. Yes, it was the exact config, no
messing things up.
While booting the system back to normal, we experienced a panic while
mounting advfs partitions, with the effect of rebooting the server. I
booted into single-user mode, and began mounting advfs filesets manually,
one by one. I quickly found the problematic fileset while mounting it, when
the system panicked again, with the following message:
ADVFS EXCEPTION
module=ftx_recovery.c, line=600
ftx_recovery_pass: bad log sts
N1=0
panic (cpu 0): ftx_recovery_pass: bad log sts
N1=0
syncing disks... done
DUMP: No primary swap, no explicit dumpdev.
Nowhere to put header, giving up.
halted CPU 1
halted CPU 2
halted CPU 0
halt code=5
HALT: instruction executed
PC=fffffc0000505300
cpu 0 booting
Since this fileset was contained on a RAID 5 logical group, I thought
running a parity check on the group could correct errors and get things
back to normal. So I checked the logical group. Errors were found and
successfully corrected, but trying to mount the offending fileset still
rebooted the server.
Whatever ADVFS command I tried on the bad file_domain#file_set ended with
this same ADVFS EXCEPTION: mount -t advfs, showfdmn, showfsets, verify.
Only advscan could be used without the server rebooting. It was able
to locate the ADVFS partition on the faulty logical group and worked as
normal. I also tried writing a new disklabel to the logical drive
(disklabel -wr /dev/rre19c SWXCR), to no avail.
Seeing no quick solution to this problem, I resorted to erasing the faulty
file domain and re-creating it, followed by a restore from the previous
night's backup (8 GB of ARC/Info data files, and half a day's worth of work lost).
This unpleasant experience raises questions:
- Could I have done something else to repair the corrupt file domain?
- Was setting failed drives back to optimal status a good idea? What
alternative did I have?
- Is it normal that a problem with an ADVFS fileset unrelated to the OS
itself (e.g. not containing the root partition, nor /usr, nor a swap
partition) can bring down the whole system?
Thanks in advance for any comments, suggestions, shared experiences.
--
Charles Vachon -- Administrateur de système
Fonds de la réforme cadastrale du Québec
Ministère des Ressources Naturelles du Québec
cvachon_at_mrn.gouv.qc.ca
Received on Mon Apr 27 1998 - 15:55:30 NZST