Greetings. We have an Alpha 4100 running DU4.0D, attached to an ESA10000
containing several RAID5 logical disks. These disks are configured as
AdvFS domains to give us the benefit of fast recovery at reboot, since
the largest logical disk is 20GB.
Last night a disk in the 20GB raidset failed and was replaced by a hot
spare, and the raidset was rebuilt. From a hardware perspective, all appears
to be fine. However, while the raidset was rebuilding, AdvFS logged 8 I/O
errors (a mix of reads and writes) over a period of 3 seconds (!!!) and then
failed with the errors:
Mar 10 18:26:01 myhost vmunix: bs_osf_complete: metadata write failed
Mar 10 18:26:01 myhost vmunix: AdvFS Domain Panic; Domain u05_dmn Id 0x363085d2.000940b8
Mar 10 18:26:01 myhost vmunix: An AdvFS domain panic has occurred due to
either a metadata write error or an internal inconsistency. This domain is
being rendered inaccessible.
The logical disk that failed held an Oracle database, and Oracle promptly
crashed. We didn't know anything about it until this morning, which is a
whole other problem. When I came in, "df" showed the disk, but any attempt
to "cd" to it or run "ls" on it returned "no such device" - NOT a good
thing. However, "hszterm" showed the unit as "normal", not "reduced" or
"recovering", so the hardware was fine.
I had to stop Oracle, unmount the disk, and remount it, and we were then
able to restart Oracle.
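Since the panic went unnoticed overnight, a small log check would at least surface it sooner. What follows is only a minimal sketch in Python, assuming the panic message appears in the system log in the format quoted above and that each message sits on a single line. The function name find_domain_panics and the pattern are illustrative and not part of any AdvFS tooling.

```python
import re

# Pattern for the panic line quoted above, e.g.
# "... vmunix: AdvFS Domain Panic; Domain u05_dmn Id 0x363085d2.000940b8"
# (assumption: the domain name contains no whitespace and the Id is
# two hex fields separated by a dot, as in the quoted message)
PANIC_RE = re.compile(
    r"AdvFS Domain Panic; Domain (?P<domain>\S+) "
    r"Id (?P<id>0x[0-9a-fA-F]+\.[0-9a-fA-F]+)"
)


def find_domain_panics(lines):
    """Return a list of (domain, id) tuples, one per panic line found."""
    hits = []
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            hits.append((match.group("domain"), match.group("id")))
    return hits
```

Run periodically against the open log file (for example from cron), any non-empty result could be mailed or paged to whoever is on call.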
This is obviously not acceptable behavior. Does anyone know of:
* options within AdvFS that would prevent it from failing the device
so quickly?
* options within AdvFS that would make it attempt recovery after a failure?
* alternatives, other than ASE or TruCluster, that would give us
the quick recoverability of AdvFS at boot time while still assuring
data integrity?
Any other suggestions for dealing with this?
TIA.
--
Judith Reed
jreed_at_appliedtheory.com
(315) 453-2912 x335
Received on Thu Mar 11 1999 - 15:33:01 NZDT