I have an AlphaStation 500 functioning as an AFS server. It has a
SWXCR KZPCS-BA 3-channel PCI RAID controller, with a couple of BA356
StorageWorks shelves hanging off of it. I am running v2.3 of the Digital UNIX
version of the StorageWorks RAID Array 200 software (and v3.3 of the
standalone software).
I have had some experiences with parity errors that did not go as I expected,
and I am wondering whether the fault lies with me (and my expectations) or
with the software. I would welcome comments and/or advice.
We run swxcrmon to email us when problems occur. About a month or two
ago, I got some swxcrmon messages reporting a couple of parity errors on
the drives of one RAID 5 disk array. As everything was functioning normally
(other than the parity error messages), I was not overly concerned, but
figured I should run swxcrmgr and do a "parity check and repair" operation
(sketched below) to fix the errors. I was rather unpleasantly surprised when,
towards the end of the process, the SWXCR failed two of the drives in the
array, thereby destroying all data on the RAID 5 array.
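For reference, my mental model of what a "parity check and repair" pass does
per stripe is roughly the following. This is a minimal Python sketch of
textbook RAID 5 XOR parity, not the SWXCR firmware's actual logic, and the
function names are my own:

    from functools import reduce

    def xor_blocks(blocks):
        # XOR the corresponding bytes of all blocks together.
        return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

    def check_and_repair_stripe(data_blocks, parity_block, repair=False):
        # Return (ok, parity). ok is False on a parity mismatch. With
        # repair=True the parity is recomputed from the data blocks, which
        # silently blesses the data, whether it is right or wrong.
        expected = xor_blocks(data_blocks)
        if expected == parity_block:
            return True, parity_block
        return False, expected if repair else parity_block

If that model is right, a repair pass should only ever rewrite parity blocks,
which is part of why the drive failures surprised me.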
I was suspicious about the drives being bad, so I swapped them with a pair of
drives in a RAID 0 array that was used for short-term storage. Both RAID
arrays behaved OK afterwards. Then last week I got another swxcrmon message
reporting a parity error on the RAID 5 array, and I figured I would try
repairing parity while just one error was reported, hoping that would avoid
the previous problem. The parity check and repair claimed to have found
several more parity errors, which was in itself disconcerting, but it
completed without failing any disks. However, the statistics still showed
errors, and I was not sure that any repair had taken place. So I tried it
again, and once more it failed two disks (one of the failed positions was the
same as last time, but both physical drives were different). swxcrmgr crashed
shortly afterwards and would no longer run, and eventually (i.e., around
5 minutes later) the whole machine crashed and started a reboot. (I believe
something similar happened the first time as well.)
I was quite angry by now, and really doubted that I had 4 failed drives in a
matter of two months, particularly when they always failed in "pairs". So
after getting the system to come up again, I used swxcrmgr to manually "make
optimal" the two "failed" drives, fsck'ed the file systems (a few errors, but
simple ones that the -p option would fix), and mounted the filesystems. The
filesystems looked fine: the major directory structure was correct and some
randomly selected files looked OK. There could be some data damage, but this
does not look like a truly bad disk drive. A parity check now reveals no
errors.
Am I wrong in assuming that parity check and repair is supposed to be a
relatively safe operation? (I understand that a parity error indicates some
data loss, and that repairing it may make that loss permanent. But basically,
a block with a parity error falls into one of two cases: either the data is
correct and the parity wrong, in which case rewriting the parity is harmless,
or the parity is correct and the data wrong, which means data is being lost.
But in the latter case that file was already "corrupted" on the filesystem,
and the repair merely makes it impossible to use the parity information to
recover it, which is non-trivial anyway and probably easier to handle by
restoring from backup. The toy demonstration below shows why the controller
cannot tell the two cases apart.) Or am I missing something, and parity
repair is a drastic procedure that should only be run in extreme
circumstances?
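To illustrate the ambiguity, here is a self-contained toy in Python (all the
names are mine, and this is of course not how the controller is implemented):

    from functools import reduce
    import random

    def xor_blocks(blocks):
        return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

    # Build a 6+1 stripe of random data blocks plus their XOR parity.
    data = [bytes(random.randrange(256) for _ in range(8)) for _ in range(6)]
    parity = xor_blocks(data)

    # Case 1: flip a bit in a data block; the check fails, parity was right.
    bad_data = list(data)
    bad_data[2] = bytes([bad_data[2][0] ^ 0x01]) + bad_data[2][1:]
    print(xor_blocks(bad_data) == parity)      # False

    # Case 2: flip a bit in the parity block; the check fails identically.
    bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]
    print(xor_blocks(data) == bad_parity)      # False

From the controller's point of view both cases are the same mismatch, which
is presumably why "repair" just rewrites parity from data and hopes for the
best.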
Has anyone else observed similar behavior in parity repairs, or does anyone
know of any bug fixes or firmware/software upgrades for the SWXCR controller?
I could not find anything that seemed relevant on Compaq's web site.
What is the recommended procedure for dealing with an occasional parity error
on a RAID array? Replacing the drive seems like overkill.
I am also a bit concerned about the rate of parity errors. I have three
RAID 5 arrays on three different machines, and this is the only one for which
swxcrmon has been reporting parity errors, and only in the past couple of
months. The rate is not high (one or two a month) except when compared to the
other shelves and to this shelf's own past history. It is also disconcerting
that running a parity check reveals errors that swxcrmon did not appear to
report to me, boosting the rate to perhaps one or two a week. These errors
are scattered among the seven drives in that array, so it does not appear to
be a simple case of one drive beginning to go bad. (Or is it? If a single
drive is acting up, it will get parity errors assigned to it when it holds
the parity for a stripe, but it may also corrupt data in other stripes, which
would cause parity errors to be assigned to the parity disks for those
stripes; the toy simulation below illustrates this. But then how can I
identify the misbehaving drive?) I am also concerned that it could be a
problem with the shelf, the I/O module, or even the SWXCR card itself (there
are two other shelves hanging off the card, but they hold RAID 0 or JBOD
disks, so no parity errors would be reported for them). Any suggestions?
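Here is the toy simulation I mean (Python again; the rotating-parity layout,
stripe mod n_drives, is my assumption, and the SWXCR's real layout may well
differ):

    from collections import Counter
    import random

    n_drives = 7
    charged = Counter()
    for _ in range(100):                   # 100 simulated corruption events
        stripe = random.randrange(100000)  # flaky drive hits a random stripe
        parity_holder = stripe % n_drives  # assumed parity rotation
        charged[parity_holder] += 1        # blame goes to the parity member
    print(sorted(charged.items()))

The flaky drive's identity never enters the attribution at all (it is blamed
directly only on the stripes where it happens to hold parity), so the
per-drive error counts come out roughly uniform, which would match what I am
seeing and would make the reports useless for fingering the culprit.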
Thanks in advance. Please email responses to me and I will post a summary.
Tom Payerle
Dept of Physics payerle_at_physics.umd.edu
University of Maryland (301) 405-6973
College Park, MD 20742-4111 Fax: (301) 314-9525