Original Question/Problem:
This is one of the strangest problems I've ever encountered. I have an
AlphaServer 4000 with 2 pairs of dual-redundant HSZ70's. Each pair of
controllers has five (5) RAID5 raid sets, bringing my total number of
logical devices to ten. At some point the integrity of one of the raid
sets (it happens to be a mount point we call /stripe9) came into question
based on the database we're running.
To make a long process of troubleshooting short, I whittled the problem
down to a UNIX copy (cp) command. Copying from any of my logical devices
to any other works fine (with one important exception!): cp reports no
errors, and if I compare the source and destination files using cmp there
are no differences.
However if I copy from any logical device to /stripe9 the copy completes
successfully, BUT a cmp almost always shows that the files are NOT
identical! No errors are being reported by my controllers. I have only seen
one error show up in DECevent and admittedly it does seem to be pointing to
a problem with /stripe9, although the error message seemed to indicate it
was a self-correcting problem.
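
For anyone who wants to repeat this kind of check, the cp-then-cmp test
described above can be scripted roughly as follows. This is only a sketch,
not what I actually ran: the function names (first_difference,
copy_and_verify), the 1 MiB chunk size, the scratch file names and the
number of rounds are all made up for illustration.

```python
import os
import shutil

CHUNK = 1 << 20  # read 1 MiB at a time (arbitrary choice)


def first_difference(path_a, path_b):
    """Return the byte offset of the first mismatch, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                # Locate the first differing byte inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte found: one file is a prefix of the other.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended with no mismatch
            offset += len(a)


def copy_and_verify(src, dst_dir, rounds=5):
    """Copy src into dst_dir several times and compare each copy.

    Returns a list of (round, offset) tuples, one per copy that differed.
    """
    failures = []
    for n in range(rounds):
        dst = os.path.join(dst_dir, f"verify_{n}.tmp")  # hypothetical name
        shutil.copyfile(src, dst)
        diff = first_difference(src, dst)
        if diff is not None:
            failures.append((n, diff))
        os.remove(dst)
    return failures
```

Calling copy_and_verify("/some/large/file", "/stripe9") would return an
empty list on a healthy device. Repeating the copy matters because the
mismatch showed up "almost always", not every time. Note also that a
read-back done right after the copy may be served from the buffer cache
rather than from the disks, so a clean result is weaker evidence than a
mismatch.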
Only 2 replies. Thanks to:
Terry Horsnell
Balaji
Digital field service eventually got the problem fixed for me, but I don't
know if you can call this a "solution". This is the kind of problem that
keeps poor System Managers awake at night:
Digital recommended patching the system and upgrading to 2.8 of DECevent in
hopes of capturing more info. The former didn't fix the problem; the latter
didn't log any more events. (Note that the insidious aspect of this problem
was that errors were NOT being logged! A copy would apparently succeed - no
errors were generated, yet the input/output files were DIFFERENT!)
Anyway, as I'd mentioned above, there was one error in the DECevent log.
Digital field service suggested replacing the drive it pointed to (one of
the six drives in the /stripe9 RAID5 raidset). Because there were several
sparesets available, field service just pulled out the suspect drive. After
the raid set regenerated itself (I wanted to wait, just to be safe), the
problem went away.
Like I said, not the kind of problem that gives you the "warm and fuzzies".
This shouldn't happen - EVER!
Chris
cknorr_at_hops.com
305-827-8600 ext. 238 (voice)
305-827-0999 (fax)
Received on Mon Oct 05 1998 - 21:24:03 NZDT