PLATFORM: 2100 server (sable)
OS: Digital Unix 3.2
PROBLEM: LSM sporadically loses portions of a volume
Thanks to everyone who responded to the problem regarding our power
supply. The power supply had completely failed and required
replacement.
This question is long because we're really confused. We're getting
inconsistent symptoms and I've tried to include all info that might be
relevant. Thanks for all suggestions. Please reply to
reece_at_mbb.wustl.edu.
SYMPTOM
We're using LSM to combine disks and have been doing so without
problems for months. Recently (within the last week), we've started
losing portions of two of our volumes.
A typical symptom is that ls gives something like:
bash$ ls -l
Cannot access ./guitar: I/O error
Cannot access ./news: I/O error
Cannot access ./surfing: I/O error
total 36
drwxr-xr-x   6 reece    user    1536 Sep  7 08:38 bin
drwxr-xr-x   2 reece    user    1536 Sep 15 13:10 humor
drwxr-xr-x   8 reece    user     512 Aug  7 10:51 img
SETUP
Our setup is that we've got 2 volumes (vol1 and vol2), each of which
consists of a 2-disk plex. All disks are on the same SCSI channel in
an external drive box.
# volprint -h vol1 vol2
TYPE  NAME      ASSOC     KSTATE   LENGTH   COMMENT
vol   vol1      fsgen     ENABLED  9765864
plex  vol1-01   vol1      ENABLED  9765864
sd    rz0c-01   vol1-01   -        7812342
sd    rz2c-01   vol1-01   -        1953522
vol   vol2      fsgen     ENABLED  9765880
plex  vol2-01   vol2      ENABLED  9765880
sd    rz1c-01   vol2-01   -        7812342
sd    rz3c-01   vol2-01   -        1953538
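For anyone trying to work out which physical disk a failing block lives on, here is a minimal sketch of how a volume offset would map onto the subdisks listed above. It assumes each plex is a plain concatenation of its subdisks in the order volprint prints them; the output does not actually state the layout, so a striped plex would map differently. The names SUBDISKS, volume_length and locate are made up for this illustration, and offsets are in whatever units volprint reports in its LENGTH column.

```python
# Sketch only: map a volume offset to the subdisk that would hold it.
# ASSUMPTION: each plex is a simple concatenation of its subdisks, in the
# order volprint lists them. The volprint output does not confirm this.

# Subdisk names and lengths copied from the volprint output.
SUBDISKS = {
    "vol1": [("rz0c-01", 7812342), ("rz2c-01", 1953522)],
    "vol2": [("rz1c-01", 7812342), ("rz3c-01", 1953538)],
}


def volume_length(volume):
    """Total length of a volume: the sum of its subdisk lengths."""
    return sum(length for _, length in SUBDISKS[volume])


def locate(volume, offset):
    """Return (subdisk_name, offset_within_subdisk) for a volume offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    start = 0  # volume offset at which the current subdisk begins
    for name, length in SUBDISKS[volume]:
        if offset < start + length:
            return name, offset - start
        start += length
    raise ValueError("offset %d is past the end of %s" % (offset, volume))
```

Under that assumption the subdisk lengths add up to the plex lengths shown (9765864 for vol1, 9765880 for vol2), and locate("vol1", 7812342) lands on the first block of rz2c-01. A file system spread across the whole volume would then have blocks on both disks, so trouble on one disk could show up in unrelated directories.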
DETAILS WHICH MAY BE IMPORTANT
* If volprint shows the volume to be ENABLED, then rebooting seems to
solve the problem.
* Rarely (twice) the volume becomes DISABLED, and each time it has been
because disk rz0c (in plex vol1-01) was disabled. PROM-level show
devices can't find the disk. Shutting off disk drive power and booting
from the power-off state seems to work.
* We lose vol1 much more frequently than vol2.
* Apparently no data is lost.
* All of these disks are on the same SCSI channel.
* Both file and directory entries are lost.
* Which files and directories are "lost" (as in the above ls) is not
consistent from failure to failure and doesn't appear to be
isolated to only certain directory hierarchies.
* Files and directories which do not cause errors in a directory
listing are not necessarily accessible. cat'ing such a file may or
may not result in an I/O error.
* The problem seems to be occurring with increasing frequency.
* The computer was recently moved (~1 week before symptoms) to a new
room which is notably warmer than the previous location. I'd
estimate it's 25C/77F.
* The failure seems to come on progressively rather than all at once.
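Since the set of "lost" entries changes from failure to failure, it may help to record exactly which paths fail each time so the lists can be compared. The following is a rough sketch, not a tested tool for this system: it walks a directory tree, stats every entry, reads every regular file in full, and collects the paths that fail with an I/O error (EIO). The names check_path and find_io_errors are hypothetical, and other errors such as permission problems are deliberately ignored.

```python
import errno
import os
import stat

CHUNK = 64 * 1024  # read size in bytes; an arbitrary choice


def check_path(path):
    """Return True if stat-ing or fully reading path fails with EIO."""
    try:
        st = os.lstat(path)
        if stat.S_ISREG(st.st_mode):
            # Read the whole file, since a file that lists cleanly
            # may still fail partway through a read.
            with open(path, "rb") as fh:
                while fh.read(CHUNK):
                    pass
    except OSError as err:
        return err.errno == errno.EIO
    return False


def find_io_errors(root):
    """Return a sorted list of paths under root that hit an I/O error."""
    bad = set()

    def on_walk_error(err):
        # os.walk calls this when it cannot list a directory.
        if err.errno == errno.EIO and err.filename is not None:
            bad.add(err.filename)

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if check_path(path):
                bad.add(path)
    return sorted(bad)
```

Calling find_io_errors on the mount point of a volume after each failure and saving the result with a timestamp would give lists to compare. Reading every file is slow on a large volume, so restricting the walk to a few directories may be more practical.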
HYPOTHESES
* one failing disk drive causes problems for the other drives on the
same SCSI chain?
* room too warm for extended usage?
SOLUTIONS TRIED
* swap cables, terminators
Thanks for sending all suggestions to reece_at_mbb.wustl.edu. A spare
4GB disk would be nice too.
--
Reece Kimball Hart | email: reece_at_dasher.wustl.edu
Biophysics & Biochemistry, Box 8231 | WWW: http://dasher.wustl.edu/~reece/
Washington Univ. School of Medicine | Phone: (314) 362-4198 (lab)
660 South Euclid | -7183 (fax)
St. Louis, Missouri 63110 (USA) | PGP public key available by finger & WWW
Received on Wed Sep 20 1995 - 23:09:56 NZST