LSM sporadically loses portions of a volume

From: Reece Kimball Hart <reece_at_bmb.wustl.edu>
Date: Wed, 20 Sep 1995 15:45:32 -0500 (CDT)

PLATFORM: 2100 server (sable)
OS: Digital Unix 3.2
PROBLEM: LSM sporadically loses portions of a volume

Thanks to everyone who responded to the problem regarding our power
supply. The power supply had completely failed and required
replacement.

This question is long because we're really confused. We're getting
inconsistent symptoms and I've tried to include all info that might be
relevant. Thanks for all suggestions. Please reply to
reece_at_mbb.wustl.edu.


SYMPTOM
We're using LSM to combine disks and have been doing so without
problems for months. Recently (within the last week), we've started
losing portions of two of our volumes.

A typical symptom is that ls gives something like:
bash$ ls -l
Cannot access ./guitar: I/O error
Cannot access ./news: I/O error
Cannot access ./surfing: I/O error
total 36
drwxr-xr-x 6 reece user 1536 Sep 7 08:38 bin
drwxr-xr-x 2 reece user 1536 Sep 15 13:10 humor
drwxr-xr-x 8 reece user 512 Aug 7 10:51 img


SETUP
We have 2 volumes (vol1 and vol2), each of which consists of a single
2-disk plex. All disks are on the same SCSI channel in an external
drive box.

# volprint -h vol1 vol2
TYPE NAME ASSOC KSTATE LENGTH COMMENT
vol vol1 fsgen ENABLED 9765864
plex vol1-01 vol1 ENABLED 9765864
sd rz0c-01 vol1-01 - 7812342
sd rz2c-01 vol1-01 - 1953522

vol vol2 fsgen ENABLED 9765880
plex vol2-01 vol2 ENABLED 9765880
sd rz1c-01 vol2-01 - 7812342
sd rz3c-01 vol2-01 - 1953538
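
One quick sanity check on the listing above is that the subdisk lengths
in each plex add up to the plex length. The following is only a sketch in
Python, using the numbers exactly as printed. The function name
check_plexes and the constant SECTOR_BYTES are made up for illustration,
and the 512-byte unit for LENGTH is an assumption that should be checked
against the LSM documentation.

```python
# Sanity-check a "volprint -h" listing: the subdisk lengths in each plex
# should sum to the plex length. The data is copied from the listing above.
VOLPRINT = """\
vol vol1 fsgen ENABLED 9765864
plex vol1-01 vol1 ENABLED 9765864
sd rz0c-01 vol1-01 - 7812342
sd rz2c-01 vol1-01 - 1953522
vol vol2 fsgen ENABLED 9765880
plex vol2-01 vol2 ENABLED 9765880
sd rz1c-01 vol2-01 - 7812342
sd rz3c-01 vol2-01 - 1953538
"""

SECTOR_BYTES = 512  # assumption: LENGTH is reported in 512-byte sectors


def check_plexes(text):
    """Map each plex name to (plex length, sum of its subdisk lengths)."""
    plex_len = {}
    sd_total = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or header-like lines
        rec_type, name, assoc, _kstate, length = fields[:5]
        if rec_type == "plex":
            plex_len[name] = int(length)
        elif rec_type == "sd":
            # For a subdisk, ASSOC names the plex it belongs to.
            sd_total[assoc] = sd_total.get(assoc, 0) + int(length)
    return {name: (length, sd_total.get(name, 0))
            for name, length in plex_len.items()}


report = check_plexes(VOLPRINT)
for plex, (length, total) in sorted(report.items()):
    status = "ok" if length == total else "MISMATCH"
    gib = length * SECTOR_BYTES / 2**30
    print(f"{plex}: plex={length} subdisks={total} {status} (~{gib:.2f} GiB)")
```

If the sums match, the configuration records themselves are consistent,
which would point away from a damaged LSM configuration and toward the
disks or the bus.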


DETAILS WHICH MAY BE IMPORTANT
* If volprint shows the volume to be ENABLED, then rebooting seems to
  solve the problem.
* Rarely (twice so far) the volume becomes DISABLED, and both times it
  was because disk rz0c (in plex vol1-01) was disabled. PROM-level
  show devices can't find the disk. Shutting off disk drive power and
  booting from power-off state seems to work.
* We lose vol1 much more frequently than vol2.
* Apparently no data is lost.
* All of these disks are on the same SCSI channel.
* Both file and directory entries are lost.
* Which files and directories are "lost" (as in the above ls) is not
  consistent from failure to failure and doesn't appear to be
  isolated to only certain directory hierarchies.
* Files and directories which do not cause errors in a directory
  listing are not necessarily accessible. cat'ing such a file may or
  may not result in an I/O error.
* The problem seems to be occurring with increasing frequency.
* The computer was recently moved (~1 week before symptoms) to a new
  room which is notably warmer than the previous location. I'd
  estimate it's 25C/77F.
* Each failure seems to come on progressively rather than all at once.
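
Since the set of "lost" entries changes from failure to failure, it may
help to record exactly which paths return I/O errors each time and
compare the lists. What follows is a rough sketch in Python, not
something that has been run on this system. The helper names
find_io_errors and diff_failures are hypothetical, and the 8192-byte
test read is an arbitrary choice. It mimics what ls -l and cat do: stat
every entry, then read the start of every regular file.

```python
import errno
import os
import stat


def find_io_errors(root):
    """Return a sorted list of paths under root that fail with EIO."""
    failed = []

    def note_walk_error(err):
        # os.walk calls this when it cannot list a directory.
        if err.errno == errno.EIO:
            failed.append(err.filename)

    for dirpath, dirnames, filenames in os.walk(root, onerror=note_walk_error):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # the call "ls -l" depends on
                if stat.S_ISREG(info.st_mode):
                    with open(path, "rb") as handle:
                        handle.read(8192)  # touch the start of the file
            except OSError as err:
                # Only I/O errors are of interest; ignore permission errors etc.
                if err.errno == errno.EIO:
                    failed.append(path)
    return sorted(set(failed))


def diff_failures(before, after):
    """Split two scans into (cleared, new, persistent) path lists."""
    before, after = set(before), set(after)
    return sorted(before - after), sorted(after - before), sorted(before & after)
```

Running find_io_errors on the mount point during each failure and
feeding two saved lists to diff_failures would show whether the same
paths keep failing or whether the set really does move around.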


HYPOTHESES
* A failing disk drive is causing problems for the other drives on
  the same SCSI chain?
* The room is too warm for extended usage?


SOLUTIONS TRIED
* Swapped cables and terminators.

Thanks for sending all suggestions to reece_at_mbb.wustl.edu. A spare
4GB disk would be nice too.

--
Reece Kimball Hart                  | email: reece_at_dasher.wustl.edu
Biophysics & Biochemistry, Box 8231 | WWW:   http://dasher.wustl.edu/~reece/
Washington Univ. School of Medicine | Phone: (314) 362-4198 (lab)
660 South Euclid                    |                 -7183 (fax)
St. Louis, Missouri  63110    (USA) | PGP public key available by finger & WWW
Received on Wed Sep 20 1995 - 23:09:56 NZST