I think my advfs system disk is failing. What to do?

From: Robin Kundert <rkundert_at_paul.spu.edu>
Date: Wed, 10 Jan 1996 11:58:39 -0800 (PST)

History
-------
Just before Christmas we installed a bunch of patches that DEC said would
bring our Digital Unix V3.2b system up to current patch levels (almost
3.2c).

Problems
--------
The crashes that we were experiencing before have gone away, but we are
seeing a whole bunch of advfs errors that look like this:

Jan 7 19:35:21 paul vmunix: advfs I/O error: setId 0x305b6a24.000ee0b0.1.8001 tag 0x000007e1.83f0u page 4
Jan 7 19:35:24 paul vmunix: vd 1 blk 38384 blkCnt 16
Jan 7 19:35:24 paul vmunix: read error = 5
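
For what it's worth, here is my best guess at what that message is saying,
and a sketch of how I think it could be traced back to a file. I'm not sure
the .tags trick or the tag2name helper behave the same way on V3.2, and the
domain name and mount point below are placeholders, so treat all of this as
an assumption:

    # 'read error = 5' is just errno 5 (EIO): the driver gave up on the I/O.
    # 'vd 1' is the volume index inside the file domain; showfdmn maps that
    # index back to a device (domain name here is an example).
    showfdmn root_domain

    # Each AdvFS fileset has a hidden .tags directory, and the tag from the
    # message (0x7e1 = 2017 decimal) can be looked up through it; tag2name,
    # if it exists on this release, prints the real pathname.
    ls -l /mount_point/.tags/2017
    /sbin/advfs/tag2name /mount_point/.tags/2017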

The 'advfs' errors are becoming a BIG problem. These messages have been
occurring several times a day, often with a couple of them arriving within
seconds of each other, and the frequency seems to be increasing. Since
installing the patches we've accumulated a lot of those errors, have not
(to the best of my knowledge) been able to reboot without doing a hardware
reset, and have had at least 3 files corrupted (/sbin/init, /etc/passwd,
and /var/adm/binary.errlog).

Perhaps related to this, we have been seeing mysterious behavior that was
described by another of our system guys here at SPU as the system going
into single user mode "out of the blue". At the very least, it would be
nice to have an authoritative explanation of what the 'advfs' errors mean.

Theories, Observations, and Questions
-------------------------------------
   o Have the patches introduced other problems?
   o Is a disk going bad (my favorite)? A sketch of how I'd check the
     error log for this follows the list.
   o Have the patches revealed an existing problem that was masked before?
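
On the disk theory (second bullet above), my assumption is that the binary
error log should show whether the drive itself is reporting hard or
recovered errors. Ironically, /var/adm/binary.errlog is one of the files
that got corrupted, so this may only help going forward, and I'm assuming
uerf on V3.2 still takes these options:

    # Newest entries first, full detail (device name, error type, etc.).
    uerf -R -o full | more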

All of the trouble seems to be on filesets that involve one suspect disk.
I used 'scu' to examine the defect list for the suspect (an RZ26L) and
found that the defect list has 'grown' from 71 to 299 entries. Everything
except the swap partitions (including the root partition) is under the
control of 'advfs'. I'm hoping that since the root partition is 'advfs',
we'll be able to migrate off of that failing drive to a different one with
minimal downtime. I have not been able to find documentation that says I
can actually do so or that explains precisely how to do it.
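
For anyone who wants to repeat the defect-list check, something along these
lines should do it. I'm going from memory on the exact scu syntax, and the
device name is an example, so take this as an approximation:

    # Point scu at the raw device for the suspect RZ26L and ask for the
    # defect lists; the grown (G) list is the one that changes over the
    # life of the drive.
    scu -f /dev/rrz1c
    scu> show defects
    scu> exit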

   o Can we add another partition to the root fileset and migrate
     everything over there? (A sketch of what I have in mind follows
     this list.)
   o If we can, how do we convince the system to boot from the other disk?
     (Also sketched after the list.)
   o What can we do to facilitate recovery of the root partition in case
     the whole thing dies before we can arrive at a "kinder" solution?
   o Are there any tools that would tell me if a piece of hardware (disk)
     is really going bad?
   o How are bad and failing disk blocks handled by the drive and/or
     Digital Unix?
   o Is it true that the disk drive relocates failing blocks?
   o If that is true, how do you tell from the error logs what actions
     the disk drive has taken and how do you distinguish a successfully
     recovered block from a corrupt block?
   o What should be considered an acceptable rate or level of 'grown' bad
     blocks?
   o Should I consider LSM or LVM instead of or in addition to
     'advfs' in the future?
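
To make the first couple of questions concrete, this is the sort of thing I
have in mind. The domain names, device names, and mount point are made up
for illustration, and I don't even know whether V3.2 allows a multi-volume
root domain, which is exactly what I'm asking:

    # --- Idea 1: grow the domain onto a good disk, then pull the bad one ---
    # (needs the AdvFS Utilities license; may not be allowed for root_domain)
    addvol /dev/rz3c root_domain    # add a volume on the good disk
    rmvol /dev/rz1a root_domain     # migrate data off and drop the bad volume

    # --- Idea 2: copy root to a fresh domain on the good disk ---
    disklabel -rw rz3 rz26l                  # label the new disk (type per /etc/disktab)
    mkfdmn /dev/rz3a new_root                # new domain on its 'a' partition
    mkfset new_root root                     # root fileset inside it
    mount -t advfs new_root#root /mnt        # /mnt is just an example
    vdump -0 -f - / | vrestore -x -f - -D /mnt   # copy the live root over

Idea 2 would presumably also need boot blocks written to the new disk and
the copy's /etc/fstab and /etc/fdmns entries pointed at the new domain,
which is part of what I can't find documented.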
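
For the "how do we boot from the other disk" question, my assumption is
that once the new disk is bootable it comes down to the SRM console
variables (console device names here are examples only):

    >>> show device                # find the console name of the new disk
    >>> set bootdef_dev dka300     # make it the default boot device
    >>> boot dka300                # or boot it once by hand to test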

Needless to say, I'm worried. Any advice you can offer would be
appreciated.


  * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 * Robin D. Kundert rkundert_at_spu.edu *
 * Seattle Pacific University (206) 281-2507 *
 * Computer & Information Systems, 3307 3rd Ave. West, Seattle, WA 98119 *
  * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *