Hi,
I had requested help with AdvFS I/O errors that I was (still am) receiving
in the /var/adm/messages file (and presumably the binary.errlog). See
original post at bottom.
The consensus is that it is a bad block on the drive. There was no
'silver-bullet' solution, just a number of tools and techniques that varied
from attempting to reassign the bad block to restoring from a backup. Most
agreed that a new drive would probably be in order. My short-term solution
was simply to rename the affected directory (e.g., mv /files /files_corrupted),
recreate it (mkdir /files), and copy the files back in from their original
source (luckily for me they were on another machine). I did run verify (from
single-user mode), which did not fix the problem. I will try running
'verify -f' and 'verify -d' the next time I'm able to shut down the machine.
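For the record, the workaround amounted to commands along these lines (the
directory and host names are from my setup, so treat them as placeholders):

# mv /files /files_corrupted
# mkdir /files
# rsh othermachine 'cd /files && tar cf - .' | (cd /files && tar xpf -)

and then, from single-user mode with the filesets unmounted, something like
the following against the affected domain (home_domain in my case; check
verify(8) for exactly what -f and -d do on your patch level):

# verify home_domain
# verify -f home_domain
# verify -d home_domain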
Here are the responses:
==============================================================================
Usually, for me, this type of error is a hardware error with the disk drive.
==============================================================================
It looks like a bad disk: it has bad blocks on it. But you can still
confirm by checking with:
# dia -R | more
Look at what the errors say, whether it's a hard error or bad blocks
developing on the disk. You may have to replace the drive.
==============================================================================
Most likely these times correspond to either defragmentation (from root's
crontab) or a backup job.
This is most likely because the disk is bad, since too little space on the
remaining devices gives a different error message.
Try making a backup before you do anything, then run a verify (which will
probably tell you that the domain is corrupt).
Depending on which files are corrupt, you might want to consider doing a
restore from the last good backup, or as a last resort try recovering the
files.
The right approach is to get a new, known-good disk, or try to mark the
problematic areas bad (try doing a low-level format with scu).
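Taking that advice literally, the backup-then-verify step would look roughly
like this, assuming the fileset is mounted at /home and the tape drive is
/dev/nrmt0h (both placeholders for whatever your system actually uses):

# vdump -0 -u -f /dev/nrmt0h /home
# umount /home
# verify home_domain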
==============================================================================
You've got the volume, which should tell you which domain you're
dealing with. For each fileset on the domain, you've got a "tags"
directory. There are tools you can use (or manual methods) that
will let you find which FILE in WHICH FILESET contains the bad
block. If it's a user file, you need to move it; you may not
be able to read all the data. Once you make a COPY of the file
(e.g., to a backup tape or another file system), remove the old
file. AdvFS KNOWS that the page/block is bad and won't use it
for any other file in the future.
If it's AdvFS metadata, then you may need to back up the entire
file system (all filesets) and recreate the domain to remove the
disk. Until you remove the disk, you can't replace it with a new
one. And you don't need to replace it at all once you get AdvFS to
simply stop using the known bad spot.
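To make that "find which file" step concrete: if the Tag in the error message
is a positive value (a user file) rather than a negative metadata tag, one way
to locate it, I believe, is through the fileset's .tags directory or by inode
number, since AdvFS reports a file's tag as its inode number. A rough sketch,
assuming the fileset is mounted at /home and using a made-up decimal tag of
1234 (convert the hex tag from the error message to decimal first):

# ls -il /home/.tags/1234
# find /home -xdev -inum 1234 -print

Then copy out whatever is readable and remove the original, as described above.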
==============================================================================
If the underlying device supports it, you can force a
reassignment of the bad block to a spare one using
scu(8) (reassign lba). The problem is that this will almost
certainly corrupt the data in that block. Of course, the bad block
may already be causing the domain to panic. The AdvFS
management documentation may offer advice on what to
do, but most likely you'll have to recreate the domain
and restore from the last backup. If the backup is out
of date, you can use salvage to get the data not affected
by the error.
Last I looked, the AdvFS management documentation is in the
documentation directory of the AdvFS utilities on one of
the Associated Products CDROMs.
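For what it's worth, the scu session would look roughly like this. This is a
sketch, not something to run blindly: it destroys whatever data is in that
block, and the lba argument is the absolute block number on the device. Since
rz9c is the whole-disk c partition starting at block 0, that is presumably the
Block value from the error message, but check the disklabel first:

# scu -f /dev/rrz9c
scu> reassign lba 13296
scu> exit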
==============================================================================
I don't think that using multiple partitions is a good idea to
prevent this situation from recurring. The negative tag (any tag
whose hex value leads with fffffff is negative) is an
internal-to-AdvFS metafile. The fact that it's messed up is why
you're getting the I/O error when you try to rmvol the volume. It's
a serious problem.
Using partitions other than (c) won't help if the same kind of
metadata corruption occurs again, since you would not be able to
rmvol the smaller partition, and would therefore be in the same state
you're in now.
I'd approach it the following way (rough commands are sketched after these steps):
1. Unmount and run verify(8) on the domain, as you suggested.
2. Assuming it can fix the metadata, rmvol the volume on the bad disk
(which will only work if there's enough space on the other volumes in
the domain to hold the data on rz9c).
3. Replace the disk.
4. Mount the domain, and use addvol(8) to add the replacement disk to
the file domain.
5. Use balance(8) to balance the file domain, which will move files
onto the new volume.
If the verify fails, you will likely need to vdump the data on the
domain, dissolve the whole thing, and re-create it with a new disk in
place of the bad rz9c.
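In command form, and using the names from the original post (home_domain and
the bad volume /dev/rz9c), those five steps come out roughly as follows. The
mount/umount bookkeeping is left out: verify wants the filesets unmounted,
while addvol and balance run against the mounted domain, as the steps above
say. This also assumes the replacement drive goes in at the same SCSI ID, so
it keeps the rz9c name:

# verify home_domain                              (step 1)
# rmvol /dev/rz9c home_domain                     (step 2)
  ...physically replace the disk and label it...  (step 3)
# addvol /dev/rz9c home_domain                    (step 4)
# balance home_domain                             (step 5)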
==============================================================================
If it can't fix the metadata you're basically stuck. Try a verify
-f anyway. If it doesn't help, you'll have to dissolve the WHOLE
domain and re-create it, after backing up the data. vdump will only
read the actual files, not the metadata, so it should be OK, as long
as the files are readable.
I think your best bet at that point is (the full sequence is sketched in commands after these steps):
1. Try the verify -f.
2. Regardless of the result, vdump the file domain to tape just in case.
Use the -x switch to vdump to compute CRCs on every eight data blocks
placed on the tape. That way you've got some protection against tape
errors when reading the saveset back in when you do the restore.
3. Try the rmvol on the bad disk. If it succeeds, replace the disk and
use addvol to add the new disk to the domain and stop -- you're done.
If the rmvol fails again, proceed with steps 4 through 8 below.
If the rmvol(8) fails:
4. Dismount the file domain. Then dissolve the entire file domain with
rmfdmn(8).
5. Replace the bad disk.
6. Create a new file domain with mkfdmn(8), use addvol(8) to add in the
other disks that comprised the domain.
7. Make the fileset (mkfset(8)).
8. Mount it up and vrestore the tape back to the new domain.
If the rmfdmn in step 4 fails, you can work around it as follows:
A) Make sure all filesets in the domain (if more than one) are dismounted.
B) Remove the directory in /etc/fdmns for the file domain.
C) Zap the disklabels on the other disks in the domain to remove all
evidence that they have AdvFS volumes on them:
# disklabel -z rzxx
# disklabel -wr rzxx
NOTE: This step assumes you're just using the "c" partition on the other
drives in the domain, and don't care about the others. Don't do this if
you're using more than one partition on the other drives.
D) Resume with step 5 above.
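Pulling step 2 and steps 4 through 8 together into commands (i.e., the path
where the rmvol in step 3 fails), with some assumed names: the fileset is
called home and mounted at /home, the tape is /dev/nrmt0h, the replacement
disk keeps the rz9c name, and rzXc stands for each of the other volumes in the
domain (all placeholders, not values from the original post):

# vdump -0 -u -x 8 -f /dev/nrmt0h /home      (step 2; see vdump(8) for the -x block count)
# umount /home                               (step 4)
# rmfdmn home_domain                         (step 4)
  ...replace the bad disk...                 (step 5)
# mkfdmn /dev/rz9c home_domain               (step 6)
# addvol /dev/rzXc home_domain               (step 6, repeat for each remaining disk)
# mkfset home_domain home                    (step 7)
# mount -t advfs home_domain#home /home      (step 8)
# vrestore -x -f /dev/nrmt0h -D /home        (step 8)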
==============================================================================
There may be some way to remove them through the "tags" file naming; I'm
not an expert on dealing with AdvFS corruption. If the problem really
is in the metadata, you may not be able to remove them. If it's just
the files themselves, and they weren't large (you should only be so
lucky), you might just create a "junk" directory, move them into it
under new names ("junk001", etc.), leave them there, and restore
the data from your backup copies. Other than that, you might have to
re-create the domain and filesets from scratch to get the bad disk out
of use.
==============================================================================
Thanks everybody for your help,
Andy
Andy Cohen
Database Systems Administrator
Cognex Corporation
1 Vision Drive
Natick, MA 01760
----------------------------------------------
e-mail: andy.cohen_at_cognex.com
voice: 508-650-3079
cell: 617-470-0034
fax: 508-650-3337
ORIGINAL POST:
Hi,
In the /var/adm/messages file we're receiving a number of the following
messages:
Aug 1 04:03:15 sif vmunix: AdvFS I/O error:
Aug 1 04:03:15 sif vmunix: Volume: /dev/rz9c
Aug 1 04:03:15 sif vmunix: Tag: 0xfffffff4.0000
Aug 1 04:03:15 sif vmunix: Page: 763
Aug 1 04:03:15 sif vmunix: Block: 13296
Aug 1 04:03:15 sif vmunix: Block count: 16
Aug 1 04:03:15 sif vmunix: Type of operation: Read
Aug 1 04:03:15 sif vmunix: Error: 5
It started two nights ago, and when it occurs (e.g., 1:05 am, 4:03 am) it
lasts only a few minutes and then doesn't happen again until the next night
(morning). This device (/dev/rz9c) is part of our home_domain. I tried
issuing:
rmvol /dev/rz9c home_domain
and after several minutes received:
rmvol: Can't get volume file descriptors
rmvol: Error = I/O error
rmvol: Can't remove volume '/dev/rz9c' from domain 'home_domain'
I've never had to deal with these sorts of errors and was wondering what the
best approach would be. I thought I'd try running verify on this domain. I
see that I have to unmount it first (which means I have to kick all the
users off) so I'll try that after hours. Anything else I can do? The block
number seems to be the same each time so I'm wondering if it's just a bad
block. When I go to use /dev/rz9 again I think I'll use partitions (e.g.,
a, b, d, e, f) instead of the whole thing (partition c), so that if this error
occurs again I can just remove the partition that the error is occurring on. Is
that the right approach?
Received on Wed Aug 08 2001 - 18:50:21 NZST