SUMMARY: disk I/O error, bad sector? from Ronald D. Bowman on 1998-06-10 (tru64-unix-managers)

From: Ronald D. Bowman <rdbowma_at_tsi.clemson.edu>
Date: Tue, 09 Jun 1998 17:18:57 -0400

I want to thank the following for their replies:
Alan Rollow alan_at_nabeth.cxo.dec.com
Pat CHANTHP_at_POLAROID.COM
Dr. Tom Blinn <tpb_at_zk3.dec.com>

Both Alan and Dr. Blinn provided excellent information (as usual).

My original post is rather long (and so is this summary), but in
short, we were having a problem with NSR backing up some newly
added files. It appeared that there might be a tape problem;
however, using dump produced errors indicating a hard drive
problem when reading files from the drive.

Solution(s):
        Just to be safe, we cleaned the tape drive, but with more work we
        found that the error was actually on the hard drive.

        Alan pointed out the following:
         "This looks like an error on the tape.
        NSR will stop using a tape at the first error."

        So, evidently, when NSR tried to read from the disk and received
        a read error, it decided to quit and mark the tape as full, even
        though this was the first backup on the tape. NSR actually reported
        errors for 6 files that it could not read.


        Thinking the error was on the hard disk, a search of the
        archives resulted in finding a summary on an identical problem posted
        by Colin Brooks in Nov. of 1997. I was not 100% sure if I understood
        what was going on - so I made my post. From the previous summary and
        responses received, I came up with a plan of how to solve my problem.

        Tools to use:
        icheck, ncheck, scu and fsck. I found scu to be the most helpful, and
        could not get some of the others to provide the information I thought
        they were capable of. Our error was in user data files, so fsck could
        not help since, according to Alan, it only reads file system data
        structures. Unfortunately, icheck did not provide any information on
        bad blocks; why, I do not know.


        To keep people from using the partition where the errors were
        occurring, I unmounted it with # umount /space, where /space is the
        partition in question (we knew this because it failed during the dump,
        and it is the partition to which we had just added about 270 MB of
        files). Also, the file system must be unmounted in order to use fsck.

        Then, using # scu -f /dev/rrz0h followed by scu> verify media, I found
        that 8 blocks were unreadable. By accident, I discovered that verify
        media without any block information checks the entire disk
        (fortunately, the only errors were in the /space partition).

        Then scu was used to reassign the blocks that were unreadable:
        scu> reassign lba #### where #### is the block number provided by verify.
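
        With several unreadable blocks, it can help to prepare the list of
        reassign commands before typing them at the scu> prompt. The
        following is only a minimal sketch: the function name and the block
        numbers are hypothetical, and the only thing taken from this summary
        is the "reassign lba ####" command form. It prints the lines rather
        than running scu.

```python
def reassign_commands(bad_lbas):
    """Return one 'reassign lba N' line per unique bad block, ascending."""
    return [f"reassign lba {lba}" for lba in sorted(set(bad_lbas))]


if __name__ == "__main__":
    # Hypothetical block numbers, for illustration only; use the ones
    # that verify media actually reports.
    example_lbas = [1000448, 1000320, 1000448]
    for line in reassign_commands(example_lbas):
        print(line)
```

        Removing duplicates and sorting first keeps the list short and makes
        it easy to check against the verify output.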

        We could do this with little worry about what happened to our data,
        since the partition in question holds software that we have added
        (and is thus easy to recover). Also, the errors did not show up until
        this latest installation of software, and NSR gave us a list of 6
        files that had I/O errors associated with them. Therefore, more than
        likely, the 6 files listed by NSR were the only ones in jeopardy.

        I then re-ran the scu verify command as scu> verify media starting ####
        where #### was a block number just before the first reported
        error. No errors were reported, so I remounted /space using
        # mount /space.

        Running dump on the /space partition was successful, so as far as I
        can tell this has solved the problem for now. We will know for
        certain once the files have been backed up via NSR. The other
        interesting fact is that scu> show defects grown now lists 14 bad
        sectors. Seven of the 8 we added were in consecutive cylinders, each
        with the same head # and sector #. Furthermore, there were already 6
        other entries with cylinder numbers that were consecutive or that fit
        in with the ones we added, and these all had the same head # and
        sector # as the ones we added.
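
        A pattern like this can be picked out of a longer defect list
        mechanically. The sketch below assumes the entries have already been
        copied by hand into (cylinder, head, sector) tuples; the format of
        the show defects output is not reproduced in this summary, so nothing
        here parses it, and the function name and sample values are
        hypothetical.

```python
from collections import defaultdict


def consecutive_runs(defects):
    """Group (cylinder, head, sector) tuples by (head, sector) and split
    each group's cylinders into runs of consecutive values."""
    by_position = defaultdict(set)
    for cylinder, head, sector in defects:
        by_position[(head, sector)].add(cylinder)

    runs = {}
    for position, cylinders in by_position.items():
        ordered = sorted(cylinders)
        groups = [[ordered[0]]]
        for cyl in ordered[1:]:
            if cyl == groups[-1][-1] + 1:
                groups[-1].append(cyl)   # extends the current run
            else:
                groups.append([cyl])     # gap found, start a new run
        runs[position] = groups
    return runs


if __name__ == "__main__":
    # Hypothetical entries, not the values from our disk.
    sample = [(100, 3, 17), (101, 3, 17), (102, 3, 17), (250, 1, 40)]
    for (head, sector), groups in consecutive_runs(sample).items():
        print(f"head {head} sector {sector}: {groups}")
```

        A long run under a single head and sector is the kind of clustering
        described above.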

        The other commands provided some information, but I could not use any
        of it to help with the problem. ncheck -i /dev/rz0h provided the inode
        associated with each file in the /space partition. icheck did not help
        in any way, and fsck -o provided about the same information as icheck.
        I suspect I did not know how to use them to obtain the necessary
        information.

        Thanks again for the help and sorry about the length of this summary.

Ron Bowman
Techno-Sciences, Inc.
rdbowma_at_tsi.clemson.edu
864-646-4028

Alpha EB 21164, 333MHz, 1 CPU
DU 4.0B (564) Patch #6 installed
Searchable Archive URLs:
http://www-archive.ornl.gov:8000/ (simple search)
http://www-archive.ornl.gov:8000/archive/power.htm (more detailed)
The following is a summary only site graciously maintained by Matt Moore.
http://www.bucks.edu/alpha-osf-managers/
Received on Tue Jun 09 1998 - 23:20:03 NZST
