I want to thank the following for their replies:
Alan Rollow alan_at_nabeth.cxo.dec.com
Pat CHANTHP_at_POLAROID.COM
Dr. Tom Blinn <tpb_at_zk3.dec.com>
Both Alan and Dr. Blinn provided excellent information (as usual).
My original post is rather long (and so is this summary), but in
short we were having a problem with NSR backing up some newly
added files. It appeared that there might be a tape problem;
however, using dump resulted in errors indicating a hard drive
problem when reading files from the drive.
Solution(s):
Just to be safe, we cleaned the tape drive. But as we found
out with more work, the error was actually with the hard drive.
Alan pointed out the following:
"This looks like an error on the tape.
NSR will stop using a tape at the first error."
So, evidently what was happening is that when NSR tried to read
from the disk and received a read error, NSR decided to quit and
mark the tape as full, even though this was the first backup on the
tape. In fact, NSR reported errors for 6 files that it could not read.
Thinking the error was on the hard disk, a search of the
archives resulted in finding a summary on an identical problem posted
by Colin Brooks in Nov. of 1997. I was not 100% sure if I understood
what was going on - so I made my post. From the previous summary and
responses received, I came up with a plan of how to solve my problem.
Tools to use:
icheck, ncheck, scu and fsck. I found scu to be the most helpful, and
could not get some of the others to provide the information I thought
that they were capable of. Our error was in user data files, so fsck
could not help since, according to Alan, it only reads file system
data structures. Unfortunately, icheck did not provide any information
on bad blocks; why, I do not know.
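
Since the errors were in user data files rather than file system
structures, one way to list the affected files (while the partition is
still mounted) is simply to read every file and note which ones fail.
The following is only a minimal sketch of that idea in Python, not
something we ran; the function name find_unreadable_files and the chunk
size are my own choices for illustration.

```python
import os


def find_unreadable_files(root, chunk_size=1 << 20):
    """Read every regular file under root from start to end.

    Returns a list of (path, error message) pairs for files whose
    open or read raised an OSError (for example an I/O error caused
    by an unreadable disk block).
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    # Read in chunks so large files do not fill memory.
                    while fh.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, str(exc)))
    return failures
```

Calling find_unreadable_files("/space") would return an empty list if
every file could be read. Files that the current user lacks permission
to read would also show up as failures, so it should be run as root.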
In order to keep people from using the partition where the errors were
occurring, I unmounted it with # umount /space, where /space is the
partition in question (we knew this since it failed during the dump, plus
this is the partition to which we had just added about 270 MB of files).
Also, in order to use fsck, the file system must be unmounted.
Then, using # scu -f /dev/rrz0h followed by scu> verify media, I found
that 8 blocks were unreadable. By accident, I discovered that verify media
without any block information checks the entire disk (fortunately the only
errors were in the /space partition).
Then scu was used to reassign the blocks that were unreadable:
scu> reassign lba #### where #### is the block number provided by verify.
We could do this with little worry about what happened to our data since
the partition in question has software that we have added (thus easy to
recover). Plus, the errors did not show up until this latest installation
of software, and NSR gave us a list of 6 files that had I/O errors
associated with them. Therefore, more than likely the 6 files listed
by NSR were the only ones in jeopardy.
I then re-ran the scu verify command as scu> verify media starting ####
where #### was a block number just before the first reported
error. No errors were reported, so I remounted /space using # mount /space.
Running dump on the /space partition was successful, so
as far as I can tell, this has completely solved the problem for now.
We will only know for certain once the files are backed up via NSR.
The other interesting fact is that scu> show defects grown
now results in a list of 14 bad sectors. Seven of the 8 added were in
consecutive cylinders, each with the same head # and sector #. Furthermore,
there were already 6 other entries with cylinder numbers that were consecutive
or that fit in with the ones we added, and these all had the same head # and
sector # as the ones we added.
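
For anyone who wants to look for the same kind of pattern in a longer
defect list, the grouping can be sketched in Python as below. This
assumes the entries have already been copied out by hand as
(cylinder, head, sector) tuples; it does not parse scu output, and the
sample values are made up purely for illustration (they are not our
actual defect list).

```python
from collections import defaultdict


def group_defects(defects):
    """Group defects by (head, sector) and find consecutive cylinders.

    defects is an iterable of (cylinder, head, sector) tuples.
    Returns a dict mapping (head, sector) to a list of runs, where
    each run is a list of consecutive cylinder numbers.
    """
    by_position = defaultdict(list)
    for cylinder, head, sector in defects:
        by_position[(head, sector)].append(cylinder)

    grouped = {}
    for position, cylinders in by_position.items():
        cylinders = sorted(set(cylinders))
        runs = [[cylinders[0]]]
        for cyl in cylinders[1:]:
            if cyl == runs[-1][-1] + 1:
                runs[-1].append(cyl)   # extends the current run
            else:
                runs.append([cyl])     # gap found, start a new run
        grouped[position] = runs
    return grouped


# Hypothetical entries, for illustration only.
sample = [(100, 3, 17), (101, 3, 17), (102, 3, 17), (250, 1, 5)]
```

With the sample above, group_defects(sample) puts cylinders 100 to 102
into a single run under head 3, sector 17, and leaves the entry at
cylinder 250 in a group of its own. A long run at one head and sector
position is the pattern described above.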
The other commands provided some information, but I could not use any of
it to help with the problem. ncheck -i /dev/rz0h provided the inode associated
with each file in the /space partition. icheck did not help in any way, and
fsck -o provided about the same information as icheck. I have a feeling I did
not know how to use them to receive the necessary information.
Thanks again for the help and sorry about the length of this summary.
Ron Bowman
Techno-Sciences, Inc.
rdbowma_at_tsi.clemson.edu
864-646-4028
Alpha EB 21164, 333MHz, 1 CPU
DU 4.0B (564) Patch #6 installed
Searchable Archive URLs:
http://www-archive.ornl.gov:8000/ (simple search)
http://www-archive.ornl.gov:8000/archive/power.htm (more detailed)
The following is a summary only site graciously maintained by Matt Moore.
http://www.bucks.edu/alpha-osf-managers/
Received on Tue Jun 09 1998 - 23:20:03 NZST