Hello
Once again, special thanks to Alan Nabeth for the answer.
I was able to back up most of the data. When I tried to back up
my entire partition, it gave an I/O error. But I am able to take
small backups of individual directories by first tarring each one and
then using the dd command to take the backup.
The information I got from Alan is worth sharing.
Original Question
=================
Hello
We have an XP1000 workstation running Tru64 UNIX V4.0G. We have two
SCSI HDDs (18 GB) and one TLZ10 tape drive, all on the same onboard
SCSI controller.
This morning, when a student tried to log in with his ID and password,
the system didn't allow him to do so. We then logged in through the
root account and, to our surprise, found that the student's .profile
file was empty. All the user accounts on this disk show the same
behavior. The console also logs the message "Deferred I/O error (at
block 5)....." very frequently when we try to access data on this disk.
We have faced a similar problem on the same HDD in the past. At that
time a Compaq engineer formatted the drive using scu and then ran the
verify command on it. We also had a recent system backup then, so we
just restored it. The system ran well for 3 months, and now we are
facing the same problem again.
The real problem is that this time we don't have a backup of this HDD,
and very critical data belonging to PhD students is on it. If the data
is lost, some of the students may lose all their projects.
My questions to all of you are:
1. If the HDD has bad blocks or is corrupted, what is the best and
most reliable method of backing it up?
2. What steps can I take to retrieve most of the data from this HDD?
I am scared of running the "verify" command or rebooting the system,
although the OS is not on this HDD.
Please reply ASAP
Regards
Vikram
Alan's Reply
============
This has to be said... If the data is as important as you
make it sound, you should be making backups of it.
That out of the way...
Unless explicitly requested by an application or mount
option, all writes to UFS are asynchronous through the
buffer cache: the application writes, the kernel copies the
data to a buffer, the application gets completion, and some
time later (often soon after) the data is written to disk.
With this disconnect of the application getting I/O
completion before the I/O actually starts, the application
can't get I/O status. So, the kernel keeps track and
writes messages when there is an error on such an I/O.
This is where the "deferred" message comes from. I
think this error is particular to NFS and UFS, but it
may happen on AdvFS as well.
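
To illustrate where a "deferred" error comes from, here is a toy
model of a write-behind cache in Python. It is only a sketch and not
how the Tru64 kernel is implemented; ToyDisk, WriteBehindCache and
the bad block number are all made up for the illustration.

```python
class ToyDisk:
    """In-memory stand-in for a disk; writes to bad blocks fail."""

    def __init__(self, bad_blocks=()):
        self.bad_blocks = set(bad_blocks)
        self.blocks = {}

    def write(self, block_no, data):
        if block_no in self.bad_blocks:
            raise OSError(f"I/O error at block {block_no}")
        self.blocks[block_no] = data


class WriteBehindCache:
    """Hypothetical cache: writes complete before any disk I/O starts."""

    def __init__(self, disk):
        self.disk = disk
        self.dirty = {}  # block number -> data not yet on disk
        self.log = []    # stands in for the console / messages file

    def write(self, block_no, data):
        # The application is told the write succeeded right here,
        # so it never sees the status of the real disk write.
        self.dirty[block_no] = bytes(data)
        return len(data)

    def flush(self):
        # Some time later the cache pushes dirty buffers to the disk.
        for block_no in sorted(self.dirty):
            try:
                self.disk.write(block_no, self.dirty[block_no])
            except OSError:
                # Nobody is waiting for this status, so log it and
                # keep the buffer for a later retry.
                self.log.append(f"Deferred I/O error (at block {block_no})")
            else:
                del self.dirty[block_no]


cache = WriteBehindCache(ToyDisk(bad_blocks={5}))
cache.write(4, b"good")
cache.write(5, b"at risk")  # still reports success to the caller
cache.flush()
print(cache.log)
```

Both writes return success to the caller; the failure for block 5
only shows up in the log after the flush.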
Normally, UFS just keeps the data around and if the disk
gets better, it will eventually complete. For most well
behaved SCSI disks, the operating system can ask the disk
to replace the block on a write and then retry it. However,
the command to replace a bad block is optional and not all
disks implement it.
Write failures to meta-data parts of the file system (the
data that describe where and what the data is), typically
cause the system or an AdvFS domain to panic, since the
file system is corrupt as a consequence of the failure.
As to your questions.
1. For a small number of bad blocks, the best thing to do
is make a normal backup, noting which files have errors.
If the disk supports the command to force replacement of
blocks, you can get the block numbers and replace them
yourself. That will eventually let you get a backup of
as much data as can be backed up. It wouldn't hurt to
make each pass with different media, just to have multiple
copies.
If the disk is getting progressively worse, you really only
want to read it as few times as possible. For this case
you want to consider using dd(1) to backup the partition.
Check the manual page to see if there is an option to
ignore I/O errors. You won't get useful data from them,
but it won't prevent the physical backup from completing.
It may print errors for the bad blocks, so you can note
them and translate them back to the affected files later.
In a bad enough failure about all you can do is ship the
disk off to a data recovery company and hope they can
read off whatever bits have survived the failure.
2. See the answer to #1.
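
The dd-style physical backup from answer 1 can be sketched in Python
as a single pass that skips unreadable blocks and records their
numbers. This is a minimal sketch in the spirit of the noerror and
sync conversions of POSIX dd; whether the Tru64 dd supports them
should be checked in its manual page. The function name, the paths
passed to it and the 512-byte block size are assumptions, and the
size lookup through lseek may not work on every raw device.

```python
import os


def copy_skipping_errors(src_path, dst_path, block_size=512):
    """Copy src to dst one block at a time, reading each block once.

    Blocks that fail to read are written as zeros so that offsets in
    the copy still match the source. Returns the list of block
    numbers (in units of block_size) that could not be read.
    """
    bad_blocks = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        # Assumption: the source reports its size through lseek.
        total = os.lseek(src, 0, os.SEEK_END)
        with open(dst_path, "wb") as dst:
            block_no = 0
            while block_no * block_size < total:
                offset = block_no * block_size
                want = min(block_size, total - offset)
                try:
                    data = os.pread(src, want, offset)
                except OSError:
                    # Note the block and carry on instead of aborting.
                    bad_blocks.append(block_no)
                    data = b""
                # Pad failed or short reads with zeros.
                dst.write(data.ljust(want, b"\0"))
                block_no += 1
    finally:
        os.close(src)
    return bad_blocks
```

The returned block numbers are relative to the start of whatever was
read, so they can be noted and translated back to the affected files
later.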
Rebooting will cause the data in the cache to go away, but
getting directly to it is also hard. You might try a physical
backup of the block device, instead of character device to see
if that picks up anything in the cache. You'll have to watch
the block sizes here. scu(8) has two commands for checking
whether the media is usable. One just reads, the other writes
and reads. I can't ever remember which is which. So, read
the help and the manual page before using either. One is
safe, the other not. Well, as safe as running a backup.
Any reading of some failing disks is enough to make things
go from bad to worse.
Some more information...
Block numbers reported by the file system are relative to
the offset of the partition. So, to get the disk LBN you
have to add the partition offset. For UFS, icheck(8) will
let you give it a list of block numbers and get the affected
parts of the file system. For inode numbers, ncheck will
let you back track to the affected file name(s).
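
The offset arithmetic above is simple enough to show as a worked
example. The partition offset used here is made up; the real value
would come from the disk label. The sketch also assumes that the
reported block number and the offset are in the same units.

```python
def fs_block_to_disk_lbn(fs_block, partition_offset):
    """Translate a partition-relative block number to a disk LBN.

    Assumes both values are expressed in the same units.
    """
    if fs_block < 0 or partition_offset < 0:
        raise ValueError("block numbers and offsets must be non-negative")
    return partition_offset + fs_block


# Hypothetical offset of the partition from the start of the disk.
PARTITION_OFFSET = 131072

# Block 5 is the one named in the console message quoted earlier.
print(fs_block_to_disk_lbn(5, PARTITION_OFFSET))  # prints 131077
```

Going the other way, from a disk LBN back to a partition-relative
block, is the same calculation with the offset subtracted.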
For AdvFS, I think the messages that it writes to
/var/adm/messages
for I/O failures give you the tag number and the command to run
to translate the tag to a file name.
Regards
Vikram
Received on Sat Jun 08 2002 - 10:48:10 NZST