SUMMARY: scu verify media freezes the machine from David Komanek on 1999-05-19 (tru64-unix-managers)

From: David Komanek <xdavid_at_aragorn.natur.cuni.cz>
Date: Wed, 19 May 1999 11:24:04 +0200 (MET DST)

Thanks for the replies, especially to:

Tom Blinn, Alan Davis and Alan Rollow.

Following their suggestions, I solved the problem using the 'diskx' utility
and 'scu' after remounting the root partition in r/w mode (scu> verify media
apparently hangs the machine when applied to the root partition while it is
mounted in r/o mode, or from multi-user mode, on my machine). After running
uerf -o full I saw that running diskx had caused some "soft errors" to be
repaired automatically. The partition now seems to be OK.

Thanks again to all who replied.

My original message and the replies follow.

Sincerely,

David Komanek

------------------------------------------------------------------------------

ORIGINAL MESSAGE:
>
> Hi all,
>
> Can anybody help me with the following?
>
> I'm running an AlphaServer 400 with DU 4.0d + patch kit #3.
>
> Today I experienced "I/O Errors" in one of the two swap files, so I
> went to single-user mode and tried
>
> scu -f /dev/rrz0b
> > verify media
>
> After about half of the verification, the process freezes the whole
> machine. The only way out is to reset it.
>
> The same happens in multi-user mode with the partition unmounted.
>
> Is there any other possibility to check for bad blocks, or another way of
> "reviving" the partition?

--------------------------------

REPLIES:

Tom Blinn:

To the best of my knowledge, the "scu" utility does NOT understand about
disk labels and disk partitions. It's dealing with the device at a very
low level.

You should be able to boot the system to single user mode, make the root
file system writable (using mount -u /), then cd into /sbin and remove any
symlink swapdefault that points to a swap partition (rm -f
/sbin/swapdefault). Once you do this, you can reboot the system and get
to single user mode WITHOUT any swap getting turned on, and you can verify
this with "swapon -s" (as root).
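
A minimal sketch of the two single-user steps described above, written as a
dry run so that nothing is changed unless you ask for it. The commands and
the /sbin/swapdefault path come from the text; the DRY_RUN variable and the
function names run and disable_default_swap are only illustrative.

```shell
#!/bin/sh
# Sketch only: prints the commands by default (dry run).
# Set DRY_RUN=0 and run as root in single-user mode to execute them.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper name; the two commands are the ones named in the text.
disable_default_swap() {
    run mount -u /                 # make the root file system writable
    run rm -f /sbin/swapdefault    # remove the symlink to the swap partition
}
```

Calling disable_default_swap with the default settings only prints the two
commands. After the reboot to single-user mode, "swapon -s" (as root) should
then show that no swap is active.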

You can then mount your local file systems with "bcheckrc" and then if
you've got the system exerciser software on your system, you can use
/usr/field/diskx (there is a good reference page) to run read-write checks
on the partition you think is getting errors (/dev/rrz0b). If in fact you
are getting errors, and you are lucky, you will be able to identify the
actual disk blocks that cause the errors, and map them out using "scu" so
they won't be accessed.

However, if the "scu" verify media command hangs on device rz0, then there
is a possibility that you've actually got a bad disk. On the other hand,
using "scu" on a disk that's got swap on it, while swap is active, may
well be what caused the system to hang. If I really want to use "scu" on
the system disk (I assume that rz0 aka DKA0 is your boot/system disk),
then I'd probably boot the system from the installation CD to do that
testing.

--
Alan Davis:
  Remove the swap partition from the /etc/fstab file, make sure that it's
not the one pointed to by /sbin/swapdefault, and reboot. This will take that
partition out of use. Boot the system to single-user mode using
        >>> boot -fl s
mount root r/w with
        # mount -u /
and mount /usr with
        # mount /usr
Then use scu on the partition.
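
The first check in these steps (that the partition is no longer referenced
from /etc/fstab or /sbin/swapdefault) could be scripted roughly as follows.
This is a sketch under assumptions: the function name check_swap_refs is
made up for illustration, it assumes the swap entry's first fstab field is
the block device (for example /dev/rz0b rather than the raw /dev/rrz0b),
and it reads the symlink target by parsing "ls -l" output.

```shell
#!/bin/sh
# check_swap_refs DEVICE [FSTAB] [LINK]
# Reports whether DEVICE is still listed in the fstab file or is the target
# of the swapdefault symlink. Returns 0 if no reference is found, 1 otherwise.
# Assumption: DEVICE is the block device name used in the first fstab field.
check_swap_refs() {
    dev=$1
    fstab=${2:-/etc/fstab}
    link=${3:-/sbin/swapdefault}
    found=0

    # Look for an fstab line whose first field is exactly the device.
    if [ -r "$fstab" ] &&
       awk -v d="$dev" '$1 == d { hit = 1 } END { exit !hit }' "$fstab"
    then
        echo "$dev is listed in $fstab"
        found=1
    fi

    # If the link exists, compare its target with the device.
    if [ -h "$link" ]; then
        target=$(ls -l "$link" | sed 's/.* -> //')
        if [ "$target" = "$dev" ]; then
            echo "$link points to $dev"
            found=1
        fi
    fi

    if [ "$found" -eq 0 ]; then
        echo "no references to $dev found"
    fi
    return "$found"
}
```

For example, "check_swap_refs /dev/rz0b" run before the reboot would show
whether either reference still has to be removed.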
--
Alan Davis again:
  If the partition that you are checking is swap1, the kernel already knows
about it at boot time because it is defined as the dumpdev. That's why I
suggested making sure it's not in the fstab or pointed to by
/sbin/swapdefault. If it's not swap1, it may just be that the I/O errors are
causing scu to hang.
  
 In my experience, if you have i/o errors on swap you should replace the
disk rather than depend on the bad block map and bad block replacement.
--
Alan Rollow:
The failure of verify suggests that your I/O errors may have
been more than just bad blocks.  I believe verify translates
to a series of commands to read large stretches of blocks,
possibly without transferring any data.  By using multiple
commands it won't tie up the bus too long.  Look at the error
log for the previous errors and see whether there was sense data for
the device or bus resets.  If you use uerf(8) to format the
log, you need to use the option "-o full" to get all the
information available.  DECevent is the better choice for
formatting the log when available.
You can get close to what a verify does by using dd(1)
to read from the raw device and write to /dev/null.  But
if the problem is more than bad blocks, dd(1) will probably
trigger the same problem.
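
A rough sketch of the dd(1) read pass described above. The function name
read_check and the 64k block size are arbitrary choices for illustration,
not something taken from the replies. As noted, if the fault is more than
bad blocks this may hang the machine just as verify did, so it is best
tried from single-user mode with the partition out of use.

```shell
#!/bin/sh
# read_check DEVICE [BLOCKSIZE]
# Reads DEVICE from start to end with dd(1) and discards the data.
# dd's own messages go to stderr; a one-line verdict goes to stdout.
# Returns 0 if dd completed without error, 1 otherwise.
read_check() {
    dev=$1
    bs=${2:-64k}    # assumed block size, chosen only for illustration
    if dd if="$dev" of=/dev/null bs="$bs"; then
        echo "read OK: $dev"
    else
        echo "read FAILED: $dev"
        return 1
    fi
}
```

On the system in question this would be invoked as
"read_check /dev/rrz0b", using the raw device named in the original message.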

Received on Wed May 19 1999 - 09:26:35 NZST
