SUMMARY: binlog entries - what do they mean from Kasperski Marcin on 1998-07-17 (tru64-unix-managers)

From: Kasperski Marcin <marcink_at_crit8.zti.crt>
Date: Thu, 16 Jul 1998 15:30:09 +0200

I posted a question about the mysterious uerf reports:

uerf on my machine (a Digital Unix 4.0c workstation) returns a huge number
of entries like the one below:

********************************* ENTRY 1.
----- EVENT INFORMATION -----
EVENT CLASS ERROR EVENT
OS EVENT TYPE 199. CAM SCSI
SEQUENCE NUMBER 48926.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Wed Jun 17 07:10:41 1998
OCCURRED ON SYSTEM osiw1
SYSTEM ID x00070016
SYSTYPE x00000000
----- UNIT INFORMATION -----
CLASS x0000 DISK
SUBSYSTEM x0000 DISK
BUS # x000A
                              x0290 LUN x0
                                        TARGET x2
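When uerf produces this many entries, it can help to check whether they all point at the same device. The following is a minimal sketch, assuming the uerf output has first been saved to a text file (for example with `uerf > uerf.txt`) and that each entry uses the brief layout shown above, with the value as the last field on the `BUS #`, `LUN` and `TARGET` lines. The function name `tally_cam_scsi` is hypothetical; it is plain POSIX sh and awk, not part of the uerf tooling.

```shell
#!/bin/sh
# Hypothetical helper: count CAM SCSI error entries in saved uerf output,
# grouped by the bus, LUN and target reported under UNIT INFORMATION.
# Assumes the brief entry layout shown above; adjust the patterns if your
# uerf output differs.
tally_cam_scsi() {
    awk '
        # A new entry starts with a line of asterisks followed by "ENTRY".
        /\*+ ENTRY/                   { cam = 0; bus = "?"; lun = "?" }
        # Only count entries whose OS event type is CAM SCSI.
        /OS EVENT TYPE/ && /CAM SCSI/ { cam = 1 }
        # The value is the last whitespace-separated field on each line.
        /BUS #/                       { bus = $NF }
        /LUN/                         { lun = $NF }
        /TARGET/ && cam               { n["bus " bus " lun " lun " target " $NF]++ }
        END { for (k in n) print n[k], k }
    ' "$@"
}
```

Usage: `tally_cam_scsi uerf.txt | sort -rn` lists the busiest bus/LUN/target combinations first; a single dominant target would support the disk-error theory below.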

Ken Krueger suggested that they may be caused by a known bug in AdvFS.
I am going to try his workaround:

This may be due to a known bug in advfsd (the AdvFS daemon). This
daemon is started each time the system boots (/sbin/init.d/advfsd)
and may begin to consume large amounts of CPU time as well as fill your
error logs. I think it goes out and does a disk check every 5 minutes.
Possible solutions are:
1) Remove the advfsd link in your /sbin/rc3.d directory so that the
   daemon never starts. The problem with this is that if you use the
   AdvFS GUI, you'll need to start the daemon first
   (/sbin/init.d/advfsd start) and then stop it
   (/sbin/init.d/advfsd stop) when you are done.
2) Put an entry in crontab that stops the daemon every so often. The
   problem with this is that it could kill the process while you are
   using the GUI. Another problem is that while the daemon is running,
   it will put many entries into the error log file.
3) Don't use the AdvFS GUI at all. Then option 1 can be used to remove
   the link and you don't have to worry about remembering to start or
   stop the daemon.

No word on when this will be fixed.
Ken
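Ken's option 1 amounts to deleting whichever start link in /sbin/rc3.d points at /sbin/init.d/advfsd. The exact link name is not given in the post, so the minimal sketch below, assuming a POSIX sh, removes any entry in the run-level directory whose name ends in `advfsd`. The function name `disable_advfsd_link` and its optional directory argument are hypothetical; check the names with `ls /sbin/rc3.d` before running it as root.

```shell
#!/bin/sh
# Hypothetical helper for option 1: remove the advfsd start link(s) from a
# run-level directory so the daemon is not started at boot.
# /sbin/init.d/advfsd itself is left in place, so the daemon can still be
# started and stopped by hand when the AdvFS GUI is needed.
disable_advfsd_link() {
    rcdir=${1:-/sbin/rc3.d}
    removed=0
    for link in "$rcdir"/*advfsd; do
        # An unmatched glob is left as a literal pattern; skip it.
        # (-h also catches a symbolic link whose target is missing.)
        [ -e "$link" ] || [ -h "$link" ] || continue
        echo "removing $link"
        rm -f "$link"
        removed=$((removed + 1))
    done
    # Return failure if nothing was found, so a caller can notice.
    [ "$removed" -gt 0 ]
}
```

For option 2, a root crontab line such as `0 * * * * /sbin/init.d/advfsd stop` would stop the daemon once an hour; the hourly schedule is only an example, and, as Ken notes, it may stop the daemon while the GUI is in use.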

Ronald D. Bowman suggested that the problem may be caused by a disk
error. I will check the disk in a few days, when I am able to shut
down my machine. He said the following:

        The errors you are seeing seem to indicate (from the first one) that
        there is a problem with the hard disk drive. The second one seems to
        indicate that maybe you are using dump to back up the disk, and
        something is not correct. I am not sure about this (and more
        qualified people will probably answer the question better), but what
        may be happening is that a file is in a sector that has a bad byte.
        The file is probably okay, but dump reads every byte of every sector,
        and therefore would find the problem bit. If you want to check the
        status of your disk, you can use scu. There have been previous
        summaries (some by me) that explain this.

        Here is part of one of mine:

        In order to keep people from using the partition where the errors
        were occurring, I unmounted it using #umount /space, where /space is
        the partition in question (we knew this since it failed during the
        dump, plus this is the partition to which we had just added about
        270 MB of files). Also, in order to use fsck, the file system must
        be unmounted.

        Then, using #scu -f /dev/rrz0h and scu> verify media, I found that 8
        blocks were unreadable. By accident, I discovered that running verify
        media without any block information checked the entire disk
        (fortunately, the only errors were in the /space partition).

        Then scu was used to reassign the blocks that were unreadable:
        scu> reassign lba ####, where #### is the block number reported by
        verify.

        We could do this with little worry about what happened to our data,
        since the partition in question holds software that we had added
        (and is thus easy to recover). Also, the errors did not show up
        until this latest installation of software, and NSR gave us a list
        of 6 files that had I/O errors associated with them. Therefore, the
        6 files listed by NSR were most likely the only ones in jeopardy.

        I then re-ran the scu verify command as scu> verify media starting
        ####, where #### was a block number just before the first reported
        error. No errors were reported, so I remounted /space using
        #mount /space.

        ---
        You may not have to unmount the file system, but it probably would
        not hurt to do it if you can. The scu man page may provide info on
        that.

        Also, in scu you can use "show defects all" to get a listing of all
        logged defects on the disk. "show defects grown" will give a listing
        of the defects that have been found since the disk was formatted.
        Also, I believe that sectors that are currently bad and contain a
        file will not be put on the grown list until the file is removed
        from that sector.

        I hope this helps, or, if it is way off base from the actual problem,
        that it at least provides some information you did not already know.

Received on Thu Jul 16 1998 - 16:10:25 NZST
