Thank you all for your useful responses. I didn't try every suggestion, but I
shall list all the answers here, since some of the others may work for someone
else.
Thank you to those of you who replied:
Adrian Ho <adrianho_at_nii.ncb.gov.sg>
Hellebo Knut <Knut.Hellebo_at_nho.hydro.com>
jpizarro_at_sonda.cl
Saul Tannenbaum <stannenb_at_emerald.tufts.edu>
John Stoffel <john_at_WPI.EDU>
What I did:
I copied the disk contents to a spare disk, removed the fileset and domain,
recreated them and copied the contents back. This allowed me to delete the
files causing the trouble, and brought to light 6 files with I/O read errors.
I guess this should make me pretty suspicious of the disk hardware (does
anyone have any idea how suspicious? This disk is probably still under
warranty).
What people suggested:
(This is quite long, everyone had different things to say)
Hellebo Knut <Knut.Hellebo_at_nho.hydro.com> wrote:
>We're having the same kind of trouble at our site. The problem has been
>transferred to engineering in U.S of A. For now: DO NOT TRY TO
>REMOVE/OVERWRITE THE SUSPICIOUS FILE(S) as this will cause an immediate
>crash. Hopefully engineering will come up with something very soon.
The symptoms he describes are the same as ours. We could read the files and
change information in the inode (e.g. chown), but any attempt to write to a
file (either truncating it or extending it) or to remove it caused an instant
panic.
A few hours later, just as my disk copies were finishing, DEC came back
to him with the following procedure:
>Here is a possible way to remove the suspicious file(s).
>
>1. Shutdown singleuser
>2. Do a /usr/field/msfsck and note the tagnumbers for the 'bad' files. If
>no bad files, go to 6.
>3. Do '/usr/field/tag2name <mount-point>/.tags/<tag-#>' on the tags noted.
>4. rm the files from the /usr/field/tag2name command
>5. go to 2.
>6. reboot.
>
>This procedure was given to us by DEC.
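The control flow of the procedure above (check, remove what was found, check again until clean) can be sketched as below. This is only an illustration of the loop, not something DEC supplied: `find_bad_tags` and `remove_tag` are hypothetical callables standing in for running msfsck and for the tag2name plus rm steps done by hand, and the `max_passes` safety limit is my own addition. Nothing here parses real msfsck output, since its format is not shown in this thread.

```python
from typing import Callable, List


def clean_until_quiet(
    find_bad_tags: Callable[[], List[int]],
    remove_tag: Callable[[int], None],
    max_passes: int = 10,
) -> int:
    """Repeat steps 2-5 until a check reports no bad tags.

    Returns the number of passes in which bad tags were found and removed.
    Raises RuntimeError if all max_passes checks reported bad tags.
    """
    for passes in range(max_passes):
        bad = find_bad_tags()      # step 2: note the tags of the bad files
        if not bad:
            return passes          # nothing bad: go to step 6 (reboot)
        for tag in bad:
            remove_tag(tag)        # steps 3-4: resolve the tag and remove the file
        # step 5: go back to step 2
    raise RuntimeError("bad tags still reported after %d passes" % max_passes)
```

Passing the two steps in as callables keeps the loop testable with made-up tag lists before it goes anywhere near a real domain.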
I shall run this procedure over all my advfs disks this evening to see if
there are any more bad files lying around waiting to pounce.
Saul Tannenbaum <stannenb_at_emerald.tufts.edu> wrote:
>It would appear to me that your problem is a hardware error on
>the disk. This is based on the following lines:
>
>>advfs I/O error: setId 0x2f4e9138.00055363.1.8001 tag 0x00000001.8001u
>>page 5392
>> vd 1 blk 1338256 blkCnt 128
>
>A look at your error log would confirm this.
>
>AdvFS is admitted to be highly sensitive to underlying hardware problems.
>This is one of the few AdvFS problems I've not run into. I'd suggest
>dumping and restoring the filesystem in question and getting the disk
>replaced.
I guess I should be suspicious of the disk.
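For anyone wanting to collect these messages from a log, here is a minimal sketch that pulls the fields out of the I/O error quoted above. The field names and their order come from that one message only; I am assuming other AdvFS I/O errors look the same. The 512-byte block size used for `length_bytes` is also an assumption, not something the message states.

```python
import re
from typing import Dict, Optional

# Field layout taken from the single message quoted in this thread.
IO_ERROR = re.compile(
    r"setId\s+(?P<set_id>\S+)\s+tag\s+(?P<tag>\S+)\s+"
    r"page\s+(?P<page>\d+)\s+vd\s+(?P<vd>\d+)\s+"
    r"blk\s+(?P<blk>\d+)\s+blkCnt\s+(?P<blk_cnt>\d+)"
)

BLOCK_SIZE = 512  # ASSUMPTION: blk and blkCnt count 512-byte blocks

# The message from this thread, with the mail quoting (">>") stripped.
SAMPLE = (
    "advfs I/O error: setId 0x2f4e9138.00055363.1.8001 tag 0x00000001.8001u\n"
    "page 5392\n"
    " vd 1 blk 1338256 blkCnt 128"
)


def parse_io_error(text: str) -> Optional[Dict[str, object]]:
    """Return the fields of an advfs I/O error message, or None if absent."""
    match = IO_ERROR.search(text)
    if match is None:
        return None
    fields: Dict[str, object] = {
        "set_id": match["set_id"],
        "tag": match["tag"],
    }
    for name in ("page", "vd", "blk", "blk_cnt"):
        fields[name] = int(match[name])
    # Size of the failed transfer under the block-size assumption above.
    fields["length_bytes"] = int(match["blk_cnt"]) * BLOCK_SIZE
    return fields
```

Calling `parse_io_error(SAMPLE)` gives a dictionary keyed by field name; the volume (`vd`) and block number are the parts worth quoting when asking for a disk replacement.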
>You don't say which version of OSF you're running. I'd suggest upgrading
>to 3.2c, or getting all AdvFS patches for your current version. It makes
>AdvFS significantly more stable, though I don't know if it addresses the
>instability in the face of bad hardware problem.
I am currently running version 3.2a. I will be upgrading to 3.2c at the
first opportunity (i.e. when the disks arrive here).
John Stoffel <john_at_WPI.EDU> sent me the following note he had saved. Some
portions of this were in the archives but the full note would have been
useful to me, so I am including it here.
>From: tl_at_ae.chalmers.se (Torbj|rn Lindgren)
>Subject: Re: [HELP] How to restore advfs inconsistency?
>
>In article <klaus.801304250_at_manuel.physik.fu-berlin.de>,
>Wolfram Klaus <klaus_at_manuel.physik.fu-berlin.de> wrote:
>>Of course our backup is not too new. So we are desperately seeking
>>a way to recover from the inconsistency without losing too much
>>data.
>
>You don't say which version of OSF/1 you are running!
>
>Anyway, you could try /usr/field/msfsck (repeat a couple of times) and
>then /usr/field/vchkdir (repeat a couple of times). If you run them
>without parameters you get a small help-note (that isn't correct in
>all details!).
>
>I have appended a tech-note containing more information about
>/usr/field in OSF/1 3.0 at the end. The syntax given there seems to be the
>correct one for both 2.x and 3.0, and in both cases the help given by
>the programs is slightly wrong (but close enough that you could
>probably figure it out without the tech-note).
>
>Another way is using vdump/vrestore. At least in OSF/1 2.x you
>sometimes had to do the vrestore to a UFS filesystem to get rid of
>the inconsistencies!
>
>-
>
> ADVFS Utilities in /usr/field
>
>NOTE: This is a description of the ADVFS v3.0 version of these programs.
>Earlier versions may not support all these features (or they may not
>even exist on earlier version of ADVFS).
>
>msfsck
>
>This is the ADVFS bitfile-subsystem metadata structure checker. It verifies
>low-level meta-structures like the BMT, storage bitmap, and tag directories.
>
>The file domain must be inactive to run msfsck. You also need at least
>one mounted fileset (this is because msfsck uses the .tags directory in the
>fileset to access the metadata).
>
>To run it, first 'cd' to the mount point of a mounted fileset.
>Then, run "/usr/field/msfsck -t <domain-name>".
>
>vchkdir
>
>This is the ADVFS directory structure checker and fixer. It verifies that
>the directory structure is correct and that all directory entries reference
>a valid file (tag) and that all files (tags) have a directory entry. The
>-f flag will create symlinks in "<mount-point>/lost+found/" to all files (tags)
>that do not have a directory entry; these are called lost files. The -f
>flag also removes 'dead' directory entries (ones that do not point to valid
>tags).
>
>The -d option will delete lost files and it will delete corrupted
>directories. Note that you may need to run vchkdir several times
>to clean up a fileset.
>
>The file domain must be inactive to run vchkdir. The fileset to be
>checked/fixed must be mounted.
>
>To run it do "/usr/field/vchkdir <mount-point>".
>
>shfragbf
>
>This program displays information about a fileset's Fragment File. The
>Fragment File contains file fragments less than 8K. These are used to minimize
>wasted disk space due to internal file fragmentation (for example, ADVFS will
>store a 1 byte file in a 1K fragment rather than in an 8K page). The Fragment File
>is always tag (inode) 1 in a fileset and can be accessed via the fileset's
>.tags directory.
>
>To run it do "/usr/field/shfragbf <mount-point>/.tags/1".
>
>
>tag2name
>
>This program will display the full pathname of a file when only the
>file's tag (inode) number is known. This is mainly a debugging aid
>when msfsck or vchkdir report errors for specific tags.
>
>To run it do "/usr/field/tag2name <mount-point>/.tags/<tag-number>".
>
>
>switchlog
>
>This program provides the capability to resize the transaction log
>or to move it to a specific volume in a domain. NOTE: To date there
>has been no reason to change the size of the transaction log so we
>do not recommend doing this.
>
>To move the transaction log to another disk do
>"/usr/field/switchlog <domain-name> <new-volume-number>".
>
>Use showfdmn to determine the current volume that contains the log
>and to determine a suitable target volume.
>
>switchlog can be used on an active system.
>
>
>
>vods
>
>Displays the BMT on-disk structure. It is beyond the scope of this note
>to describe this utility as it requires intimate knowledge of the
>BMT structure to use and interpret the output of 'vods'. It is mainly
>a low-level debugging tool.
>------- end of forwarded message -------
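To keep the invocation syntax from the tech-note in one place, here is a small sketch that builds the command lines as strings. The templates are copied from the note above; the function name `field_command` and the parameter names are mine, and the values used in the usage example are made up. It only formats strings and does not run anything.

```python
# Command templates as given in the forwarded tech-note.
# Note: msfsck is meant to be run after cd'ing to the mount point of a
# mounted fileset; that step is not captured by these templates.
FIELD_COMMANDS = {
    "msfsck": "/usr/field/msfsck -t {domain}",
    "vchkdir": "/usr/field/vchkdir {mount_point}",
    "shfragbf": "/usr/field/shfragbf {mount_point}/.tags/1",
    "tag2name": "/usr/field/tag2name {mount_point}/.tags/{tag}",
    "switchlog": "/usr/field/switchlog {domain} {volume}",
}


def field_command(tool: str, **params: object) -> str:
    """Fill in the template for one /usr/field tool.

    Raises KeyError for an unknown tool or a missing parameter.
    """
    return FIELD_COMMANDS[tool].format(**params)
```

For example, `field_command("tag2name", mount_point="/mnt", tag=42)` builds the lookup for a hypothetical tag 42 on a fileset mounted at `/mnt`.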
jpizarro_at_sonda.cl asks:
> I see from your [Q] e-mail that you have two cpus in your
> machine. Did you apply the mandatory patches for SMP machines?
>
>===============================================================================
>
>/usr/sys/BINARY/lockprim.o
>CHECKSUM: 18514 29 RCS: 1.1.22.3
>---------------------
>
>Patch ID: OSF320-036, OSF320-135
>
>MANDATORY Patch for all SMP configurations.
>
>
>Problem 1: [Patch ID: OSF320-036] (29197)
>**********
>
>On certain error conditions, the SMP locking primitives do not work properly.
>
>This problem can cause data corruption or system panics in various parts
>of the operating system.
>
>
>Problem 2: [Patch ID: OSF320-135] (HPAQ65306)
>**********
>
>This is a MANDATORY patch for multiprocessor systems with lockmode > 1.
>
>When X.25 was started on an AXP 2100 multiprocessor machine,
>it immediately panicked with the message: tb_shoot ack timeout
>(function 16 in the following stack trace).
>
>===============================================================================
>
> Maybe this can help.
I shall check into this. We have applied some patches but I'm not sure which.
The first reply I received, below, does not seem to apply in this case.
Adrian Ho <adrianho_at_nii.ncb.gov.sg> wrote:
>We had quite a few problems with ADVfs last year -- unfortunately, I've
>lost the logs for that period, so I'm not sure if it's the same problem.
>
>Basically, the problem we had was that we'd used the wrong disk type to
>label our RAID drives (RZ28's on a StorageWorks chassis -- we used "RZ28"
>instead of the correct "SWXCR"). The result: OSF/1 thought the drive was
>bigger than the ADVfs filesystem actually had space for, so that when the
>drive got really full (~96%), it started writing all over the ADVfs
>information blocks. Blammo -- instant ADVfs exception.
>
>Since you don't seem to be using the same disk chassis, my observations
>above are included FWTW.
Thank you all for your help.
Hoping for a crash free day.
--
Clare West, Rm 107, Ext 8266
clare_at_cs.auckland.ac.nz
Received on Wed Aug 23 1995 - 00:07:33 NZST