SUMMARY: Weird AdvFS crash & AdvFS verification errors.

From: Thomas Leitner <tom_at_finwds01.tu-graz.ac.at>
Date: Tue, 03 Mar 1998 23:10:56 +0100 (MET)

Hello,

This is a summary of my two postings "Weird AdvFS crash under 4.0D ...."
and "AdvFS verification errors: Should I be worried ?".

In a nutshell: The AdvFS verification errors are definitely something
to worry about. They may even have caused the crash (though this is not
100% sure). There is currently no fix under DU 4.0D for these AdvFS
verification errors other than to re-build the domain. The Digital
AdvFS wizards have this on their list, though.

Well: I've rebuilt the root_domain, usr_domain and home domain of
our main server with only a short downtime by copying the root_domain
to a spare disk, booting from it and restoring it back to the original
system disk. I did the same with the usr_domain in single-user mode. The
home domain was too large for this, so I had to do a full tape backup and
restore during the night (from home actually).

Everything seems to be well now, apart from one minor glitch: When
verifying the newly created root_domain with "verify -r" I still
get these "check_tagdir_page: in-use on-disk tag does not have matching
in-mem tag hdr" errors which do *not* show up when booting from
another disk and verifying this particular domain. Dr. Tom Blinn
forwarded this to the responsible people. See also his explanation
below.

So thanks to: "Dr. Tom Blinn, 603-884-0646" <tpb_at_zk3.dec.com>
              Oisin McGuinness <oisin_at_sbcm.com>
              Don Rye <rye_at_jtasc.acom.mil>

Here are their replies. I've omitted my original postings for the sake
of brevity.

Tom

--------------------------------------------------------------------------
From: Oisin McGuinness <oisin_at_sbcm.com>

We had some AdvFS crashes on v3.0 and v3.2 machines (some caused by
disk problems it is true). Those caused by AdvFS problems were alleviated
by running the utilities msfsck and vchkdir. From the descriptions I have of these, it
seems that the functionality of both is now part of verify, although vchkdir is still in /sbin/advfs,
so maybe it still performs a useful function. The syntax for running it is:

vchkdir [-d] [-f] mount-point

(The file set must be mounted, but inactive. -d and -f seem to have the same meanings as for verify.)

The (underground) documentation says "Note that you may need to run
vchkdir several times to cleanup a fileset".

We have not had any crashes on AdvFS on our 3.2G machines after installing all the
AdvFS patches, and running defragment religiously. Our 4.0D servers, with several
file systems, have not yet crashed in some months of file server service..... (Touch lots of wood.)

I would be very interested in any info you get from Digital about your problem.

Thanks in advance.

--------------------------------------------------------------------------
From: Oisin McGuinness <oisin_at_sbcm.com>
Subject: Re: AdvFS verification errors: Should I be worried ?

About vchkdir: it is sitting on our 4.0D machines, but checking the dates it does look
as if it has sat around since they were 4.0B (I did installupdate rather than a clean install).
Sorry for any confusion!

So presumably all the functionality of vchkdir is now in verify, though it is interesting
that the verify man page does not mention running it more than once, as the instructions
for vchkdir did.

Thanks for your reply, I'm looking forward to the summary.

--------------------------------------------------------------------------
From: rye <rye_at_jtasc.acom.mil>
Subject: Re: Weird AdvFS crash under 4.0D ....

Thomas..

  As I recall, from having pain with HSZ70's and AdvFS, some of the patches that fixed all sorts of
AdvFS errors are not in the 4.0D release. I may be wrong on this but I'm waiting on the first patch
release before moving my Intraserver based system up to 4.0D.

Don

--------------------------------------------------------------------------
From: "Dr. Tom Blinn, 603-884-0646" <tpb_at_zk3.dec.com>
Subject: Re: Weird AdvFS crash under 4.0D ....

As you say,

> ** There aren't any I/O errors logged in uerf !!! Note that the disk
> in question is *NOT* the system disk.

but you also say

> Note that these disks are external fast/narrow scsi disks connected
> to an IntraServer ITI-3140 ultra/wide controller.

What makes you think that IntraServer's driver would ever log anything to
the error log? But, as you note, there is little evidence that the disks
are in fact experiencing hardware errors.

You are assuming that the error numbers reported by the AdvFS code are the
same as the error numbers present in errno.h -- but that doesn't have to be
the case, since the AdvFS code is entirely inside the kernel, and errno.h is
for error numbers communicated by the kernel back to library code; I'd have to
look at the AdvFS code that generates those messages to tell whether those
are the same error numbers as used in errno.h, but I'd be surprised if they
were.

It's possible that there is a false error return from the IntraServer
ITI-3140 controller code into the SCSI subsystem, but I would expect it
to result in a SCSI CAM error being reported, and you don't seem to have
those.

In any case, it looks like the AdvFS code got into major trouble using the
disks connected to the IntraServer ITI-3140, and that led to corruption
in the domain, and that led to the system panic.

I wish I could tell you that it's a known problem and will be easy for us
to deliver a fix, but I suspect it's not a known problem and that it will
not be easily fixed, especially if it can't be reproduced on systems that
don't have the IntraServer ITI-3140 hardware and software.

I'll pass your message along to our file systems product manager (who is
always interested in the things our customers are doing with the software)
and to the team leader for the AdvFS team.

Tom
 
--------------------------------------------------------------------------
From: "Dr. Tom Blinn, 603-884-0646" <tpb_at_zk3.dec.com>
Subject: Re: AdvFS verification errors: Should I be worried ?

I meant to recommend that you do what you have done; in fact, I probably
should recommend to EVERYONE on the list that if they are going to update
to V4.0D, they run AdvFS verify on all the domains before updating, because
the new code is better in some cases about catching errors and will panic
the domain or the system when they are found. These could have been there
all along (I sure hope they were).

You may need to rebuild the domains and restore them, either by copying from
disk to disk (if you have a suitable spare disk) using vdump/vrestore, or by
backing up to tape and restoring.

I don't know the definitive answer, but I'll pass this message along as
well.

Sigh..

Tom
 
--------------------------------------------------------------------------
From: "Dr. Tom Blinn, 603-884-0646" <tpb_at_zk3.dec.com>
Subject: Re: Weird AdvFS crash under 4.0D ....

Tom, I ran your second message (about the verify errors) past our AdvFS
wizards, and they had this to say:

> > As a result of my last advfs crash, I did a verify on all of my AdvFS
> > partitions and found that almost all of them contained errors like:
> >
> > Checking tag directories ...
> > check_tagdir_page: bad allocated tag count
> > set tag: -2.0 (0xfffffffe.0x00000000)
> > tag: 1.32769 (0x00000001.0x00008001)
> > tag directory page: 18
> > real count: 957, expected count: 954
> > check_tagdir_page: bad free tag count
> > set tag: -2.0 (0xfffffffe.0x00000000)
> > tag: 1.32769 (0x00000001.0x00008001)
> > tag directory page: 18
> > real count: 65, expected count: 68
> >

This is the second time in the last two months I've seen this problem.
Currently there is no way to fix it. I would suggest getting a vdump of this
domain before this page fills up, as it will CRASH at that time.

This is on our list of things to fix in the new 'fixer tool'.
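
The two count mismatches quoted above are consistent with each other, which
is easy to check by hand or with a few lines of Python. This is only a sketch
of the arithmetic: the function name is made up for illustration, the numbers
are copied from the verify output for tag directory page 18, and "real" and
"expected" are simply the labels verify prints (which of the two the kernel
trusts is not stated here).

```python
def tag_count_check(real_alloc, expected_alloc, real_free, expected_free):
    """Summarize one pair of 'bad allocated/free tag count' reports for a page."""
    return {
        # Allocated plus free should give the same page total either way.
        "real_total": real_alloc + real_free,
        "expected_total": expected_alloc + expected_free,
        # How far the two views of the page disagree.
        "alloc_delta": real_alloc - expected_alloc,
        "free_delta": real_free - expected_free,
    }


# Numbers copied from the verify output for tag directory page 18.
report = tag_count_check(real_alloc=957, expected_alloc=954,
                         real_free=65, expected_free=68)
print(report)
```

Both totals come to 1022, and the disagreement is exactly three tags in each
direction: three tags are counted as allocated by one view and as free by the
other.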

> > The root partition additionally has:
> >
> > Checking tag directories ...
> > check_tagdir_page: in-use on-disk tag does not have matching in-mem tag hdr
> > set tag: 1.32769 (0x00000001.0x00008001)
> > tag: 489.33517 (0x000001e9.0x000082ed)
> > tag directory page: 0
> > tag map entry: 489
> > seqNo: 0x82ed (33517)
> > vdIndex: 1
> > bfMCId (page.cell): 83.4
> > check_tagdir_page: in-use on-disk tag does not have matching in-mem tag hdr
> > set tag: 1.32769 (0x00000001.0x00008001)
> > tag: 840.33611 (0x00000348.0x0000834b)
> > tag directory page: 0
> > tag map entry: 840
> > seqNo: 0x834b (33611)
> > vdIndex: 1
> > bfMCId (page.cell): 128.22

Another fairly common corruption, this is on our list for the fixer as
well.
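
When verify is run over many domains, it can help to boil the output down to
a count per message plus the list of affected tags. The following Python
sketch assumes the output looks exactly like the excerpts quoted in this
thread (one "check_tagdir_page: ..." line per error, followed by a
"tag: N.S (0x...)" line); other checks in verify may print differently. The
function name and the SAMPLE text are illustrative only.

```python
import re
from collections import Counter

# Excerpt copied from the verify output quoted above (detail lines trimmed).
SAMPLE = """\
check_tagdir_page: in-use on-disk tag does not have matching in-mem tag hdr
set tag: 1.32769 (0x00000001.0x00008001)
tag: 489.33517 (0x000001e9.0x000082ed)
tag directory page: 0
check_tagdir_page: in-use on-disk tag does not have matching in-mem tag hdr
set tag: 1.32769 (0x00000001.0x00008001)
tag: 840.33611 (0x00000348.0x0000834b)
tag directory page: 0
"""

ERROR_RE = re.compile(r"^check_tagdir_page: (.+)$")
TAG_RE = re.compile(
    r"^tag: (-?\d+)\.(\d+) \(0x([0-9a-f]{8})\.0x([0-9a-f]{8})\)$")


def summarize(text):
    """Return (Counter of error messages, list of (tag, seq, consistent))."""
    counts = Counter()
    tags = []
    for raw in text.splitlines():
        line = raw.strip()
        m = ERROR_RE.match(line)
        if m:
            counts[m.group(1)] += 1
            continue
        m = TAG_RE.match(line)
        if m:
            num, seq = int(m.group(1)), int(m.group(2))
            # Cross-check the decimal fields against the 32-bit hex fields.
            consistent = ((num & 0xFFFFFFFF) == int(m.group(3), 16)
                          and seq == int(m.group(4), 16))
            tags.append((num, seq, consistent))
    return counts, tags


counts, tags = summarize(SAMPLE)
for message, n in counts.items():
    print(n, message)
for tag in tags:
    print(tag)
```

The "set tag:" and "tag directory page:" lines are deliberately ignored by
the tag pattern, so only the per-file tag of each error is collected.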

> > * Should I be worried?

Yes, vdump and vrestore the fileset to a new domain ASAP.

> > * /sbin/advfs/verify did not fix these errors even with the -f
> > and -d flags. How can I fix them without re-building the domain
> > and restoring the files from tape?

At this time there is no tool. You would have to go in by hand and edit
metadata, a VERY VERY DANGEROUS thing to do.

------------------

So, in a nutshell, the strong recommendation is to go the backup/restore
route, or if you've got the disk capacity, reconstruct on-line by piping a
vdump into a vrestore.
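
As a rough illustration of the disk-to-disk route, the Python sketch below
only builds and prints the pipeline, it does not run anything. The flags
(vdump -0 -f -, vrestore -x -f - -D) are as I recall them from the vdump(8)
and vrestore(8) reference pages and should be checked against the man pages
on your release. The mount points are hypothetical, and the target domain and
fileset are assumed to have been created and mounted already.

```python
import shlex


def vdump_to_vrestore(src_mount, dst_dir):
    """Build (not run) a pipeline copying one fileset into another directory."""
    # Level 0 dump of the source fileset, written to standard output.
    dump = ["vdump", "-0", "-f", "-", src_mount]
    # Extract from standard input into the destination directory.
    restore = ["vrestore", "-x", "-f", "-", "-D", dst_dir]
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in cmd) for cmd in (dump, restore))


# Hypothetical mount points, for illustration only.
pipeline = vdump_to_vrestore("/usr", "/mnt/new_usr")
print(pipeline)
```

Printing the command first makes it easy to review the source and
destination before anything is typed at a root prompt.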

Tom
 
--------------------------------------------------------------------------
From: "Dr. Tom Blinn, 603-884-0646" <tpb_at_zk3.dec.com>
Subject: Re: Weird AdvFS crash under 4.0D ....

Tom, I'll pass along your question about the errors reported by verify -r, that is,
the "in-use on-disk tag does not have matching in-mem tag hdr" errors; this
may well be OK, since when the domain is open (the root fileset is mounted)
there might be tags in the in-memory information that aren't represented on
the disk, and that might be OK, because they may simply represent files that
are so new they are still "in transit"; or they might represent things that
are in the process of being deleted. I simply don't know -- but I can see
how the error messages would cause concern. I don't see anything in the man
page for the verify command that makes this clear one way or the other. It
is certainly plausible that some metadata in memory is inconsistent with the
metadata on the disk when a fileset is open, which is a reason why normally
you have to have all the filesets unmounted.

I've got the full AdvFS documentation, and it's not clear there, either.

I'll let you know what I learn; I suspect this is something that just isn't
made clear in the documentation, but isn't really an error.

Tom

--------------------------------------------------------------------------
T o m   L e i t n e r                       Dept. of Communications
                                            Graz University of Technology,
e-mail    : tom_at_finwds01.tu-graz.ac.at    Inffeldgasse 12
Phone     : +43-316-873-7455                A-8010 Graz / Austria / Europe
Fax       : +43-316-463-697
Home page : http://wiis.tu-graz.ac.at/people/tom.html
PGP public key on : ftp://wiis.tu-graz.ac.at/pgp-keys/tom.asc or send
mail with subject "get Thomas Leitner" to pgp-public-keys_at_keys.pgp.net
--------------------------------------------------------------------------
    Before we have the paperless office, we have the paperless toilet!
Received on Tue Mar 03 1998 - 23:11:20 NZDT
