First, many thanks to (in no particular order):
alan_at_NABETH.CXO.DEC.COM
chris_at_lagoon.meo.dec.com (Chris Jankowski)
Richard L Jackson Jr <rjackson_at_portal.gmu.edu>
Jeff Finkelstein <finkels_at_alf.dec.com>
Gert.Maaskant_at_nedlloyd.nl (Gert Maaskant)
Bob Haskins <haskins_at_myapc.com>
Ry Jones <RJONES_at_hbsi.com>
bouchard_l_at_decus.fr (Louis Bouchard - Bouygues Telecom)
Martin Moore <martin_at_jerry.alf.dec.com>
My apologies if I missed anybody. Particular thanks to Chris Jankowski.
This summary interweaves all the information gathered with some
additional details to frame the problem. It is only partial, as the
root problem has not yet been resolved, and it is going to be a bit long.
My original questions with short answers embedded:
> We have had two panics which resulted in corrupted data, some details
> are attached. My questions:
>
> Has anybody else seen data corruption in a panic?
Yes.
> Can anybody provide some detail as to why this may occur, how it can
> be detected (after the panic), and how it can be prevented?
Some theoretical details are attached.
Exactly what happened to us is still unknown and in the
hands of Digital engineering at this time.
Some suggestions on integrity validation and some preventative
techniques are also attached.
> Has anybody else seen a system hang during the 'syncing disks...'
> phase of a panic?
Yes, with 2100's; the solution is to 'halt' then crash.
No suggestions were provided for the 8400 (no halt button).
This is a particularly severe problem in its ramifications.
> [...] is Unix flawed in its write data caching?
Supposedly not, details much further below.
> Details:
>
> We are running an 8400 on DU 3.2d-1 with all disks under advfs in
> raidsets behind kzpsa's (a10) on hsz40's (v2.5).
>
> Since February we've had six panics 'simple_lock: time limit exceeded'
> which seem to be triggered by tape errors.
> [...]
This is an extremely complicated series of problems; the panics
(just from the most recent of the two events) include 3 different
types within a few hours:
%%% Wed Apr 24 05:00:00 AKDT 1996 %%%
(1) simple_lock: time limit exceeded
pc of caller: 0xfffffc00004c172c
lock address: 0xfffffc007f8fc7d0
current lock state: 0x00000000004c1605 (cpu=0,pc=...
panic (cpu 0): simple_lock: time limit exceeded
syncing disks...
------------------------------------------------------------------------
[...system rebooting, processing rc3.d...]
-> /sbin/rc3.d/S90schedule start
[...]
***
*** POLYCENTER Scheduler has STARTED all requested processes
*** Startup script procedure exiting with success.
***
SCHED-I-PARTOPENREQ, Partition (ROOT_GLACIER) was requested to Open
(2) trap: invalid memory read access from kernel mode
faulting virtual address: 0x0000000000000015
pc of faulting instruction: 0xfffffc00003f4018
ra contents at time of fault: 0xfffffc00003f3ac4
sp contents at time of fault: 0xffffffffb8342ad0
panic (cpu 0): kernel memory fault
syncing disks... done
------------------------------------------------------------------------
%%% Wed Apr 24 10:00:00 AKDT 1996 %%%
(3) # ADVFS EXCEPTION
Module = 2, Line = 1356
panic (cpu 0):
syncing disks... done
------------------------------------------------------------------------
It's fairly obvious that the first panic (simple_lock) either directly or
indirectly caused a large degree of corruption which led to the subsequent
panics. The corruptions included:
1.1 /etc/auth/system/ttys
This file was corrupted causing telnet logins to fail until recovered.
Note, CSC said this file often gets corrupted.
1.2 /var/sched/tmp/sched_22-Mar-96_05:32:13.exp
This directory was corrupted, access ('ls -l') caused panic type (2).
1.3 /usr/bin/test
/usr/bin/tsort
/var/sched/help/<something>
These showed up in msfsck, and 'vchkdir /var' crashed the system.
Domain containing /var and /usr required restoration.
1.4 oracle (/u??/ORACLE/FINP/data/?) FBGENL and FGBTRND
There were corrupted rows in both tables which showed up
with an attempted full export of the Oracle database.
No corruption was apparent at the file system level.
Required action was to recover and roll-forward Oracle database.
1.5 /users/sx/sxcnb99/ipl/now/SYS1.PROCLIB_RDR400
/users/sx/sxcnb99/ipl/now/SYS3.PROCLIB_CAS
Both showed up in find as 'bad status'; apparently mapped to
the directory but not really there.
1.6 /oraalogs/ORACLE/FINP/arch/*41*.dbf (or possibly 40)
An attempt to delete this file caused panic (3). On reboot,
no filesets in that domain could be mounted (restore required).
The details above from the 4/24 event were more severe than our 2/15
event. During the 2/15 event there was no obvious corruption (the system
rebooted and the databases restarted). However, there was an Oracle index
corruption comparable to the data corruption in 1.4 above which later caused
Oracle to crash and required a restoration and roll-forward.
It is very easy to get lost in the details of this. I'll attempt to
break it into distinct problems:
p#1 simple_lock: time limit exceeded... what is it and how to avoid
p#2 trap: invalid memory read access from kernel mode
p#3 # ADVFS EXCEPTION
p#4 what caused the corruption
p#5 how can one tell if something was corrupted in a panic
p#6 how can one prevent things like this
------------------------------------------------------------------------------
p#1 simple_lock: time limit exceeded... what is it and how to avoid
In our case (and at least two other sites) these result from soft tape
errors from one of our TZ87's. We applied a third patch from Digital
to cam_tape.o last week to help prevent these (we had 3 panics
of this type since 04/24, no corruption with those three).
There is also a series of patches for this problem under DU 3.2c.
If you get a simple_lock panic and its analysis looks something like this:
simple_lock: time limit exceeded
pc of caller: 0xfffffc00004c168c
[...]
root_at_glacier> dbx -k /vmunix
(dbx) 0xfffffc00004c168c/i
[ctape_close:2674, 0xfffffc00004c168c] bsr r26, simple_lock(line 262)
You may have the same problem and require patches... contact CSC.
One site avoided the problem by making the tape errors go away; their
solution was to replace the controller board in their TZ87.
Field service re-seated the controller board in our offending TZ87
(only one of 4 generating errors) last week, and its error volume
dropped from 20-100 errors in a 6 hour heavy-use period to 1 or 2 in a
comparable window.
These are not the only type of 'simple_lock' panic; another site
reported:
> [ss_finish:1081, 0xfffffc000041c160] bsr r26, simple_lock(line 262)
> [cdisk_strategy:2658, 0xfffffc00003ed864] bsr r26, simple_lock(line 262)
and that panic also resulted in significant corruption to ufs-based
file systems.
Note, this type of panic does not necessarily cause corruption.
We had several simple_lock panics with no corruption.
However, it may trigger some other weakness if it isn't the cause.
If you get this type of panic, contact Digital.
------------------------------------------------------------------------------
p#2 trap: invalid memory read access from kernel mode
In our case, this was a reproducible panic (2) from above.
We moved the disks from our 8400 to a 2100 and could cause the same panic on
the 2100 with an 'ls -l' on the affected directory.
Digital provided replacement disks and requested those three (a raidset)
be sent back to engineering for further analysis.
It is pretty clear this directory was corrupted as a result of (directly or
indirectly) the initial panic (1). As this was caused by something else
I'm not particularly concerned with the panic itself. I am concerned
with what caused the damage... see p#4.
------------------------------------------------------------------------------
p#3 # ADVFS EXCEPTION
The particular symptoms of this advfs panic have a fix in OSF360-018
consolidated advfs patches. I believe the current advfs consolidated
patches are now OSF360-028 (as of 5/4).
We are now scanning atlanta.service.digital.com:/pub/patches/osfv32d-1
on a weekly basis (automated via script) for new patches and, if any are
available, automatically FTP'ing the patches.tar.Z file.
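For illustration, here is a bare-bones sketch of that weekly check. The host
and directory are as above; the local paths, the anonymous-ftp password, the
mail recipient, and the listing-comparison approach are placeholders rather
than exactly what our script does:

#!/bin/sh
# Weekly patch check (sketch).  Host and directory as in the text; the
# local paths, anonymous-ftp password, and mail recipient are placeholders.
HOST=atlanta.service.digital.com
DIR=/pub/patches/osfv32d-1
LOCAL=/usr/local/patches          # where fetched patch kits land (example)
PASS=sysadmin@your.site           # anonymous ftp "password" (your address)
NEW=/tmp/patchlist.$$

# Pull the current remote directory listing via anonymous ftp.
ftp -n $HOST <<EOF
user anonymous $PASS
ls $DIR $NEW
quit
EOF
[ -s $NEW ] || exit 1             # ftp failed; try again next run

# If the listing is unchanged since last week there is nothing to do.
if [ -f $LOCAL/last_listing ] && cmp -s $NEW $LOCAL/last_listing; then
    rm -f $NEW
    exit 0
fi
mv $NEW $LOCAL/last_listing

# Something changed: fetch the consolidated kit and tell somebody,
# but do NOT install anything automatically -- review first.
ftp -n $HOST <<EOF
user anonymous $PASS
binary
get $DIR/patches.tar.Z $LOCAL/patches.tar.Z
quit
EOF
mailx -s "new patches available on $HOST" root < $LOCAL/last_listing

The deliberate non-step is application: the kit just lands locally and
somebody gets mail.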
I am still waiting for feedback from CSC on whether it is recommended
to apply the patches automatically; we certainly plan to review them
for applicability here as they come out.
If you happen to be a heavy advfs user you may want to consider regular
application of the consolidated advfs patches (we are considering it).
------------------------------------------------------------------------------
p#4 what caused the corruption
As yet this isn't known with any certainty.
The initial panic (1) hung during the 'syncing disks...' phase of the panic.
We happen to license the OSF source CD, so I did do a cursory review of the
source. At this point the system is trying to flush any pending writes to
the disks; in our case it never finished (hung). Logically, this could be
a cause of some type of corruption if not handled properly by the
operating system and hardware/firmware.
Unix itself does utilize asynchronous writes; however, for filesystem
information:
> Metadata is always written synchronously so if it is corrupted it is either
> a software problem or underlying data storage problem. In other words
> metadata writes are not cached in the buffer cache.
Theoretically, we should not have corrupted advfs unless there is
a 2nd problem hiding behind the simple_lock panic. The failure
to sync disks is a concern; it would be comparable to a hard power
failure when disks could not be sync'd:
> On UNIX it is generally assumed that you may lose up to 30 seconds of the
> latest data if the power goes down. Applications that care about this use a
> log-based approach and synchronous writes to preserve consistency of the data.
> Oracle does that.
Again, Oracle and advfs are supposed to be rather robust and they should
have recovered. Unfortunately, they did not.
A suggested possible cause of corruption is an earlier undetected error
in the storage subsystem. As we run reports from all HSZ's and binary.errlog
nightly (automated via scripts), we did not (and do not) consider this a viable
cause of our problem. All affected disks were on hsz40's in raid-5
configurations.
Another possible source for lost writes (corruptions) would be a failure
in the prestoserve module. In our case the "flushing dirty buffers"
messages after the panics appeared to be "correct". However, we elected
to disable our presto because we intend to migrate to ASE and it's not
supported under ASE at all. Coincidentally (or not), with the three
subsequent (1) panics which hung during 'syncing disks...' we had no
corruption. Several people stated there were no reports of presto
corruption problems subsequent to OSF v2.*. Besides deconfiguration of the
presto there were several other subtle changes between the two events
which caused corruption here and two which didn't:
System activity was lower at times of non-corruption panics;
Second (of three) cam_tape.o patches was applied;
OSF360-018 consolidated advfs patches were applied;
Root domain moved to rz28m's from rz29b's (fw 003) on a less
used HSZ40 and bus.
However, until Digital engineering can identify a cause there are some
grounds to suspect a flaw with the presto, perhaps in conjunction with
the 'syncing disks...' hanging.
Remaining possible causes are a software failure (Unix or advfs) or the
particular symptoms of this particular panic. A panic is by nature
something unexpected: the operating system is in a state it shouldn't
be in, and the results may be unpredictable... the OS's job is to clean
up quickly (sync) and dump. There is some possibility software is the cause.
If this is the cause there may be some aggravating condition within
our configuration as the problem is apparently infrequent, albeit severe.
Again, we don't have a cause. Digital engineering is supposed to
be working with us on that. Since this happened twice (2/15 and
4/24), I am being persistent on this issue as it could happen again
to us or another site if the cause isn't identified.
------------------------------------------------------------------------------
p#5 how can one tell if something was corrupted in a panic
We've had a number of panics on our collection of DU systems (6 systems),
most of which recovered entirely, with the following exceptions:
the ones cited here;
two other advfs panics where domains were no longer mountable
(patches available for this);
one with corrupted hsz writeback cache after a power failure
due to a defective battery.
In all cases except the 2/15 and part of the 4/24 events the corruption
was obvious (and we at least knew what had to be recovered).
The exposure on the 2/15 problem was particularly bad. The system
rebooted and Oracle restarted successfully. Five hours later Oracle
crashed (apparently when the corrupted index was touched) and would not
restart. The only way we could roll-forward the last five hours of
activity was to first drop the affected index and rebuild it (took
a long time, 13M row table) after the roll-forward.
Had the corruption occurred in data instead of an index (as it did on 4/24),
roll forward would have been impossible. We did not automatically restart
the databases on 04/24, suspecting problems. Any commercial application
losing 5 hours of activity (or even the 30+ hours of recovery time)
can have devastating consequences to a business.
While that's a nice sob story, the point is something may be corrupted
and not be obvious until hours (or days) later when the damage can
be much worse. Don't assume your database is OK just because it starts.
At this point, we no longer automatically restart database applications
after a panic. This is accomplished here by detecting the "dirty"
shutdown and not proceeding to 'init 4', where we choose to start the
applications (Oracle databases). We now insist that one of the technical
services staff review the nature of the panic. If it's a "normal"
panic where the disks sync'd, the file systems re-mount successfully, and
there are no suspicious errors logged (either console log or via uerf), we
proceed with restarting the databases. If the system hung during
'syncing disks...', we are at this time doing an 'ls -lR / | grep' (of
conditions we saw 4/24) and a full export of the databases to
try to detect errors. If we detect or suspect any corruption,
then besides the necessary restores/roll-forwards we will continue
with /usr/field advfs analysis tools and potentially a dd across
the disks to detect bad blocks.
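For what it's worth, here is a stripped-down sketch of that sweep-and-export
check. The grep patterns, file names, and the Oracle account/export
parameters are illustrative placeholders, not exactly what we run:

#!/bin/sh
# Post-panic integrity sweep (sketch).  Patterns, paths, and the Oracle
# account/export parameters are placeholders -- adjust for your site.
LOG=/var/adm/panic_check.`date +%y%m%d`

# 1. Walk the file systems looking for entries that no longer stat
#    cleanly (for us these showed up as 'bad status' / I/O errors).
ls -lR / 2>&1 | egrep -i 'bad status|no such file|i/o error' > $LOG

# 2. Full export of the database; a corrupted row or index tends to
#    surface here as ORA-/EXP- errors long before a user hits it.
exp system/manager full=y file=/backup/finp_full.dmp log=/backup/finp_full.log
egrep 'ORA-|EXP-' /backup/finp_full.log >> $LOG

# 3. Anything suspicious means restore/roll-forward, then the
#    /usr/field advfs tools and a dd sweep of the member disks.
if [ -s $LOG ]; then
    mailx -s "possible post-panic corruption, see $LOG" root < $LOG
fi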
Some specific suggestions given to us which I found quite pertinent:
> At the moment I would:
>
> 1. Stay with AdvFS for the time being.
> 2. DO *one* sweep of disks using dd to make sure there are no bad blocks.
> 3. DO *one* full sweep with the /usr/field tools to make sure that the
> metadata is consistent.
> 4. Make sure that you have latest consolidated AdvFS patches. If not apply.
> Repeat every three months or so.
> 5. Postpone DECsafe implementation till the system is stable.
Further discussion of '1' above follows in p#6 below. As for '2' and '3':
> Those two actions - checking filesystem consistency and scanning blocks
> will bring your filesystems to a known consistent state.
>
> Otherwise it is not possible to move forward as you may have some other
> inconsistencies that we do not know about as they are not exercised.
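A sketch of the one-time dd sweep in suggestion '2' (device names are
examples only; partition 'c' conventionally covers the whole disk):

#!/bin/sh
# One-time block sweep (sketch): read every block of each raw disk so
# the driver/HSZ gets a chance to report anything unreadable.  Run it
# off-hours -- it generates a full-disk read load.
for d in rrz17c rrz18c rrz19c
do
    echo "sweeping /dev/$d ..."
    dd if=/dev/$d of=/dev/null bs=64k 2>&1 | tail -3
done
# Follow with a /usr/field metadata sweep (msfsck etc.) of each domain.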
One suggestion for errors encountered with msfsck:
>Here is a possible way to remove the suspicious file(s).
>
>1. Shutdown singleuser
>2. Do a /usr/field/msfsck and note the tagnumbers for the 'bad' files. If
> no bad files, go to 6.
>3. Do '/usr/field/tag2name <mount-point>/.tags/<tag-#>' on the tags noted.
>4. rm the files from the /usr/field/tag2name command
>5. go to 2.
>6. reboot.
However, if anything significant shows up, simply restoring the domain
might be the prudent choice. (A sketch of steps 3-4 as a small script follows.)
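Here is what steps 3-4 might look like scripted; MOUNT and the tag list are
examples only, and the tag2name usage is exactly as quoted above:

#!/bin/sh
# Sketch of steps 3-4: map the tag numbers noted from msfsck to path
# names, review them, then remove.  Run single-user.
MOUNT=/var
TAGS="1234 1235"        # tag numbers reported by msfsck in step 2

for t in $TAGS
do
    name=`/usr/field/tag2name $MOUNT/.tags/$t`
    echo "tag $t -> $name"
    # rm "$name"        # uncomment only after the names are reviewed
done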
------------------------------------------------------------------------------
p#6 how can one prevent things like this
Obviously, make sure you have solid backup strategies so you can recover
(fortunately, we did).
Automate uerf or decevent reporting so you have an understanding of
any hardware errors as they occur (fortunately, we had already
done this after 2/15 and could eliminate hardware errors as causes).
We wrote a 'uerf |' consolidating report program to reduce the
reports to something quickly scannable. Decevent has improved the
reporting quite a bit over uerf, but it is still lacking in some areas.
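To illustrate the consolidation idea (this is not our actual program; the
pattern matched and the destinations are placeholders, and the exact labels
worth counting depend on the uerf output format on your system, so tune the
grep after looking at one full report):

#!/bin/sh
# Nightly uerf consolidation (sketch).
FULL=/tmp/uerf_full.$$
SUMMARY=/var/adm/uerf_summary.`date +%y%m%d`

uerf > $FULL 2>&1       # narrow with uerf's time-range options if desired

# Boil the report down to a count of occurrences per event line and keep
# the full output around for anything that needs a closer look.
grep -i 'event' $FULL | sort | uniq -c | sort -rn > $SUMMARY

if [ -s $SUMMARY ]; then
    mailx -s "uerf summary `date`" root < $SUMMARY
fi
rm -f $FULL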
As stated at least twice above, develop a strategy to obtain and apply
patches in a preventative / pro-active mode before something breaks.
Digital does need to do something further to facilitate this. Also, keep
in mind that patches sometimes break something when they fix something else...
reviewing patches rather than applying them automatically might be a good idea.
I also suggest careful attention to logging details... retention
of console logs electronically (in our case via console manager)
has been critical for after-the-fact review of what happened (you
can't prevent something recurring if you don't know what happened).
Retention of other log detail (binary.errlog, dumps, etc.)
is also important.
We have considered abandoning advfs (twice now). However, with regard
to these types of corruption there is no logical difference between
advfs and ufs (and to a degree raw devices) if the cause were the presto,
the state of the panic, or a Unix condition triggering it.
At least one other site had significant corruption with UFS.
Some additional comments on raw devices:
> It is worth noting that most database systems seem to recommend
> using raw disks, which have full synchronous writes, unless you
> have a disk or array with write caching. Using the file system
> with a database goes against these recommendations.
>
> They do recommend using raw devices for a reason. Their ability
> to avoid data copies from kernel memory to user space and to
> use different caching algorithms are probably the main advantages,
> but being able to avoid asynchronous writes is also another reason.
Managing raw devices (vs. advfs or ufs) is considerable additional effort.
Oracle had no particular qualms about using ufs vs. raw disks (there
are some caveats, such as proper tuning of UBC since Oracle buffers IO's).
We discussed advfs with Oracle before adopting it over ufs in late 1994.
Oracle was concerned with the newness of advfs at that time (as were we),
but the timing of our production use of it (mid-1995), with the earlier
problems resolved by 3.2c, led us to "gamble" on advfs. I think the jury
is still out.
If you don't trust your operating system and file system to write data
correctly can you trust it to process the data?
------------------------------------------------------------------------------
That's a lot of information (and there's a lot more buried behind it).
If somebody has something to add regarding our particular problem,
I would appreciate it. Significant pieces are still in Digital engineering.
Kurt Carlson, University of Alaska