data and file system corruption after a panic

From: Kurt Carlson <SXKAC_at_orca.alaska.edu>
Date: Fri, 10 May 1996 14:54:04 -0800

We have had two panics which resulted in corrupted data; some details
are attached. My questions:

  Has anybody else seen data corruption in a panic?

  Can anybody provide some detail as to why this may occur, how it can
  be detected (after the panic), and how it can be prevented?

  Has anybody else seen a system hang during the 'syncing disks...'
  phase of a panic?

A quick scan of OSF source shows that when 'syncing disks...' the system
is attempting to flush pending writes to the local file systems.
Logic says that if this fails or never completes, one may be subject to
some type of data corruption (as we have seen). Likewise, one may be
subject to corruption with a power failure. In many years I never
saw this type of corruption under MVS or VMS. Is Unix flawed in its
write data caching?
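
For context on the caching question: on Unix a write() normally returns
once the data is in the kernel's buffer cache, and the flush to disk
happens later (or during 'syncing disks...'). An application that cannot
tolerate that window can ask for the flush itself with fsync(2). Below is
a minimal sketch in Python; the helper name durable_write is hypothetical,
and it says nothing about how Oracle or advfs actually do their I/O.
fsync also only helps as far as the drivers and controller caches honor it.

```python
import os


def durable_write(path, data):
    """Write data to path and request a flush to stable storage.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than asked; loop until done.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Without this call the data may still be sitting in the
        # buffer cache when a panic or power failure hits.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Data written this way costs a disk wait on every call, which is the
trade-off that write caching exists to avoid.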

Details:

We are running an 8400 on DU 3.2d-1 with all disks under advfs in
raidsets behind kzpsas (a10) on hsz40s (v2.5).

Since February we've had six panics 'simple_lock: time limit exceeded'
which seem to be triggered by tape errors. Digital has provided two sets
of patches to cam_tape.o which fix some symptoms of this problem and are
working on patches for other symptoms.

Two of these panics left us with data corruption. In both cases a
large oracle database was corrupted in a manner that was undetected at a
file system level. The second time also had two advfs filesets with
corruption leading to additional (different) panics. In both cases
long recovery windows (restore and roll-forward) were required. Because
the corruption went undetected, we are now very concerned about a panic
on any system leading to corrupted data. If the database restarts
and the corruption is in data rather than an index, a roll-forward after
a restore may become impossible.

In four of the six panics the 8400 never returned from:

 simple_lock: time limit exceeded
 
     pc of caller: 0xfffffc00004c16ac
     lock address: 0xfffffc0073cad9d0
     current lock state: 0x00000000004c1585 (cpu=0,pc=0xfffffc00004c1584,busy)

 panic (cpu 0): simple_lock: time limit exceeded
 syncing disks...

and ctrl/p at the console was ignored, requiring a restart from the front panel.

In the last two panics we had no corruption; I don't know whether that is
just luck. The environmental changes prior to the last two were:

        System activity was lower at times of panics;
        cam_tape.o patches were applied;
        OSF360-018 advfs patches were applied;
        presto-serve was deconfigured in preparation for
          migration to ASE.

When we see this circumstance (hanging during 'syncing disks...') we now
do a full export (several hours) of the oracle databases before restarting
them. If the export fails we know a recovery is required. We also do an
'ls -lR /', which has turned up other types of corruption before.
Both of these techniques are very crude. Does anybody have better techniques
either for detecting corruption or for avoiding it?
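
One possible refinement of the 'ls -lR /' check is sketched below: walk
the tree, read every regular file end to end (so that unreadable blocks
show up as errors rather than only bad directory entries), and record a
checksum that can be compared against a manifest taken before the panic.
This is only a sketch; the function names scan_tree and changed_files are
hypothetical, and it assumes Python is available on the system.

```python
import hashlib
import os
import stat


def scan_tree(root):
    """Return (manifest, errors) for the tree under root.

    manifest maps each regular file's path (relative to root) to its
    SHA-256 hex digest; errors lists (path, message) for anything that
    could not be listed, stat'ed, or read.
    """
    manifest, errors = {}, []

    def on_error(exc):
        # Called by os.walk when a directory cannot be listed.
        errors.append((exc.filename, str(exc)))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_error):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                # Skip symlinks, devices, sockets and so on.
                if not stat.S_ISREG(os.lstat(full).st_mode):
                    continue
                digest = hashlib.sha256()
                with open(full, "rb") as fh:
                    for chunk in iter(lambda: fh.read(1 << 20), b""):
                        digest.update(chunk)
                manifest[os.path.relpath(full, root)] = digest.hexdigest()
            except OSError as exc:
                errors.append((full, str(exc)))
    return manifest, errors


def changed_files(before, after):
    """Paths in the earlier manifest that are now missing or different."""
    return sorted(p for p in before if after.get(p) != before[p])
```

A changed digest does not by itself distinguish corruption from a
legitimate update, so the comparison is only meaningful for files that
are not expected to change (binaries, configuration) and not for live
database files.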

Kurt Carlson, University of Alaska
Received on Sat May 11 1996 - 01:11:45 NZST
