Thanks to Henk Kalle, Dr. Alan Rollow, Martin Vasas and Bryan Williams for their help and suggestions. Below is a list of responses/suggestions in
no particular order...
_________________________________________________________________
If there was a hardware problem, I would expect to see disk errors in
the binary error log or domain panics in the messages file. There are
several Oracle patches for data file corruption. Check with Oracle for
your version.
_________________________________________________________________
There is an
Oracle utility called dbverify which can be used to check the
datafiles.
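If dbverify is not at hand, a very crude OS-level sanity check is to scan the datafile for fully zeroed blocks, one common symptom of corruption. The sketch below is only illustrative and is not a substitute for dbverify, which also validates block checksums and header structure; the 8 KB default block size is an assumption and should match your database's DB_BLOCK_SIZE.

```python
# Crude scan of a datafile for blocks that are entirely zero bytes.
# Illustrative only -- dbverify performs much deeper validation.

def find_zero_blocks(path, block_size=8192):
    """Return the block numbers of blocks that are all zero bytes."""
    zero_block = bytes(block_size)
    bad = []
    with open(path, "rb") as f:
        block_no = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            if block == zero_block:
                bad.append(block_no)
            block_no += 1
    return bad
```

A file that suddenly reports zeroed blocks it did not have before is worth showing to both Oracle and your hardware vendor.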
__________________________________________________________________
There is an init<sid>.ora parameter to fix this problem.
Edit the init<sid>.ora file and add: _tru64_directio_disabled=true
Restart the database.
Recover the database.
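The steps above might look roughly like the following. Note that _tru64_directio_disabled is a hidden (underscore) parameter; such parameters should normally only be set under Oracle Support guidance, and the exact recovery commands depend on your situation.

```
# init<sid>.ora -- hidden parameter, set only under Oracle Support guidance
_tru64_directio_disabled=true

-- then, in SQL*Plus as SYSDBA:
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT
SQL> RECOVER DATABASE
SQL> ALTER DATABASE OPEN
```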
__________________________________________________________________
Oversimplifying a bit (or a lot), the two general causes of
this type of data corruption are:
o Software bug, usually dealing with process interlocks
that should prevent two processes from trying to write
the same data at the same time.
o Undetected hardware errors.
About all you can do on the software is ensure that all
the most recent patches are installed, in the hope that
the relevant patches have found the software problem and fixed
it. Some software systems, such as a database, may have
features that can be enabled to double check their own
writes and data to help detect and correct errors. I
don't have a clue if Oracle has such a feature.
The hardware side is even harder, since the cause of the
problem was an "undetected" error. Each point in the
typical data path has checks that ensure the data going
across it is the expected data. Some data paths are
parity protected, because they're sufficiently reliable
that a double bit error (undetectable) is too rare to
worry about. Others are ECC protected, so they can at
least detect multi-bit errors and correct single bit
ones. Some subsystems support extra levels of checking
such as reading the data after being written and then
checking that against the original.
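The difference between parity and stronger checks can be shown with a toy example. A single even-parity bit detects any odd number of flipped bits, but an even number of flips leaves the parity unchanged, which is exactly the "undetectable double bit error" mentioned above. The helper names here are purely illustrative.

```python
# Why parity alone cannot catch double-bit errors: one even-parity bit
# per byte detects any odd number of flipped bits, but an even number
# of flips leaves the parity unchanged.

def parity_bit(byte):
    """Even parity: 1 if the byte has an odd number of set bits."""
    return bin(byte).count("1") % 2

def flip_bit(byte, pos):
    """Flip one bit of a byte (pos 0-7)."""
    return byte ^ (1 << pos)

original = 0b10110100          # four set bits -> parity 0
stored_parity = parity_bit(original)

# Single-bit error: the parity check catches it.
one_flip = flip_bit(original, 2)
assert parity_bit(one_flip) != stored_parity

# Double-bit error: parity is unchanged, so the corruption goes undetected.
two_flips = flip_bit(flip_bit(original, 2), 5)
assert parity_bit(two_flips) == stored_parity
```

ECC codes add enough redundancy to correct single-bit errors and at least detect multi-bit ones, which is why the more critical data paths use them.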
The rate of undetected data errors is designed to be
very low for each part of the system. But, it is a
game of chance; even with a low probability, one of
them is going to happen somewhere. Move enough data and
you're bound to see one eventually. In the total
universe of data being moved, your problem may have
been undetected data corruption that got noticed by
the part of Oracle that checks the format and content
of this file.
If you have appropriate support services, you probably
want to bring the data corruption to the attention of
all the vendors involved. How the data was wrong can
offer a clue about what sort of corruption it was. Data that
looks appropriate for another part of the same file, or for
part of a different file, is quite different from a
couple of bits being swapped in a single byte of data
somewhere.
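That diagnostic distinction can be sketched in code: compare a corrupted block against its expected contents and count how different they are. The thresholds and function name below are hypothetical, just to make the idea concrete.

```python
# Rough classification of how a block differs from its expected contents.
# A handful of flipped bits suggests a data-path error; a wholesale
# mismatch suggests the block was overwritten with data that belongs
# elsewhere (e.g. a misdirected write). Thresholds are illustrative.

def classify_corruption(expected, actual):
    """Compare two equal-length byte blocks and guess the failure mode."""
    assert len(expected) == len(actual)
    differing_bytes = sum(1 for a, b in zip(expected, actual) if a != b)
    flipped_bits = sum(bin(a ^ b).count("1")
                       for a, b in zip(expected, actual))
    if differing_bytes == 0:
        return "identical"
    if flipped_bits <= 8:
        return "bit flips"        # a few bits: likely data-path error
    if differing_bytes > len(expected) // 2:
        return "block replaced"   # mostly different: misdirected write?
    return "unknown"
```

Reporting which pattern you saw gives the vendors something concrete to chase.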
__________________________________________________________________
----- Forwarded by David Knight/CLUBCORP/US on 10/24/2003 02:51 PM -----
David Knight
10/24/2003 11:04 AM
To: tru64-unix-managers_at_ornl.gov
cc:
Subject: Oracle DBF file corruption / ADVFS
Hello Managers,
We recently experienced corruption in an Oracle dbf file and I am
trying to ensure that it is not due to any Unix/hardware issue. I have no
errors in my messages/sys.log files. O/S Version: Tru64 5.1 / TruCluster
5 (AdvFS). Any recommendations for ensuring that this is not a HW/OS issue,
or ways to check the file at the O/S level, etc., would be much appreciated.
Thanks,
David
Received on Fri Oct 24 2003 - 20:00:55 NZDT