Summary
The original question was:
We're running Oracle 7.3.4 on a double 4100 (cluster) under Tru64 4.0d/1.5.
Both OS and DBMS are installed locally; the database itself resides on a
RAID, which is completely configured as an ASE service. I created start/stop
scripts that start and stop all three databases and the Oracle listener.
The database is accessible through only one 4100; the other 4100 is
"standby" and will take over if the active machine fails.
I observed an unplanned fail-over once (due to a machine check!) and since
nothing has been changed in the working environment, I expected it to work
again.
It didn't.
Alas, I was on holiday at the time, so I can only tell from hearsay what
happened.
The one machine that runs Oracle failed (again due to a machine check, I
found out), but the database was NOT taken over by the other machine - at
least, no data could be accessed. What my colleagues did was restart both
machines; it seems the service was then started on the preferred node, but
one database (the most important one) did not start automatically. Logged
in as Oracle, they were able to start it manually.
I can't find ANYTHING on what caused the failure. On the standby node, I
found Oracle files left behind by two databases, but not by the most
important one. So it is possible that the service did start - just not the
database.
When checking, I found that cmon (the graphical cluster monitor) signals
"No ASE reports received" and shows both(!) nodes as "unavailable" -
although both are up and running. asemgr, however, works fine.
So:
1 Where (in the system) can I find information on the failover - even after
some time?
2 Why does cmon signal trouble where there is none?
3 How can I solve this (better: prevent it from happening again)?
The summary
===========
Thanks to
Jochen Van de Perre (Cronos) [Jochen.VanDePerre_at_UCB-Group.com]
David J. DeWolfe [sxdjd_at_java.sois.alaska.edu]
Dmitry Trikoz [d3koz_at_auriga.com]
Danielle Georgette [Danielle.Georgette_at_asx.com.au]
Viktor Holmberg [Viktor.Holmberg_at_tnsofres.com]
who all contributed in some way.
Deducing from what I found on the system, available Oracle knowledge, the
people mentioned above, and a good deal of intuition, the problem could have
been the time-out on the scripts - or some weird Oracle problem. Not all
the questions have been answered, but I'm well prepared for the next
fail-over - I hope.
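Since the script time-out is only a suspicion, it may help to measure how
long a start script really takes and whether it fits in a given limit. Below
is a minimal Python sketch of that idea, not what ASE itself does. The
function name run_with_deadline and the limit values are my own assumptions,
and the "start script" in the example is just a stand-in child process that
sleeps.

```python
import subprocess
import sys
import time


def run_with_deadline(cmd, limit_seconds):
    """Run cmd and return (status, elapsed_seconds).

    status is "ok" (exit code 0), "failed" (non-zero exit code)
    or "timeout" (the command was killed after limit_seconds).
    """
    started = time.monotonic()
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # when the limit is exceeded.
        result = subprocess.run(cmd, timeout=limit_seconds)
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"
    return status, time.monotonic() - started


if __name__ == "__main__":
    # Hypothetical stand-in for a real start script: sleeps for 3 seconds.
    slow_start = [sys.executable, "-c", "import time; time.sleep(3)"]
    status, elapsed = run_with_deadline(slow_start, limit_seconds=1)
    print(f"{status} after {elapsed:.1f}s")
```

Running the real start script this way a few times (on a test system, with a
generous limit) would show how close it comes to whatever time-out the
cluster software applies.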
On question 1:
The place to look is /var/adm/syslog.dated/<date>/daemon.log, as most
respondents pointed out. I'll know this next time! And next time, I'll have
it saved completely!
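To make sure the logs survive, they could be copied out of the dated
directories on a regular basis. The following is a rough Python sketch under
my own assumptions: the function name archive_daemon_logs and the archive
directory are made up, and only the <date>/daemon.log layout is taken from
the path above.

```python
import shutil
from pathlib import Path


def archive_daemon_logs(src_root, dest_root):
    """Copy <src_root>/<date>/daemon.log to <dest_root>/<date>.daemon.log.

    A log is copied when no archived copy exists yet or when its size
    differs from the archived copy (the log has grown since the last run).
    Returns the list of files written in this run.
    """
    src_root, dest_root = Path(src_root), Path(dest_root)
    dest_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for log in sorted(src_root.glob("*/daemon.log")):
        # Name the copy after the dated directory it came from.
        target = dest_root / f"{log.parent.name}.daemon.log"
        if not target.exists() or target.stat().st_size != log.stat().st_size:
            shutil.copy2(log, target)
            copied.append(target)
    return copied
```

A call such as archive_daemon_logs("/var/adm/syslog.dated",
"/some/archive/dir"), run daily, would keep a copy even after the dated
directories have been cleaned up; the second argument is a placeholder for
wherever the copies should live.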
As some pointed out, it should be possible to find something in the Oracle
environment - so I'll dig somewhat deeper.
Viktor pointed out it's a bad habit to keep Oracle files on local machines.
I know, but that's the way the system was delivered - and there were good
reasons to do this at the time. The extra work (if any...) is accepted as
it is.
On 2:
Is cmon so little known that nobody could answer this? I can live with it,
but I still wonder why this happens.
On 3:
As pointed out: KEEP THE LOGFILES, i.e. prevent them from being deleted.
Test the fail-over - and test it again - until everything works fine.
Willem Grooters
Sema Group Informatica BV
Managed Services
Tel. +31 (0)294 239 500
Fax. +31 (0)294 239 501
E-mail Willem.Grooters_at_sema.nl
Received on Fri Aug 25 2000 - 15:07:02 NZST