Hi,
Some of you may be aware I have been fighting a strange Legato hang
which occurs right at the start of my Alpha123-backup group. This group
gets started manually by Operations at night using the nwadmin GUI.
They would start the job but it would stall for one or more clients. In
other words, that client's entire backup process appears to hang in
limbo, not doing anything and not creating any error messages, but
nonetheless not creating the good backups we want. This client hang may
also prevent the entire backup group from finishing, which affects the
post-processing section of the Savepnpc logic too.
We believe this problem is now resolved.
Our latest theory is that the Oracle databases were not coming down,
which is why the backups were not starting. These symptoms were
prompting our Operations unit to reboot the server or servers to make
the backup run correctly.
We use Legato's Savepnpc facilities to shut down and restart our
databases. Here is a link from April 1999 which describes exactly what
Savepnpc is, how to use it, and common problems often seen when using
it.
Subject: SUMMARY: NSR savepnpc - what am I doing wrong.
http://www.ornl.gov/its/archives/mailing-lists/tru64-unix-managers/1999/04/msg00028.html
We would now like to add another common problem to our list of common
problems which we created in April of 1999.
On August 4 we tried the following experiment which is explained in
more detail in the next summary.
Subject: SUMMARY: How do I automatically start up the V2 Oracle
Listener after a re-boot?
http://www.ornl.gov/its/archives/mailing-lists/tru64-unix-managers/2001/08/msg00082.html
Anyway I'll cut and paste the important snippet below. Applied
correctly this will fix our problem.
We changed the shutdown command in $ORACLE_HOME/bin/dbshut from a
NORMAL shutdown which is the default to an immediate shutdown. I'm not
sure if this is a "good practice" or not but we are giving it a try. We
changed a line of code in this script from #shutdown to #shutdown
immediate. Some day we may back out of this change. The
/sbin/init.d/oracle script we are using brings down the Listener first
then it brings down the database. We are currently experimenting with
an IMMEDIATE shutdown. I'm not a DBA but it seems to me since we have
already pulled the "listening rug" out from under the user processes
there might now not be much difference between a NORMAL and IMMEDIATE
shutdown so I don't know if we are buying anything with this change.
We will watch it and monitor for problems. If you guys think there is a
problem with using IMMEDIATE please let me know.
Check this out. There was a note labeled "Additional Information" in
the Oracle metalink document and I'll quote it for you now.
"The default shutdown performed by dbshut is a normal "shutdown".
If users are still logged in, dbshut will 'hang' until all users
have logged off the database. You may want to alter the script
so that is does a shutdown immediate."
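For anyone who wants to picture the stop order described above
(Listener first, then the database through dbshut), here is a minimal
sketch. It is not the actual /sbin/init.d/oracle script. The run helper
and the DRY_RUN variable are made up for illustration so the sequence
can be previewed without touching a database, and the sketch assumes it
runs as the Oracle software owner with ORACLE_HOME set and lsnrctl on
the PATH.

```shell
#!/bin/sh
# Sketch only: NOT the real /sbin/init.d/oracle script.
# DRY_RUN and run() are hypothetical helpers; by default nothing is
# executed, each command is only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

stop_oracle() {
    # 1. Bring down the Listener first so no new connections arrive.
    run lsnrctl stop
    # 2. Then bring down the database(s). Whether this waits for users
    #    depends on the shutdown statement inside dbshut itself.
    run "$ORACLE_HOME/bin/dbshut"
}
```

Calling stop_oracle with DRY_RUN left at 1 only prints the two commands
in order; setting DRY_RUN=0 would actually run them.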
On August 10, 2001 I still did not have the "immediate" thing right and
I was really starting to get frustrated. You can see I was still crying
about this problem at the following link:
Subject: Strange Networker Hang could be DNS related or could there be
another theory?
http://www.ornl.gov/its/archives/mailing-lists/tru64-unix-managers/2001/08/msg00169.html
What did I do wrong? It was a simple change: you just change one line
of code from "shutdown" to "shutdown immediate".
Well it appears I did my analysis on a piece of Oracle 8.1.7 dbshut
code which only contains one (1) reference to shutdown. I am a
bottom-up programmer, and when I went into another, more important
piece of dbshut code from an Oracle 8.0.6 database, from the bottom up
as well, I quickly found and changed what I thought was the same line
of code. Apparently my "shutdown immediate" experiment from 08-04-2001
was applied to the wrong branch of the Oracle shutdown code. It was one
of those if-then-else pieces of logic and I made my change to the wrong
branch. I have now made the same adjustment to the other branch of this
shutdown logic and we are seeing the results we originally expected to
see on August 4, 2001.
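One way to avoid changing only one branch is to list every shutdown
reference in a given copy of dbshut before and after the edit. The
helper below is just a sketch. The function name list_plain_shutdowns
is made up, and it will also flag comment lines that merely mention
shutdown, so read its output rather than trusting it blindly.

```shell
#!/bin/sh
# list_plain_shutdowns FILE
# Hypothetical helper: prints, with line numbers, every line in FILE
# that mentions "shutdown" but does not mention "immediate".
list_plain_shutdowns() {
    grep -n -i 'shutdown' "$1" | grep -v -i 'immediate'
}
```

For example, list_plain_shutdowns $ORACLE_HOME/bin/dbshut. If it prints
nothing, every shutdown line in that copy already says immediate. If it
prints a line you did not expect, there is probably another branch
still doing a normal shutdown.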
We are reporting there were no client hangs last night so we think we
beat the problem. Yup, it worked one-night-in-a-row for us, which is a
new record, and we are declaring the problem beat. Of course the Oracle
DBAs still need to look into the cause of those long running Oracle
processes and determine whether this is a problem or not. This could be
the result of bad application logic, an inefficient database index, or
a two-hour batch job that I now clobber with my new shutdown immediate
logic. Either way 1, 2, or 3, I think the DBAs need to investigate a
little now because I'm standing behind my fix.
I am still a Legato fan and I just love this software.
I hope with all my recent crying I have not given you the impression
this product is not stable. On the contrary, I think Legato release
6.02 build 251 is good to go with Tru64 UNIX 5.1 patch kit-003 in a
non-clustered environment using Oracle releases 8.0.6 or 8.1.7 when
backing up these databases from a cold state using Legato's Savepnpc
facilities as outlined above. That's my environment and I can only make
recommendations based on my own experience, which is not to say similar
or different combinations of hardware or software levels may or may not
be good to go for you. I'm just trying to undo any wrong impressions
which may have been emanating from my keyboard due to all my recent
whining about this problem. I am not endorsing anything.
Problem solved, I'm a happy camper now. I just love Legato.
Sincerely
Kevin Criss
Received on Thu Aug 16 2001 - 16:30:47 NZST