Some time ago I posted a problem with TruCluster V1.5 and the aseagent
process timing out on startup (see below). As I got zip back I can only assume
that either I foxed you all or the message disappeared down a black hole!
Anyway, I recently had another problem pop up on this node with a slow ftp
connection. I trawled the archives and came up with a summary posted by
Burch Seymour:
http://www.ornl.gov/its/archives/mailing-lists/alpha-osf-managers/1998/05/msg00044.html
which pointed to a dodgy entry in /etc/resolv.conf. And guess what: it also
seemed to be the cause of my ASE problem - most likely the timeout while trying
to reach a duff DNS/BIND server. Also, these boxes are set to look up names in
DNS first, then local hosts. If it were the other way round I probably would
never have seen this. Anyway, it's one to note for future reference!
Regards,
Dave Campbell
Vodafone Ltd.
e-mail : dave.campbell_at_vf.vodafone.co.uk
Quote of the day:
If at first you don't succeed, then skydiving definitely ain't the sport for
you.
Original Message:
I have an interesting problem that has developed on a two-node 8400
Available Server cluster running 4.0D, TCR 1.5 and patch kit 3. When one of
the systems boots into the cluster all appears well, then shortly after the
boot has finished I get:
ASE: local Agent Error: timed out trying to contact the HSM, giving up...
ASE: local Agent Error: fatal error: failed to init agent
ASE: local Agent ***ALERT: main: fatal error...
ASE: local AseMgr Bug: HSM says there are 1 members, I say 2
ASE: local AseMgr Error: Can't get member states from HSM
and the aseagent process drops dead! The agent status from asemgr for this
node then appears as DOWN. If I then manually restart the process
(/usr/sbin/aseagent -b -p hsm) the AM kicks in OK, the network paths
come up and the two nodes seem to communicate OK (albeit asemgr on the
manually restarted node appears much slower to start). I've tried switching
over the primary and backup networks within asemgr to see if the FDDI
interface I'm using is duff - but no difference. Although this is not
conclusive (as I couldn't bounce both nodes - some users might complain!),
I've had this error through several scheduled reboots on one node or the
other - but the error always seems to occur on the same node. The
frustrating part is that I know all was well up to about a week ago; then,
following a few unexpected crashes, this started to happen. I'm also aware
that there are significant changes going on in our network at present, and I
have seen some strange effects on the routing tables. Also, I haven't yet
set up any additional info in the /etc/routes file for ASE (a rough sketch of
what an entry looks like follows the questions below). So my questions are:
Can anybody explain what may be going on here?
Why does it all appear to work if I hand start it?
Could this problem be caused by the lack of info in the routing tables?
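For reference on that last question, this is roughly what an /etc/routes
entry looks like on these boxes - as far as I recall, each line is just a set
of route(8)-style arguments that get added at boot time. The networks and
gateways below are made-up placeholders, not values from this cluster:

    # /etc/routes - one route per line, handed to "route add" at startup.
    # Hypothetical example entries only:
    default 192.168.1.254
    -net 192.168.20.0 192.168.1.254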
Received on Tue Mar 23 1999 - 13:52:55 NZST