Cluster 2nd node won't boot

From: Aldridge, Robert E. <REAldridge_at_mcdermott.com>
Date: Wed, 16 Jan 2002 20:08:23 -0600

Tru64 Managers:

I have a 2-node ES40 cluster (5.1, patch 3) that has been running fairly
well for the past 9 months.

Today a node crashed, and now I cannot get both nodes up at the same time.
If node 1 is up, node 2 crashes just after it joins the cluster. If node
2 is up, node 1 crashes just after it joins the cluster.

I'm looking for any hints/ideas on how to fix this problem:

The second node to join the cluster crashes with a message about an
invalid kernel mode access, but I can't capture the exact text since it's
on a VGA console.


ALSO -- I notice that the CCMAB (memory channel adapters) that tie the nodes
together are different revisions. One is at "rev 23" and one at "rev 24".
Does this difference have any significance? (Seems unlikely, since the
cluster has been operational up to this point.)

By the way, mc_diag and mc_cable work just fine on both nodes.

Here are some of the messages I see on the first cluster member, when the
second member tries to join in:


Jan 16 17:50:00 mars vmunix: kch: resuming activity
Jan 16 17:50:06 mars vmunix: rm_state_change: mchan0 slot 1 offline
Jan 16 17:50:06 mars vmunix: rm_lrail_remove_node: logical_rail 0 hubslot 1
Jan 16 17:50:28 mars vmunix: CNX MGR: communication error detected for node 2
Jan 16 17:50:28 mars vmunix: CNX MGR: delay 1 secs 0 usecs
Jan 16 17:50:28 mars vmunix: CNX QDISK: Cluster transition, releasing claim to 1 quorum disk vote.
Jan 16 17:50:28 mars vmunix: CNX MGR: quorum lost, suspending cluster operations.
Jan 16 17:50:28 mars vmunix: kch: suspending activity
Jan 16 17:50:28 mars vmunix: dlm: suspending lock activity
Jan 16 17:50:28 mars vmunix: CNX MGR: Reconfig operation complete
Jan 16 17:50:28 mars vmunix: CNX MGR: membership configuration index: 5 (3 additions, 2 removals)
Jan 16 17:50:28 mars vmunix: ics_mct: Node 2 is now down
Jan 16 17:50:28 mars vmunix: CNX MGR: Node pluto 2 incarn 0x25878 csid 0x20002 has been removed from the cluster
Jan 16 17:50:28 mars vmunix: CLSM Rebuild: starting...
Jan 16 17:50:28 mars vmunix: dlm: resuming lock activity
Jan 16 17:50:28 mars vmunix: kch: resuming activity
Jan 16 17:50:28 mars vmunix: CNX QDISK: Successfully claimed quorum disk, adding 1 vote.
Jan 16 17:50:28 mars vmunix: CNX MGR: quorum (re)gained, (re)starting cluster operations.
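
Log excerpts like the one above are often hard-wrapped when pasted through
mail, which makes them awkward to search. The following is a minimal Python
sketch, not a Tru64 tool, for rejoining wrapped lines and counting messages
per kernel subsystem. The function names (rejoin_wrapped, subsystem_counts)
and the assumption that every real log line starts with a
"Mon DD HH:MM:SS host vmunix: " prefix are illustrative choices based only on
the messages shown here.

```python
import re
from collections import Counter

# Assumption: a genuine log line starts with "Mon DD HH:MM:SS host vmunix: ".
# Anything else is treated as the wrapped tail of the previous line.
SYSLOG_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) vmunix: (?P<msg>.*)$"
)


def rejoin_wrapped(lines):
    """Merge mail-wrapped continuation lines back into their log line."""
    joined = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        if SYSLOG_RE.match(line) or not joined:
            joined.append(line)
        else:
            joined[-1] += " " + line.lstrip()
    return joined


def subsystem_counts(lines):
    """Count messages per subsystem prefix (text before the first colon)."""
    counts = Counter()
    for line in rejoin_wrapped(lines):
        m = SYSLOG_RE.match(line)
        if m and ":" in m.group("msg"):
            counts[m.group("msg").split(":", 1)[0]] += 1
    return counts


# Sample taken from the messages quoted in this post.
sample = [
    "Jan 16 17:50:28 mars vmunix: CNX MGR: communication error detected for node",
    "2",
    "Jan 16 17:50:28 mars vmunix: CNX QDISK: Cluster transition, releasing claim",
    "to 1 quorum disk vote.",
    "Jan 16 17:50:28 mars vmunix: kch: suspending activity",
]

for entry in rejoin_wrapped(sample):
    print(entry)
print(dict(subsystem_counts(sample)))
```

Run against a full /var/adm/messages excerpt, the per-subsystem counts give a
quick view of which components (CNX MGR, CNX QDISK, kch, dlm, and so on)
logged activity around the crash.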


Thank you,

Robert Aldridge
Alliance, Ohio
Received on Thu Jan 17 2002 - 02:08:48 NZDT
