Tru64 Managers:
Thanks for the assistance from list members.
The suggestions were to:
- remove and re-add the quorum disk
- check for device naming problems using dsfmgr and hwmgr (see the command sketch below)
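For anyone finding this in the archives later, the suggested checks were roughly along these lines. The disk name and vote count are just examples from my configuration, and the exact options are from memory, so check the dsfmgr(8), hwmgr(8), and clu_quorum(8) reference pages before running anything:
   # verify device special files and look for naming inconsistencies
   dsfmgr -v
   hwmgr -view devices
   # show the current quorum configuration, then remove and re-add the quorum disk
   clu_quorum
   clu_quorum -f -d remove
   clu_quorum -f -d add dsk10 1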
Unfortunately the resolution was a little more complicated, but better in
the long run:
The recommendation from Tru64 support was:
1. Delete one of the cluster members from the cluster configuration
2. Upgrade the single-node cluster to Tru64 5.1 patch kit 4
3. Re-add the second member to the cluster (rough command sketch below)
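Roughly, the sequence looked like this. The member ID and patch kit location are specific to my setup, and both dupatch and clu_add_member are interactive and prompt for the details, so treat this as a sketch rather than a recipe:
   # on the surviving member, remove the broken member (member ID 2 in my case)
   clu_delete_member -m 2
   # install patch kit 4 on the now single-member cluster
   cd /var/tmp/patch_kit
   ./dupatch
   # once the patched member is stable, re-create the second member
   clu_add_member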
Apparently the situation I ran into (the second member being unable to join the cluster) is a known issue that is resolved by patch kit 4.
After the upgrade to patch kit 4, I also updated the HSG60s to 8.6L cards and ACS 8.6 software (recently received from Compaq support).
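(If you want to confirm which ACS version a controller actually ends up running, the HSG CLI reports it; if I remember the command names right, it's:
   SHOW THIS_CONTROLLER
   SHOW OTHER_CONTROLLER
run from the controller's maintenance console.)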
Thanks for the support and assistance!
Robert
-----Original Message-----
Sent: Wednesday, January 16, 2002 9:08 PM
Subject: Cluster 2nd node won't boot
Tru64 Managers:
I have a 2-node ES40 cluster (5.1, patch 3) that has been running fairly
well for the past 9 months.
Today, a node crashed, and now I cannot get both nodes up at the same time.
If node 1 is up, node 2 will crash just after it joins the cluster. If node
2 is up, node 1 will crash after it joins the cluster.
I'm looking for any hints/ideas on how to fix this problem:
The second node to join the cluster crashes with a message about an invalid kernel mode access, but I can't capture the exact text since it's on a VGA console.
ALSO -- I notice that the CCMAB (memory channel adapters) that tie the nodes
together are different revisions. One is at "rev 23" and one at "rev 24".
Does this difference have any significance? (Seems unlikely, since the
cluster has been operational up to this point.)
By the way, mc_diag and mc_cable work just fine on both nodes.
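(In case it matters, I ran those Memory Channel diagnostics from the SRM console on each node with the systems halted, i.e. something like:
   >>> mc_diag
   >>> mc_cable
and both reported the adapters and cabling as good.)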
Here are some of the messages I see on the first cluster member when the second member tries to join:
Jan 16 17:50:00 mars vmunix: kch: resuming activity
Jan 16 17:50:06 mars vmunix: rm_state_change: mchan0 slot 1 offline
Jan 16 17:50:06 mars vmunix: rm_lrail_remove_node: logical_rail 0 hubslot 1
Jan 16 17:50:28 mars vmunix: CNX MGR: communication error detected for node 2
Jan 16 17:50:28 mars vmunix: CNX MGR: delay 1 secs 0 usecs
Jan 16 17:50:28 mars vmunix: CNX QDISK: Cluster transition, releasing claim to 1 quorum disk vote.
Jan 16 17:50:28 mars vmunix: CNX MGR: quorum lost, suspending cluster operations.
Jan 16 17:50:28 mars vmunix: kch: suspending activity
Jan 16 17:50:28 mars vmunix: dlm: suspending lock activity
Jan 16 17:50:28 mars vmunix: CNX MGR: Reconfig operation complete
Jan 16 17:50:28 mars vmunix: CNX MGR: membership configuration index: 5 (3 additions, 2 removals)
Jan 16 17:50:28 mars vmunix: ics_mct: Node 2 is now down
Jan 16 17:50:28 mars vmunix: CNX MGR: Node pluto 2 incarn 0x25878 csid 0x20002 has been removed from the cluster
Jan 16 17:50:28 mars vmunix: CLSM Rebuild: starting...
Jan 16 17:50:28 mars vmunix: dlm: resuming lock activity
Jan 16 17:50:28 mars vmunix: kch: resuming activity
Jan 16 17:50:28 mars vmunix: CNX QDISK: Successfully claimed quorum disk, adding 1 vote.
Jan 16 17:50:28 mars vmunix: CNX MGR: quorum (re)gained, (re)starting cluster operations.
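(My understanding of the vote arithmetic, in case I have it wrong: each member contributes 1 vote and the quorum disk 1 vote, so
   expected votes = 1 + 1 + 1 = 3
   quorum needed  = round_down((3 + 2) / 2) = 2
When node 2 drops out and the quorum disk claim is released, node 1 is briefly down to 1 vote and quorum is lost; it then reclaims the quorum disk vote, gets back to 2, and resumes cluster operations, which seems to match the messages above.)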
Thank you,
Robert Aldridge
Alliance, Ohio