We had several suggestions that it was likely to be a quorum disk problem.
Phil Lawrence from Compaq has confirmed by reading the crash dump and
syslog.dated/daemon.log files that it was in fact caused by the cluster_lockd
daemon failing beforehand. This was probably caused by a faulty network card
locking a network switch port, and the fact that we had been pulling cables
all morning to test the resilience complicated matters.
We feel more relaxed now.
-----Original Message-----
From: Colin Bull [mailto:c.bull_at_videonetworks.com]
Sent: 24 October 2001 16:14
To: Tru64-Unix-Managers
Subject: Second cluster member locks up
We are running Tru64 5.1 PK3 with a 2-server cluster, BM1 - ES40 3GB memory,
BM2 - DS20E 2GB memory.
Both servers are connected to each other via 2 memory channel adapters to 2
memory channel hubs. They are connected to a SAN by dual redundant fibre to 2
HSG80s.
We had a previous experience where BM1 had a faulty PCI motherboard changed,
and as it came up after repair the BM2 server locked. Any sessions just
hung, including telnet and console sessions.
Today, as part of our User Testing, the power was pulled on the BM1 and
again the BM2 locked solid. All telnet sessions and the console screen just
locked. The direct IP address could be pinged, but the cluster and
application IPs failed the ping.
The BM1 started rebooting and complained that cfs_kgs_submit_join_proposal
cluster_root failed, over and over again.
After an hour we reset both servers and they both came up.
Any suggestions?
Colin Bull
DBA 2nd Floor Icon
VideoNetworks.com Watch over 1000 films AND Premier League football when
ever you want
Tel 01438 363496
Received on Wed Oct 31 2001 - 12:15:10 NZDT