I asked for some advice on how to troubleshoot my cluster failover
problem and received an excellent answer.
Thanks to Steven Hancock. Below is an extract of his reply. My original
question follows.
As it's difficult at the moment to get downtime for troubleshooting, I
can't yet say what is causing my problem, but I'm sure the pointers from
Steve will lead to a solution.
Regards,
Carlos
Steve's suggestions:
1) The TruCluster 1.x software uses a standard SCSI reservation
mechanism to ensure that multiple initiators do not access
the device simultaneously. This is particularly important during the
critical period of a service failover. The devices connected
as your shared storage must support the SCSI reserve and release
functions. We have had a couple of cases where a third-party
vendor did not support this and we had to leave the problem with
that vendor to solve. I suggest checking with EMC and making sure
they support this and you have their latest firmware installed.
2) To validate the problem, see if you can reproduce it, then
first try to read the disklabel on the device and check whether
that works. The disklabel command will usually fail with an
"I/O Error" when a reservation problem exists. Next, try using
the SCU command to issue a SCSI release command, similar to the
following:
# scu -f /dev/rrz120a
(scu) release device
(scu) quit
Then try to see if you can read the disklabel again. If it succeeds,
you have validated your theory with respect to the storage.
3) If you validated the SCSI reserve problem in #2 and the vendor
says they are okay for #1, then make sure you have the
AUTO_ACTION console variable set to "HALT".
4) Set the logging level of your ASE to "Informational" from the
default level of "Notice" and reproduce the problem again. Then, check
the kern.log and daemon.log carefully for other clues as to what the
problem could be. Sometimes, it's not what you think.
5) If the problem is not a storage reservation issue, you should
make sure your start/stop scripts and timeouts are set appropriately
for bullet-proof failover. We find many customers don't have these
properly set up or tested, and they don't work when the time for
failover comes.
6) Install the latest Tru64 and TruCluster patches in case there is
something we've already fixed that solves this problem. It may not
always be obvious from the patch kit READMEs whether a patch will
be relevant to your problem or not. We strongly recommend you
install all of the patches even though the patch tool will allow
you to install them individually.
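
The checks in suggestions 2 and 4 can be sketched as a small shell
script. This is only an illustration and not part of Steve's reply. The
function names find_reserve_errors and check_reservation are made up for
this example, the disk and device names are the ones used in the example
above, and it assumes that "disklabel -r" accepts the disk name and that
scu reads its commands from standard input. The location of daemon.log
depends on your syslog setup, so it is passed in as an argument.

```shell
#!/bin/sh
# Sketch only: helper functions for suggestions 2 and 4.
# The function names are hypothetical; adjust device names for your system.

# Suggestion 4: list reservation-related ASE messages in a log file.
# $1 = path to daemon.log (location is site-specific).
find_reserve_errors() {
    grep -i "reserve" "$1"
}

# Suggestion 2: read the label, release the reservation, read it again.
# $1 = disk name for disklabel (e.g. rz120)
# $2 = raw device for scu (e.g. /dev/rrz120a)
check_reservation() {
    disk=$1
    rawdev=$2

    if disklabel -r "$disk" >/dev/null 2>&1; then
        echo "disklabel OK before release: no reservation conflict seen"
        return 0
    fi

    echo "disklabel failed before release; issuing SCSI release"
    # Feed scu the same commands as the interactive session above.
    # Assumption: scu reads its commands from standard input.
    scu -f "$rawdev" <<EOF
release device
quit
EOF

    if disklabel -r "$disk" >/dev/null 2>&1; then
        echo "disklabel OK after release: reservation problem confirmed"
        return 0
    fi

    echo "disklabel still failing after release: look elsewhere"
    return 1
}
```

With the functions loaded, one would run find_reserve_errors on the
daemon.log of the surviving node, then check_reservation rz120
/dev/rrz120a on that same node while the problem is present.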
>From: Carlos Chua <chuacarlos_at_hotmail.com>
>To: tru64-unix-managers_at_ornl.gov
>Subject: Cluster service won't relocate automatically
>Date: Sun, 15 Oct 2000 21:10:03 +0000 (SGT)
>
>Hi,
>
>I currently have a cluster where I can relocate a disk service manually
>thru asemgr, but cannot get the service to failover automatically when
>one of the servers is down. I tried the automatic failover by shutting
>down the server (NodeA) while the service is still online in order to
>simulate a server failure. A check on the daemon.log on the other server
>(NodeB) shows these messages:
>
>NodeB ASE: NodeB Agent Notice: Starting service PP1-sap
>NodeB ASE: NodeB Agent Notice: didn't reserve device
>NodeB ASE: NodeB Error: AM can't reserve device /dev/rz120g
>NodeB ASE: NodeB Agent Warning: can't reserve /dev/rz120g
>
>
>It seems to me that during a manual relocation, NodeA will release the
>device so that NodeB can takeover. However, a shutdown on NodeA doesn't
>seem to be able to release the device even if I tried to power off NodeA.
>
>The disk service favored member list contains both NodeA and NodeB, so I
>think the setup should be Ok.
>
>Below is my environment:
>NodeA - GS60, UNIX V4.0F, Trucluster V1.6
>NodeB - ES40, UNIX V4.0F, Trucluster V1.6
>Storage controller - KZPBA, with EMC Powerpath V1.5.0.6
>Storage - EMC Symmetrix
>
>Any advice or suggestion on how to troubleshoot this problem is greatly
>appreciated.
>
>Thanks in advance,
>
>Carlos
>chuacarlos_at_hotmail.com
Received on Wed Oct 18 2000 - 07:46:31 NZDT