LSF started misbehaving on one of our clusters, in this case a 6-node
ES45 cluster running Tru64 5.1B PK 2. I discovered that the LSF daemons
could be contacted from outside the cluster, but not from any machine
inside the cluster.
Examining /etc/clua_services, I found that it was missing the entries
telling the cluster alias about the LSF ports, so I added these lines:
#
# LSF Ports
#
lim 3879/tcp in_noalias,static
res 3878/tcp in_noalias,static
mbatchd 3881/tcp in_noalias,static
sbatchd 3882/tcp in_noalias,static
mbdquery 40001/tcp in_noalias,static
I then ran 'cluamgr -f' on all nodes and restarted LSF on all 6 nodes
for good measure.
But the strange behaviour continues. If I try to connect to one of
these ports from outside the cluster, it works:
16:52:51 tjrc@ecs4d:~$ telnet ecs2d 3882
Trying 172.17.1.204...
Connected to ecs2d.
But if I try to connect from within the cluster, the operation times
out:
16:53:30 tjrc@ecs2c:~$ telnet ecs2d 3882
Trying 172.17.1.204...
telnet: Unable to connect to remote host: Connection timed out
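
For anyone wanting to repeat the telnet test above across all five LSF
ports rather than one at a time, here is a rough Python sketch. It
assumes Python 3 is available on the machine running the check, which
may not be true on the Tru64 nodes themselves. The names parse_services,
check_port and report are made up for illustration; only the port list
is taken from the clua_services entries quoted earlier.

```python
import socket

# Port list copied from the /etc/clua_services entries quoted in this post.
CLUA_LINES = """\
#
# LSF Ports
#
lim 3879/tcp in_noalias,static
res 3878/tcp in_noalias,static
mbatchd 3881/tcp in_noalias,static
sbatchd 3882/tcp in_noalias,static
mbdquery 40001/tcp in_noalias,static
"""


def parse_services(text):
    """Return {service_name: port} for each non-comment tcp entry."""
    services = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        port, _, proto = fields[1].partition("/")
        if proto == "tcp" and port.isdigit():
            services[fields[0]] = int(port)
    return services


def check_port(host, port, timeout=5.0):
    """Attempt a TCP connection and describe the outcome as a string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except (socket.timeout, TimeoutError):
        return "timed out"
    except OSError as exc:
        return "failed: %s" % exc


def report(host):
    """Print one line per LSF port for the given target host."""
    for name, port in sorted(parse_services(CLUA_LINES).items()):
        print("%-10s %5d  %s" % (name, port, check_port(host, port)))
```

Calling report("ecs2d") once from a host outside the cluster and once
from another cluster member should show whether every LSF port behaves
like 3882 does, or only some of them.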
Any ideas, short of rebooting the cluster, which I am reluctant to do?
Many thanks,
Tim
--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233
Received on Tue Feb 22 2005 - 17:05:40 NZDT