I apologize in advance for the vagueness of this question. Not even sure
how to start specifying all the possibly relevant information. Just
hoping that this rings a bell and somebody can suggest a place for me to
start looking. Also, I'm in meetings all day and may be slow in
responding to anything that comes in...
I have 4 ES40s and a GS320, all running V5.1 and all members of a
TruCluster (V5.1). Things generally run quite smoothly, but when there's
any kind of problem on just one machine it seems to escalate and cause
problems on the others. Here are some of the current symptoms:
-automounter doesn't work (was saying file or directory does not exist;
now seems to just hang on a request) even though the daemon is running
on each machine (I've tried stopping and starting automount, nfsmount,
and nfs)
-I *can* mount NFS disks by hand (the same external disks that the
automounter can't mount)
-despite the fact that I have NFS disks mounted (I can cd to them, edit
files etc), a plain "df" just hangs, though "df -t advfs" will show all
the local AdvFS disks; as far as I can tell, I can access *every* NFS
disk mounted in fstab
-not positive this is related, but it coincides in time: ssh daemons
(not the client logins) suddenly start running out of control; after
some time there may be 4-8 of them, burning all the CPU and effectively
preventing any logins except at the console
This last thing is particularly irksome as it prevents logins. I need to
go onto each machine, kill every ssh daemon, stop and start ssh, and
hope. Generally, it seems that two full passes of this on each of the 5
machines solves the problem.
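
For what it's worth, picking out the runaway daemons on each machine could
be scripted roughly as below. This is only a sketch under assumptions:
that Python is available on the members, that "ps -eo pid,ppid,pcpu,comm"
prints those four columns, and that a runaway daemon shows up as high
%CPU. The function name runaway_sshd_pids and the 20% threshold are made
up for illustration.

```python
def runaway_sshd_pids(ps_output, cpu_threshold=20.0):
    """Return PIDs of sshd processes whose %CPU is at or above the threshold.

    ps_output is assumed to be text with four whitespace-separated
    columns: PID, PPID, %CPU, COMMAND (an assumption, not checked
    against any particular ps implementation).
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # blank or malformed line
        pid_text, _ppid, pcpu_text, command = fields
        try:
            pid = int(pid_text)
            pcpu = float(pcpu_text)
        except ValueError:
            continue  # header line or unparsable numbers
        # Compare on the program name only, ignoring any leading path.
        name = command.split()[0].split("/")[-1]
        if name.startswith("sshd") and pcpu >= cpu_threshold:
            pids.append(pid)
    return pids
```

The returned PIDs would still have to be killed and ssh restarted by hand
(or from the console); the sketch only does the "find them" part.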
What triggered all this? Not sure, but this seems to have been the chain
of events this last time. A user runs a *large* job requesting at least
4GB (might have been 13GB) on an ES40 with 4GB RAM and 4GB swap. The
machine effectively hangs (complains that swap is very low) and nobody
can connect to the other machines (presumably because of the ssh daemons
running amok). I reboot the one ES40 (not being quite clear on what is
happening). The machine comes up fine but the ssh daemons are running
crazy on all the machines. I finally clear up the daemons and then
discover that automounter isn't working....
Any ideas much appreciated! When this type of thing has happened before
the only solution seemed to be rebooting every cluster member....
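
One thing I could try, to narrow down whether the "df" hang is tied to a
particular mount point, is probing each mount separately with a timeout.
A rough sketch, assuming Python 3 is available; probe_mounts, the
10-second timeout and the "df -k" command are just illustrative choices:

```python
import subprocess


def probe_mounts(mount_points, timeout=10.0, command=("df", "-k")):
    """Run `command <mount point>` for each mount and classify the result.

    Returns a dict mapping each mount point to "ok", "error" (non-zero
    exit status) or "hung" (no exit within `timeout` seconds).
    """
    results = {}
    for mount_point in mount_points:
        proc = subprocess.Popen(
            list(command) + [mount_point],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        try:
            status = proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Best effort only: a process stuck waiting on NFS may not
            # exit even when killed, so we do not wait for it again.
            proc.kill()
            results[mount_point] = "hung"
            continue
        results[mount_point] = "ok" if status == 0 else "error"
    return results
```

Feeding it the mount points from fstab plus the automounter's directories
would at least show whether one entry is responsible or whether
everything NFS-related hangs.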
Chris
--
----------------------------------------------------
Chris Loken Phone: 416-978-5619
Computing Facility Manager Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics
Received on Fri Jun 01 2001 - 12:58:34 NZST