Greetings Managers,
I have an unusual problem that I don't know how to solve.
We are currently running DU V4.0D, and we have a
monitor program that runs on a remote host and uses the
"rexec" call to do a general health check of
our Alpha system.
The following is the scenario I run into:
The monitor program has a timeout built into it, so the
"rexec" call is interrupted if there is no response
within a certain time period. On our DU V4.0D
AlphaServer this leaves behind an "rshd" that uses 100% CPU (as
shown by top) and does not die! This happens ONLY on the Digital
UNIX server; the monitor program does the same thing on Suns,
Crays, etc., but they do not show this symptom at all.
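
As a rough illustration of the timeout pattern described above (this
is not the actual monitor code), here is a minimal Python sketch that
assumes the timeout is implemented with an alarm signal. The names
run_with_timeout and CheckTimeout are hypothetical, and check stands
in for the blocking "rexec" call. It only shows the client side: when
the timer fires, the caller abandons the call, and whatever was
started on the remote host is left to notice that on its own.

```python
import signal
import time


class CheckTimeout(Exception):
    """Raised inside the blocking call when the timer expires."""


def run_with_timeout(check, timeout_s):
    """Run check(); give up and return None after timeout_s seconds.

    Hypothetical helper: check stands in for the blocking remote call.
    Must be called from the main thread on a Unix-like system.
    """
    def on_alarm(signum, frame):
        raise CheckTimeout()

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout_s)
    try:
        return check()
    except CheckTimeout:
        # The caller stops waiting here; nothing tells the remote
        # side to clean up beyond the connection being dropped.
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
        signal.signal(signal.SIGALRM, old_handler)
```

With this shape, a slow server (for example one that is short of
memory) turns directly into an abandoned call on the monitor side.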
This usually happens when our Digital UNIX system is running
low on memory and takes a long time to execute the "rexec"
calls, which in turn causes the monitor program to time out,
leaving these runaway "rshd" processes behind.
Is this behaviour normal, or is it a bug in the "rshd" for
Digital UNIX systems? Apart from a "cron" job that cleans up
the stray processes, any suggestions or ideas for solving the
problem are welcome.
We have tried increasing the timeout value in the monitor
program, but every now and then we still run into this!
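
For reference, the "cron" job workaround mentioned above could be
built around something like the following. This is a minimal sketch
under stated assumptions, not a tested fix: it assumes "ps" can be
asked for three columns (PID, percent CPU, command name), for example
with the POSIX-style "ps -e -o pid,pcpu,comm", and that this works on
the system in question. The function name find_runaway_rshd and the
90% threshold are hypothetical. It only picks out candidate PIDs;
deciding whether to kill them is left to the caller.

```python
import os


def find_runaway_rshd(ps_output, cpu_threshold=90.0):
    """Return PIDs of rshd processes at or above cpu_threshold % CPU.

    ps_output is assumed to be text with three whitespace-separated
    columns per line: PID, %CPU, command name.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid_text, cpu_text, command = fields
        try:
            pid = int(pid_text)
            cpu = float(cpu_text)
        except ValueError:
            continue  # header line or malformed row
        # Accept either a bare name or a full path to the daemon.
        name = os.path.basename(command.split()[0])
        if name == "rshd" and cpu >= cpu_threshold:
            pids.append(pid)
    return pids
```

A single CPU reading can also catch an "rshd" that is legitimately
busy, so a real job would probably want to see the same PID over the
threshold on two or more consecutive runs before acting on it.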
Thanks for any help
Mohan
Received on Tue Dec 01 1998 - 16:26:44 NZDT