SOLUTION:
NIS turned out to be a red herring. Examining the rexec protocol exchange
between the client and server with tcpdump showed that the server was
dropping the connection the moment it received the secondary port number
from the client.
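For readers unfamiliar with the exchange, here is a rough Python sketch of
what the client sends on the primary connection. It is based on the classic
BSD rexecd description, which I am assuming (not asserting) matches Tru64's:
four NUL-terminated ASCII fields, the first being the secondary (stderr)
port, or 0 for none. The function names are made up for illustration and are
not part of any library.

```python
def build_rexec_request(stderr_port, user, password, command):
    """Build the bytes an rexec client sends after connecting to rexecd.

    Field order follows the BSD rexecd description (an assumption for
    Tru64): secondary port, username, password, command, each terminated
    by a NUL byte. A port of 0 means "no secondary channel".
    """
    fields = [str(stderr_port), user, password, command]
    return b"".join(f.encode("ascii") + b"\x00" for f in fields)


def parse_rexec_request(data):
    """Split a request back into (port, user, password, command).

    Handy for checking bytes copied out of a packet capture against
    what the client was expected to send.
    """
    parts = data.split(b"\x00")
    port, user, password, command = (p.decode("ascii") for p in parts[:4])
    return int(port), user, password, command
```

With a non-zero port the client then waits in accept() for the server to
connect back to that port, which is where the hang showed up.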
Closer examination of /shlib/libc.so (where rexec() resides) and rexecd on
both the client and server boxes showed that the server has a more recent
version of both rexec() and rexecd.
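One way to do this sort of comparison is to checksum the files on each box
and diff the results. A minimal Python sketch follows; the helper names are
hypothetical, and this is not necessarily how the check was done here.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def differing_files(digests_a, digests_b):
    """List paths in digests_a whose digest differs from, or is
    missing in, digests_b. Both arguments map path -> hex digest."""
    return sorted(p for p in digests_a if digests_a[p] != digests_b.get(p))
```

Run file_digest() on /shlib/libc.so and on the rexecd binary (its path varies,
so substitute your own) on each box, collect the results into one dictionary
per box, and pass both to differing_files().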
The newer version arrived with patch kit 6, which modified rexecd to accept
longer username and password strings for compatibility with Windows NT, and
in doing so "broke" rexecd.
Applying patch 789 from the latest aggregate patch kit solved the problem,
and the server immediately started behaving itself again.
A big thank you to Sergio Gelato at the University of Stockholm Astronomy
Department for gently and patiently walking me through the process of
interpreting the output of tcpdump and for getting the correct information
out of tcpdump in the first place.
Once again, this list's members have come up trumps with help, support and
advice.
Gary
ORIGINAL PROBLEM:
> We're coming across a problem with one of our applications which makes
> use of the rexec() library function.
>
> One of our processes acts as a "watchdog", starting, restarting and
> stopping a series of other processes. These processes are fired up using
> rexec(), specifying that rexec() should set up an auxiliary channel to the
> created process (i.e. the err_file_desc argument to rexec() is non-null).
>
> On our test boxes this code works perfectly, but when we transfer the code
> onto our integration test boxes the code hangs in rexec(). A bit of
> debugging shows that rexec() is waiting in accept(), presumably waiting for
> the remote server to contact the client in order to set up the control
> channel.
>
> Both boxes are GS40s running 4.0F; the only difference seems to be that
> the integration boxes are running NIS and the development boxes aren't.
>
> What's stranger is that trying to rexec() an app on the test box from the
> test box works, and trying to rexec() an app on the test box from the
> integration box works; but no attempt to rexec() an app on the integration
> box works at all, either locally (i.e. both the app and watchdog on the
> integration box) or remotely (i.e. the app on the integration box and the
> watchdog on the test box).
>
> Because of this "hang", the remote client doesn't get started up; I'm
> wondering whether there's some reverse lookup occurring in the remote
> client which is failing due to a difference between non-NIS and NIS host
> resolution.
--
Gary Gale Mail: gary.gale_at_factiva.com
UK Server Group Phone: +44 (0) 207 542 8814
Factiva, A Dow Jones & Reuters Company Web: www.factiva.com
Received on Thu Feb 07 2002 - 12:03:54 NZDT