Hello gurus,
A cry for help from a frustrated administrator (me) and the Generic NQS
maintainer (Stuart Herbert <s.herbert_at_sheffield.ac.uk>). Please help.
We are currently using the Generic Network Queueing System (GNQS) on a
cluster of Digital UNIX workstations running V3.2c, V4.0b and V4.0c all
*unpatched*. The system was configured correctly and worked fine for
several months; jobs could be submitted to local and remote queues (via
qsub) and users could quickly determine the state of the queues (via
qstat). For example, to check the state of a remote queue, the localhost
issues a request to all remote workstations and the state is returned in a
TCP packet (after some handshaking).
Soon after I took over the administration, I applied patches to two
workstations named alpha6 (AlphaStation 600 with V4.0b patch level 6) and
alpha9 (Personal Workstation 433au with V4.0c patch level 3). These
particular patches are required to fix a major problem with the debugging
part of the kernel. However, since applying these patches the remote host
does *not* send the state of its queue back to the localhost when
requested.
What have these patches done? Clearly patching the system is
responsible, since other unpatched workstations with DU V3.2c, V4.0b and
V4.0c are unaffected.
Stuart Herbert (the GNQS maintainer) has kindly looked at this problem for
some time and has produced several debugging patches in an attempt to
resolve what is happening. We have twice used a sniffer to trace the
packets between machines.
Below are some selected highlights from correspondence with Stuart Herbert
(where pol1 is an AlphaStation 1000 running V3.2c, *unpatched*):
-- cut --
1) The way NQS was designed to report this information back is very
nasty ... basically, the entire output is generated at the remote
host, and qstat merely copies it from the network straight to the
terminal.
2) The TCP/IP dumps show that alpha6 actively closes the network
connection to alpha9 very quickly after sending the TCMP_CONTINUE packet
to alpha9. This is in contrast to the conversation with pol1, where after
the TCMP_CONTINUE packet is sent to pol1, there is a much longer delay,
and then the data is sent down the wire.
3) According to the log, the NQS netdaemon thinks that it is writing the
information to the network when qstat'd from alpha9, exactly as it does
when qstat'd from pol1. However, we know that the information never goes
onto the network back to alpha9.
This is the strange thing ... I wonder just what is happening to the
data that alpha6 thinks it is writing to the network? The original
TCMP_CONTINUE packet back from alpha6 to alpha9 gets through, which
strongly suggests that the network socket is fine at this point.
-- cut --
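
To make the symptom in points 2) and 3) concrete, here is a minimal Python
sketch of the exchange as I understand it. This is NOT GNQS code: the
"STATUS" request, the text form of TCMP_CONTINUE and the function names
serve_once, relay_status and demo are all made up for illustration, and
the real packet format is different. The client simply collects whatever
arrives until the remote end closes the connection, the way qstat copies
the remote output to the terminal.

```python
import socket
import threading

# Stand-in marker only; the real NQS TCMP_CONTINUE packet layout is not shown here.
CONTINUE = b"TCMP_CONTINUE\n"


def serve_once(listener, report):
    """Accept one connection, send the continue marker, then the report (if any)."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(1024)            # read the (hypothetical) status request
        conn.sendall(CONTINUE)
        if report:
            conn.sendall(report)
        # Leaving the 'with' block closes the socket. With an empty report
        # this mimics the symptom: a close right after the continue marker.


def relay_status(host, port):
    """Return (saw_continue, payload) as a qstat-like client would see it."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"STATUS\n")
        data = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:          # empty read means the peer closed the connection
                break
            data += chunk
    saw_continue = data.startswith(CONTINUE)
    payload = data[len(CONTINUE):] if saw_continue else data
    return saw_continue, payload


def demo(report):
    """Run one request against a throwaway server on the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))    # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(listener, report))
    worker.start()
    try:
        return relay_status("127.0.0.1", port)
    finally:
        worker.join()
        listener.close()
```

Calling demo(b"queue batch: 2 running\n") stands in for the healthy
conversation with pol1: the marker arrives, followed by the report. Calling
demo(b"") stands in for what alpha9 sees from alpha6: the marker arrives
and then the connection closes with an empty payload. A test like this
between the patched and unpatched machines, outside NQS, might show whether
the data is lost below the application.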
Any help would be much appreciated. Further information can be supplied.
The GNQS homepage is: http://www.shef.ac.uk/~nqs
Full summary to follow.
Regards,
Rich
/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/ _ \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\
/_/ Richard A Bemrose /_\ Polymers and Colloids Group \_\
/_/ email: rb237_at_phy.cam.ac.uk /_\ Cavendish Laboratory \_\
/_/ Tel: +44 (0)1223 337 267 /_\ University of Cambridge \_\
/_/ Fax: +44 (0)1223 337 000 /_\ Madingley Road \_\
/_/ (space for rent) / \ Cambridge, CB3 0HE, UK \_\
/_/_/_/_/_/_/ http://www.poco.phy.cam.ac.uk/~rb237 \_\_\_\_\_\_\
"Life is everything and nothing all at once"
-- Billy Corgan, Smashing Pumpkins
Received on Wed Jun 03 1998 - 12:47:41 NZST