Uppsala, 3-AUG-2001
Hi,
We have a Digital UNIX cluster consisting of some 30 nodes.
One node is a DMU master, and the other nodes boot from it. Some
of the nodes are used for CPU-intensive calculations and I/O work, and
these hang about once a month without leaving any trace in the error logs
(uerf, /var/adm/messages, /var/adm/syslog.dated). The other nodes
show no problems whatsoever.
I suspect that the hangs might be due to network access between the client
nodes and the DMU master node. This may be due to the CPU- and I/O-intensive
jobs running on these nodes. Does anybody know if this is
a correct guess, and if so, what would be the best way to try to
improve the situation?
The machines in question are:
DMU master:   DPWS 600au, running Digital UNIX 4.0E
DMU client 1: DPWS 600au, 4.0E
DMU client 2: DS10, 4.0F
DMU client 3: DS10, 4.0F
DMU client 4: XP1000, 4.0F
The DMU master and clients 1, 2 and 3 are connected to the same Cisco
XL3548 switch with full-duplex 100 Mb/s connections. The other DMU
clients are connected to similar Cisco switches via a Gigabit backbone.
The remaining nodes are DEC 3000/300 and AlphaStation 200 machines
running Digital UNIX 4.0E and 4.0F. We have no problems with these
machines, only with the clients #1-4.
Thank you for your kind help,
Roger Ruber.
**************************************************************************
* Roger Ruber, ruber_at_tsl.uu.se *
* The Svedberg Laboratory, P.O. Box 533, S-75121 Uppsala, Sweden *
* +46 - 18 - 471 3109 (telephone) (facsimile) +46 - 18 - 471 3833 *
**************************************************************************
Received on Fri Aug 03 2001 - 07:10:39 NZST