Uppsala, 8-AUG-2001
Hi,
I received two very useful replies:
- tpb_at_doctor.zk3.dec.com "Dr. Thomas.Blinn_at_Compaq.com"
- patchkov_at_ucalgary.ca "Serguei Patchkovskii"
Many thanks to Thomas.Blinn_at_Compaq.com for an extensive explanation
of how to proceed with a crash dump and how to interpret the present
situation.
Proposed ways of tackling the problem:
1) Rebuild the kernel (as we are using a generic kernel), trigger a
crash dump, and examine it (see the crash-dump sketch after this list).
2) Modify the /sbin/bcheckrc script:
Make sure that your DMS clients mount the /usr file system from the
master with the hard,nintr attributes - otherwise you'll experience
random system crashes under heavy load. Unfortunately, there is no way
to make the / mount hard,nintr (an fstab sketch follows this list).
3) We run DU/Tru64 4.0D. At some point between the base release and
pl8 (the current patch revision), CPQ introduced a silent lock-up bug
in the kernel. Occasionally, while running Gaussian, a node will lock
up with no error messages in the system log, the binary error log, or
on the console. I have never seen it happen while not running
Gaussian. It is possible that a similar bug was introduced in
4.0E/4.0F.
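For reference, here is a minimal sketch of step 1 on Digital UNIX (the
configuration name TSL1 and the dump suffix .0 are assumptions; by
default savecore writes dumps to /var/adm/crash):

    # Rebuild a tailored kernel from the existing configuration file
    # and install it (doconfig prompts before overwriting anything):
    doconfig -c TSL1
    cp /sys/TSL1/vmunix /vmunix
    shutdown -r now

    # When a client locks up, force a dump from the SRM console:
    # halt the machine (Halt button or Ctrl/P), then at the >>> prompt
    # enter "crash"; savecore saves the dump during the next boot.

    # Examine the saved dump with the kernel debugger:
    dbx -k /var/adm/crash/vmunix.0 /var/adm/crash/vmcore.0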
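And a sketch of the fstab side of step 2 (the master name "dmsmaster"
and the export path are placeholders, not our real ones):

    # /etc/fstab entry on each DMS client: mount /usr from the master
    # hard and non-interruptible, so heavy load cannot abort NFS ops:
    dmsmaster:/clients/usr  /usr  nfs  ro,hard,nintr  0  0

    # Equivalently, if /sbin/bcheckrc mounts /usr itself, change its
    # mount invocation there to pass the same options:
    mount -t nfs -o ro,hard,nintr dmsmaster:/clients/usr /usr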
Because of the holidays there is no intensive load on our machines
just now, but we hope to see some good results soon.
Thanks for all your kind help,
Roger.
**************************************************************************
* Roger Ruber, ruber_at_tsl.uu.se *
* The Svedberg Laboratory, P.O. Box 533, S-75121 Uppsala, Sweden *
* +46 - 18 - 471 3109 (telephone)      +46 - 18 - 471 3833 (facsimile) *
**************************************************************************
-------------------------------------------------------------------------------
From: TSL::RUBER "Roger Ruber" 3-AUG-2001 09:09:45.08
To: IN%"tru64-unix-managers_at_ornl.gov"
CC: RUBER
Subj: DMU clients hang without explanation
Uppsala, 3-AUG-2001
Hi,
We have a Digital UNIX cluster consisting of some 30 nodes. One node
is a DMU master; the other nodes boot from it. Some of the nodes are
used for CPU-intensive calculations and I/O work, and these hang once
a month or so without leaving any trace in the error logs (uerf,
/var/adm/messages, /var/adm/syslog.dated). The other nodes show no
problems whatsoever.
I suspect that it might be due to the network traffic between the
client nodes and the DMU master node, perhaps caused by the CPU- and
I/O-intensive jobs running on these nodes. Does anybody know whether
this is a correct guess, and if so, what would be the best way to
improve the situation?
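One way to check this guess (a sketch; nfsstat and netstat are
standard Digital UNIX tools, and what counts as "high" depends on the
load) is to compare NFS client and interface statistics on an
affected node:

    # High retrans/timeout counts in the client RPC statistics suggest
    # the master or the link is dropping requests under load:
    nfsstat -c

    # Input/output errors or collisions on the network interface
    # (e.g. a duplex mismatch with the switch) show up here:
    netstat -i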
The machines in question are:
DMU master:   DPWS 600au, running Digital UNIX 4.0E
DMU client 1: DPWS 600au, 4.0E
DMU client 2: DS10,       4.0F
DMU client 3: DS10,       4.0F
DMU client 4: XP1000,     4.0F
The DMU master and clients 1, 2 and 3 are connected to the same Cisco
XL3548 switch with full duplex 100Mb/s connections. The other DMU
clients are connected to similar Cisco switches via a Gigabit backbone.
The remaining nodes are DEC 3000/300 and AlphaStation 200 machines
running Digital UNIX 4.0E and 4.0F. We have no problems with these
machines, only with clients 1-4.
Thank you for your kind help,
Roger Ruber.