Uppsala, 8-AUG-2001
Hi,
I received two very useful replies:
- tpb_at_doctor.zk3.dec.com "Dr. Thomas.Blinn_at_Compaq.com"
- patchkov_at_ucalgary.ca "Serguei Patchkovskii"
Many thanks to Thomas.Blinn_at_Compaq.com for an extensive explanation
of how to proceed with a crash dump and how to interpret the present
situation.
Proposed ways of tackling the problem:
1) Rebuild the kernel (as we are using a generic kernel), trigger a
crash dump, and examine it (see the crash-dump sketch after this list).
2) Modify the /sbin/bcheckrc script:
Make sure that your DMS clients mount the /usr file system from the
master with the hard,nintr attributes - otherwise you'll experience
random system crashes under heavy load. Unfortunately, there is no way
to make the / mount hard,nintr (an fstab sketch follows this list).
3) We run DU/Tru64 4.0D. At some point between the base release and
pl8 (the current patch revision), CPQ introduced a silent lock-up bug
in the kernel. Occasionally, while running Gaussian, a node will lock
up with no error messages in the system log, the binary error log, or
on the console. I have never seen it happen while not running
Gaussian. It is possible that a similar bug was introduced in
4.0E/4.0F.
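For reference, here is a minimal sketch of step 1 on Digital UNIX (the
configuration name TSL1 and the dump suffix .0 are assumptions; by
default savecore writes dumps to /var/adm/crash):

    # Rebuild a tailored kernel from the existing configuration file
    # and install it (doconfig prompts before overwriting anything):
    doconfig -c TSL1
    cp /sys/TSL1/vmunix /vmunix
    shutdown -r now

    # When a client locks up, force a dump from the SRM console:
    # halt the machine (Halt button or Ctrl/P), then at the >>> prompt
    # enter "crash"; savecore saves the dump during the next boot.

    # Examine the saved dump with the kernel debugger:
    dbx -k /var/adm/crash/vmunix.0 /var/adm/crash/vmcore.0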
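And a sketch of the fstab side of step 2 (the master name "dmsmaster"
and the export path are placeholders, not our real ones):

    # /etc/fstab entry on each DMS client: mount /usr from the master
    # hard and non-interruptible, so heavy load cannot abort NFS ops:
    dmsmaster:/clients/usr  /usr  nfs  ro,hard,nintr  0  0

    # Equivalently, if /sbin/bcheckrc mounts /usr itself, change its
    # mount invocation there to pass the same options:
    mount -t nfs -o ro,hard,nintr dmsmaster:/clients/usr /usr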
Because of the holidays there is no intensive load on our machines
just now, but we hope to see some good results soon.
Thanks for all your kind help,
Roger.
**************************************************************************
* Roger Ruber, ruber_at_tsl.uu.se *
* The Svedberg Laboratory, P.O. Box 533, S-75121 Uppsala, Sweden *
* +46 - 18 - 471 3109 (telephone)      +46 - 18 - 471 3833 (facsimile) *
**************************************************************************
-------------------------------------------------------------------------------
From: TSL::RUBER "Roger Ruber" 3-AUG-2001 09:09:45.08
To: IN%"tru64-unix-managers_at_ornl.gov"
CC: RUBER
Subj: DMU clients hang without explanation
Uppsala, 3-AUG-2001
Hi,
We have a Digital UNIX cluster consisting of some 30 nodes. One node
is a DMU master; the other nodes boot from it. Some of the nodes are
used for CPU-intensive calculations and I/O work, and these hang once
a month or so without leaving any trace in the error logs (uerf,
/var/adm/messages, /var/adm/syslog.dated). The other nodes show no
problems whatsoever.
I suspect that it might be due to the network traffic between the
client nodes and the DMU master node, perhaps caused by the CPU- and
I/O-intensive jobs running on these nodes. Does anybody know whether
this is a correct guess, and if so, what would be the best way to
improve the situation?
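One way to check this guess (a sketch; nfsstat and netstat are
standard Digital UNIX tools, and what counts as "high" depends on the
load) is to compare NFS client and interface statistics on an
affected node:

    # High retrans/timeout counts in the client RPC statistics suggest
    # the master or the link is dropping requests under load:
    nfsstat -c

    # Input/output errors or collisions on the network interface
    # (e.g. a duplex mismatch with the switch) show up here:
    netstat -i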
The machines in question are:
DMU master:   DPWS 600au, running Digital UNIX 4.0E
DMU client 1: DPWS 600au, 4.0E
DMU client 2: DS10,       4.0F
DMU client 3: DS10,       4.0F
DMU client 4: XP1000,     4.0F
The DMU master and clients 1, 2 and 3 are connected to the same Cisco
XL3548 switch with full duplex 100Mb/s connections. The other DMU
clients are connected to similar Cisco switches via a Gigabit backbone.
The remaining nodes are DEC 3000/300 and AlphaStation 200 machines
running Digital UNIX 4.0E and 4.0F. We have no problems with these
machines, only with clients 1-4.
Thank you for your kind help,
Roger Ruber.