Cluster hang

From: Rick Beebe <richard.beebe_at_yale.edu>
Date: Wed, 20 Mar 2002 21:35:15 -0500

We have several new or updated clusters running Tru64 5.1a which have
exhibited some disturbing behavior. Seemingly at random, once a month or
so, some process will lock some resource causing bunches of other
processes to hang in 'U' (uninterruptible sleep) state waiting for the
resource to free.

On the first occasion, we noticed a bunch of mail delivery processes
attempting to deliver to a single user, say abc. Any attempt to access
abc's home directory (we deliver mail there rather than to spool/mail)
would cause that session to lock up. I.e. 'ls ~abc' would freeze up the
terminal and there was no escape. When they say 'uninterruptible' they
mean it. There were also a bunch of IMAP processes. In that case, while
puzzling over this, I created a new home directory for abc so that new
incoming mail could be delivered and we'd hopefully stop the backlog of
processes. Shortly thereafter, with a great virtual whoosh, the resource
released, all the mail got delivered and all the IMAP processes went
away. Our stuck terminal sessions also freed up.

It's happened a few more times, though we've never been able to identify
what resource everything is waiting for and it usually cleared up in 20
minutes or so. Today, however, one of the nodes on our mail server
experienced it and our MTA (PMDF from Process Software) was totally
locked up on it. After an hour and a half we crashed the machine. We've
put in a call to Compaq service but I was wondering if anyone had any
ideas on what we might use to identify what resource is locked and/or
which process has it locked. I've used lsof but if it will tell me, I
haven't found the magic incantation yet.
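
As a rough illustration of the kind of triage that might narrow this down,
the sketch below groups processes in 'U' state by wait channel, so that the
processes blocked on the same thing show up together. It assumes a ps-style
listing with four columns (pid, state, wait channel, command); that column
layout, the sample data, and the function name group_stuck are all
hypothetical, and the ps options needed to produce such a listing vary by
system and are not shown here.

```python
from collections import defaultdict


def group_stuck(ps_text, stuck_state="U"):
    """Group (pid, command) pairs by wait channel for processes whose
    state field starts with stuck_state.

    Assumes a header line followed by rows of the form:
        PID STATE WCHAN COMMAND...
    This layout is an assumption, not the output of any specific ps.
    """
    groups = defaultdict(list)
    lines = ps_text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.strip().split(None, 3)
        if len(fields) < 4:
            continue  # ignore malformed or truncated rows
        pid, state, wchan, command = fields
        if state.startswith(stuck_state):
            groups[wchan].append((int(pid), command))
    return dict(groups)


# Made-up sample listing, for illustration only.
SAMPLE = """\
PID S WCHAN COMMAND
101 U 3a4f10 deliver abc
102 U 3a4f10 imapd
103 S - sshd
104 U 77be20 ls /home/abc
"""

if __name__ == "__main__":
    for wchan, procs in sorted(group_stuck(SAMPLE).items()):
        print(wchan, procs)
```

A wait channel shared by many stuck processes is a hint about where they
are all blocked, though it does not by itself say which process holds the
resource.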

Has anyone else experienced this problem?

-- 
 
_______________________________________________________________________
      Rick Beebe                                         (203) 785-6416
      Manager, Systems & Network Engineering        FAX: (203) 785-3978
      ITS-Med Production Services             Richard.Beebe_at_yale.edu
      Yale University School of Medicine
      Suite 124, 100 Church Street South, New Haven, CT 06519
_______________________________________________________________________
Received on Thu Mar 21 2002 - 02:33:15 NZST
