We have several new or updated clusters running Tru64 5.1a that have
exhibited some disturbing behavior. Seemingly at random, once a month or
so, some process will lock some resource, causing bunches of other
processes to hang in 'U' (uninterruptible sleep) state waiting for the
resource to free.

On the first occasion, we noticed a bunch of mail delivery processes
attempting to deliver to a single user, say abc. Any attempt to access
abc's home directory (we deliver mail there rather than to spool/mail)
would cause that session to lock up. For example, 'ls ~abc' would freeze
the terminal and there was no escape. When they say 'uninterruptible' they
mean it. There were also a bunch of IMAP processes. In that case, while
puzzling over this, I created a new home directory for abc so that new
incoming mail could be delivered and we'd hopefully stop the backlog of
processes. Shortly thereafter, with a great virtual whoosh, the resource
released, all the mail got delivered and all the IMAP processes went
away. Our stuck terminal sessions also freed up.

It's happened a few more times, though we've never been able to identify
what resource everything is waiting for, and it has usually cleared up
within 20 minutes or so. Today, however, one of the nodes on our mail
server experienced it and our MTA (PMDF from Process Software) was totally
locked up on it. After an hour and a half we crashed the machine. We've
put in a call to Compaq service, but I was wondering if anyone had any
ideas on what we might use to identify which resource is locked and/or
which process has it locked. I've tried lsof, but if it can tell me, I
haven't found the magic incantation yet.
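
As a starting point for the next occurrence, something like the sketch
below could at least list which processes are stuck in 'U' state and what
each one reports it is sleeping on. It is only a sketch under assumptions:
it expects ps-style text with a header row followed by four
whitespace-separated columns (pid, state, wait channel, command) in that
order, and I have not verified which ps output keywords on Tru64 produce
exactly that layout. The function name stuck_processes is hypothetical.

```python
def stuck_processes(ps_output):
    """Return (pid, state, wchan, command) tuples for rows in 'U' state.

    Assumes a header row followed by rows of four whitespace-separated
    columns: pid, state, wait channel, command (the command may contain
    spaces). A row with an empty wait-channel column would be misparsed,
    so this assumes ps prints a placeholder such as '-' there.
    """
    stuck = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split(None, 3)  # keep the command's spaces intact
        if len(fields) < 4:
            continue  # ignore short or malformed rows
        pid, state, wchan, command = fields
        if state.startswith("U"):  # 'U' = uninterruptible sleep
            stuck.append((pid, state, wchan, command))
    return stuck
```

Feeding it a saved ps listing from the hung node and grouping the results
by wait channel might show whether everything is queued behind the same
thing, though it will not by itself say which process holds the lock.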

Has anyone else experienced this problem?
--
_______________________________________________________________________
Rick Beebe                                        (203) 785-6416
Manager, Systems & Network Engineering       FAX: (203) 785-3978
ITS-Med Production Services
Richard.Beebe_at_yale.edu
Yale University School of Medicine
Suite 124, 100 Church Street South, New Haven, CT 06519
_______________________________________________________________________
Received on Thu Mar 21 2002 - 02:33:15 NZST