SUMMARY: Tru64 5.1 - Uninterruptible sleeping process

From: Aldridge, Robert E. <REAldridge_at_mcdermott.com>
Date: Mon, 26 Nov 2001 12:13:23 -0600

Tru64 Managers,

Thanks to Thomas Blinn and Alan Davis for taking a look at this problem.
They pointed out that isolating the problem would require dbx, but that it
would be difficult to get meaningful information.

I ended up calling Compaq support on this issue. It turned out to be a bug
associated with the defragment process. Here's an extract from the
Compaq diagnosis:

===========================================================
...As it turns out, these processes were the victims of something else that
had blocked access to certain filesystem resources. A number of
"defragment" threads were running on the system, which had been started
by the "defragcron" crontab entry, at 01:01am:

1 1 * * * test -x /usr/sbin/defragcron && /usr/sbin/defragcron -p -l /usr/adm/defragcron


Defragment is a multi-threaded process, and spawns multiple threads to
actually perform the defragmentation operations on all AdvFS domains.
Normally, this is not a problem and the defragment operations run and
should complete fairly quickly, depending on size and fragmentation of
the domains. In this case however, 2 of the defragment threads
themselves became stuck in an unending "uninterruptible" loop... and in
particular, one that was running like this, while operating on the
shared "cluster_root" fileset, resulted in the condition that hung up the
"launcher" processes that were started after 7am (when the defragments
"should" have already completed their tasks!). The launcher processes
would have made progress had the defragment thread completed its task
and released the mutex... the code they blocked in would wait forever,
uninterruptibly, until this occurred... or the system was rebooted.

...

The issue with the defragment thread running on cluster_root has been
identified and is fixed in the soon to be released patchkit#4 (BL18) for
Tru64 Unix V5.1....
===========================================================
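
A quick way to see whether any defragment processes from the 01:01 cron
run are still around later in the morning is to filter ps output for them.
The following is only a sketch, not part of Compaq's diagnosis: the function
name find_defragment is made up for illustration, and it assumes a ps that
accepts the POSIX -o keywords pid, etime and comm.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -o pid,etime,comm" style output on stdin
# and print the PID and elapsed time of every process whose command name
# contains "defragment". The first line is skipped as the column header.
find_defragment() {
    awk 'NR > 1 && $3 ~ /defragment/ { print $1, $2 }'
}

# Typical use (assumes ps accepts the POSIX -o keywords used here):
#   ps -e -o pid,etime,comm | find_defragment
```

An elapsed time of many hours for a job started at 01:01 would suggest the
defragment run has not finished when it normally should have.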


I did learn some things during the process of tracking down these issues.


Showing process threads...
dbx -k vmunix
set $pid=0123456789
t (threads command)


Dumping an image of the running system *without forcing a crash*
dumpsys -s /var/adm/crash


Accumulating crash information for sys_check...
crashdc vmunix.n vmzcore.n > crash-data.n



Thanks again for the assistance.

Robert
..................................


ORIGINAL question:

Tru64 Managers,

I searched the archives for this problem -- processes in "U" state.

I found the hint to use "ps l" to find the wait state.

My question is -- how do I interpret the wait state?

What is this "ca8c608a" that's identified in my sample output below?


0 545688 545259 0 44 0 2.65M 304K ca8c608a U + pts/13 0:00.02 ./abaqus
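
For reference, the hexadecimal value just before the "U" state in the line
above appears to be the WCHAN column, i.e. the kernel address of the event
the process is waiting on. A rough sketch for pulling out just the PID and
wait channel of every process in "U" state follows. It assumes the column
layout of the sample line (PID in field 2, WCHAN in field 9, state in
field 10), and the function name list_u_state is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "ps l" style output on stdin and print the PID
# and wait channel of every process whose state starts with "U".
# Field positions are an assumption based on the sample line above.
list_u_state() {
    awk '$10 ~ /^U/ { print $2, $9 }'
}

# Typical use:
#   ps l | list_u_state
```

The address itself only says where in the kernel the process is sleeping;
turning it into something meaningful still needs a kernel debugger.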
Received on Mon Nov 26 2001 - 18:14:08 NZDT
