I got a follow-up from Whitney Latta at HP, which helped pinpoint the root cause; an extract is below.
I had already rebooted the system, but the problem recurred when I tried the same ar command again, so it was reproducible, and this time I was able to get a thread trace using dbx.
It appears the lockup was caused by failed access to a third-party filesystem (ClearCase mvfs remote over the network). The same ar command subsequently worked fine from a local disk.
Thanks again for your help!
- Iain
-----Original Message-----
Sent: Wednesday, 04 August, 2004 12:48
Subject: RE: SUMMARY: Process stuck at 99% and cannot kill it
Hello,
Regarding the runaway process... it will be important to determine what exactly it is doing in order to understand why it is running away like this. The fact that "kill -9 {pid}" failed to terminate the process indicates it is running in an "uninterruptible" state... meaning the underlying functions cannot be terminated, and must run to completion (or system reboot). Assuming the process can accept signals, kill -9 is the most powerful of all signals... if the process doesn't respond to that signal, it won't respond to any others.
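As a small illustration of why kill -9 is the last resort: a process cannot catch or ignore SIGKILL, so when a process survives it, the cause lies on the kernel side rather than in the program's own signal handling. The sketch below is not from the original exchange; it assumes a POSIX system with Python 3, and the helper name can_install_handler is made up for illustration.

```python
import signal

def can_install_handler(signum):
    """Return True if this process is allowed to change the disposition of signum.

    Hypothetical helper for illustration only.
    """
    try:
        previous = signal.signal(signum, signal.SIG_IGN)
    except (OSError, ValueError):
        # The kernel refuses to let a process ignore or catch this signal.
        return False
    if previous is not None:
        # Put the original disposition back so the demo has no side effects.
        signal.signal(signum, previous)
    return True

if __name__ == "__main__":
    print("SIGTERM can be ignored:", can_install_handler(signal.SIGTERM))
    print("SIGKILL can be ignored:", can_install_handler(signal.SIGKILL))
```

A process blocked inside a kernel call that cannot be interrupted still has the SIGKILL pending against it; the signal is only acted on once that call returns.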
I'm curious why dbx(1) "locked up" when you tried to examine the running process. The method below should allow you to obtain a stack trace even on a thread that has "run away" like this (this assumes the system as a whole still has cpu cycles remaining to run dbx properly).
Here is how I would use dbx(1) to examine a running thread:
1) First find the "pid" number of the process (pid=4559 from the ps output below).
2) Next invoke dbx on the running kernel, by typing "dbx -k /vmunix"
3) At the "dbx>" prompt, type: "set $pid=4559"
4) If this process is "single-threaded", then type "t" to get the stack trace.
4a) If this process is "multi-threaded", type "tlist" to obtain a list of the thread addresses.
4b) Since each individual thread is a schedulable entity, set dbx context to each thread in turn, by typing "tset {thread-address}", followed by "t" to get the stack trace. (Note "thread-address" will be the 64-bit hex address displayed by the "tlist" command above, for multi-threaded processes.)
5) Once you have found the thread that is running away, run the "t" command several times in a row... what you will want to look for is whether the thread appears to be looping through the same series of function calls over and over, or is making some kind of progress through a linked-list or some other areas of code.
The above dbx operations should return very quickly, if the system as a whole has resource capacity.
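The comparison described in step 5 can be sketched roughly as follows. This is only an illustrative heuristic, not part of dbx: it assumes each "t" output has already been reduced by hand to a list of function names, and the function name classify_samples and the frame names in the usage example are hypothetical.

```python
from collections import Counter

def classify_samples(samples):
    """Roughly classify repeated stack traces taken from one thread.

    samples: list of traces, each a list of function names (innermost first).
    Hypothetical helper; the three labels are a heuristic, not a dbx feature.
    """
    if len(samples) < 2:
        return "need more samples"
    distinct = Counter(tuple(trace) for trace in samples)
    if len(distinct) == 1:
        # Every sample is identical: the thread is sitting at one spot.
        return "stuck"
    if len(distinct) < len(samples):
        # Some traces repeat: the thread is revisiting the same call paths.
        return "looping"
    # Every sample differs: the thread may be making progress.
    return "progressing"

if __name__ == "__main__":
    # Made-up frame names, purely for illustration.
    traces = [
        ["func_c", "func_b", "func_a"],
        ["func_d", "func_b", "func_a"],
        ["func_c", "func_b", "func_a"],
    ]
    print(classify_samples(traces))
```

More samples give a more reliable picture; with only a handful of traces, "progressing" may simply mean the loop is longer than the sampling window.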
Another alternative, if dbx(1) still fails to yield results, is to run "truss" or "trace" against the running process. Truss is part of the SysV environment... and is optional (so, it may not be on the system now; it can be loaded on later). The "trace" utility is a standalone executable and is available in the public domain. I have a version built for V5 and can make it available on our ftp-server. It can be run against a running process and, similar to "truss", will return the system calls and their exit codes (which can be captured in a logfile for easy review).
I would suspect that something is failing and looping through some low-level kernel-mode calls to palcode.
I hope this is helpful.
Regards,
Whitney Latta
Complex Problem Mgr
HP Global Solutions Engineering
Received on Wed Aug 04 2004 - 23:26:28 NZST