I got a lot of helpful replies, though the unanimous bottom line is that only a reboot will make it go away -- note that I didn't say 'fix it'.
However, I did get a particularly helpful email from Whitney Latta at HP detailing the steps to take with dbx to determine what is wrong with ps:
1) Enter a dbx session (do this as root): type "dbx -k /vmunix"
2) At the "dbx>" prompt, find the pid of the ps command to examine by typing "kps"
3) Set the dbx debugger context directly to that pid by typing "set $pid=####" (where "####" is the pid number of the ps process to be examined)
4) Verify where you are by typing "pd $pid"; it should return the pid number you set above.
5) Display the process stack trace by typing the command "t".
(You can look at each hung process in turn by repeating the "set" command with the next ps command's pid, then repeating the "t" command. If the traces all look similar, the processes are blocked on the same thing.)
6) To exit the debugger, simply type "quit"
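When several processes are hung, the per-pid part of the steps above gets repetitive. The following is only a sketch that prints the commands to type at the "dbx>" prompt, in order, for a list of pids. The function name dbx_commands and the example pids are made up for illustration. The dbx commands themselves are exactly the ones quoted in the steps above, and nothing here runs dbx.

```python
def dbx_commands(pids):
    """Return the dbx commands from steps 3-6 above, repeated for each pid."""
    commands = []
    for pid in pids:
        commands.append(f"set $pid={pid}")  # step 3: switch context to this pid
        commands.append("pd $pid")          # step 4: confirm the pid that was set
        commands.append("t")                # step 5: print the stack trace
    commands.append("quit")                 # step 6: leave the debugger
    return commands


if __name__ == "__main__":
    # Hypothetical pids, as they would be read off the "kps" listing (step 2).
    for line in dbx_commands([1234, 5678]):
        print(line)
```

The pids still have to be picked out of the "kps" output by hand; this only saves retyping the same three commands for each one.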
Whitney then went on to say:
"The function calls seen on the stack will reveal what the thread was doing up to the point where it blocked and context-switched out of the cpu. Try this and send the results out to the dist."
So I did as suggested and am sending out the results of the stack:
THIS IS FOR ONE OF THE 'PS' SESSIONS:
0 thread_block() ["../../../../src/kernel/kern/sched_prim.c":3230, 0xfffffc00002ea190]
1 u_anon_lock(0xe000, 0x0, 0xfffffc0008ae38b8, 0x120000000, 0x8) ["../../../../src/kernel/vm/u_mape_anon.c":4174, 0xfffffc000062cf14]
2 u_anon_dupmcopy(0xfffffc0008ae38c0, 0xfffffc00369faf28, 0xfffffc0037245800, 0xfffffc0008ae38c0, 0x120000000) ["../../../../src/kernel/vm/u_mape_anon.c":2868, 0xfffffc000062a80c]
3 u_stack_dup(ep = (unallocated - symbol optimized away), va = (unallocated - symbol optimized away), size = (unallocated - symbol optimized away), nep = (unallocated - symbol optimized away), copy = (unallocated - symbol optimized away)) ["../../../../src/kernel/vm/u_mape_stack.c":555, 0xfffffc000063a5b8]
4 u_map_copyin(0xfffffc002a416678, 0x11fffe000, 0xfffffc0000000000, 0x120000000, 0xfffffe0451646f40) ["../../../../src/kernel/vm/vm_umap.c":2523, 0xfffffc0000655b34]
5 table(0xfffffc00002ca718, 0x11fff10c0, 0x0, 0xfffffc003b794000, 0xfffffe04008a8020) ["../../../../src/kernel/bsd/cmu_syscalls.c":1694, 0xfffffc00002577c8]
6 syscall(0x140007fd0, 0x1, 0x1000, 0x30, 0x0) ["../../../../src/kernel/arch/alpha/syscall_trap.c":725, 0xfffffc000066bf40]
7 _Xsyscall(0x8, 0x3ff800d6988, 0x14000ee30, 0x6, 0x8592) ["../../../../src/kernel/arch/alpha/locore.s":1814, 0xfffffc000066f71c]
THIS IS FOR A SESSION THAT IS HUNG AFTER HAVING ISSUED 'W':
0 thread_block() ["../../../../src/kernel/kern/sched_prim.c":3230, 0xfffffc00002ea190]
1 u_anon_lock(0xe000, 0x0, 0xfffffc000e7623b8, 0x120000000, 0x8) ["../../../../src/kernel/vm/u_mape_anon.c":4174, 0xfffffc000062cf14]
2 u_anon_dupmcopy(0xfffffc000e7623c0, 0xfffffc0015a6b4a8, 0xfffffc002549aa00, 0xfffffc000e7623c0, 0x120000000) ["../../../../src/kernel/vm/u_mape_anon.c":2868, 0xfffffc000062a80c]
3 u_stack_dup(ep = (unallocated - symbol optimized away), va = (unallocated - symbol optimized away), size = (unallocated - symbol optimized away), nep = (unallocated - symbol optimized away), copy = (unallocated - symbol optimized away)) ["../../../../src/kernel/vm/u_mape_stack.c":555, 0xfffffc000063a5b8]
4 u_map_copyin(0xfffffc002a417078, 0x11fffe000, 0xfffffc0000000000, 0x120000000, 0xfffffe0451616f40) ["../../../../src/kernel/vm/vm_umap.c":2523, 0xfffffc0000655b34]
5 table(0x26, 0xfffffc00002984ac, 0xfffffc000066bf44, 0x14000d600, 0x2000) ["../../../../src/kernel/bsd/cmu_syscalls.c":1694, 0xfffffc00002577c8]
6 syscall(0x11fffbe88, 0x1, 0x80, 0x120020a10, 0x0) ["../../../../src/kernel/arch/alpha/syscall_trap.c":725, 0xfffffc000066bf40]
7 _Xsyscall(0x8, 0x120017cf8, 0x12000fd30, 0x6, 0x858a) ["../../../../src/kernel/arch/alpha/locore.s":1814, 0xfffffc000066f71c]
They look very similar (if not identical) so I won't paste in the stack of the hung 'df' session.
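For anyone who wants to check "similar" more rigorously than by eye, here is a rough sketch that compares two traces frame by frame on function name, source file, line number and return address, ignoring the argument values (which differ between processes). The regular expression and the function name frames are my own assumptions about the trace layout shown above. It expects each frame on a single unwrapped line, and only the first two frames of each trace are included to keep it short.

```python
import re

# One frame per line: index, function(args), ["source file":line, address]
FRAME = re.compile(
    r'^\s*(\d+)\s+(\w+)\(.*\)\s+\["([^"]+)":\s*(\d+),\s*(0x[0-9a-fA-F]+)\]\s*$'
)


def frames(trace):
    """Return (function, source file, line, address) per frame, dropping arguments."""
    result = []
    for line in trace.strip().splitlines():
        match = FRAME.match(line)
        if match:
            result.append(
                (match.group(2), match.group(3), int(match.group(4)), match.group(5))
            )
    return result


# First two frames of the hung 'ps' session, copied from the trace above.
PS_TRACE = """
0 thread_block() ["../../../../src/kernel/kern/sched_prim.c":3230, 0xfffffc00002ea190]
1 u_anon_lock(0xe000, 0x0, 0xfffffc0008ae38b8, 0x120000000, 0x8) ["../../../../src/kernel/vm/u_mape_anon.c":4174, 0xfffffc000062cf14]
"""

# First two frames of the hung 'w' session, copied from the trace above.
W_TRACE = """
0 thread_block() ["../../../../src/kernel/kern/sched_prim.c":3230, 0xfffffc00002ea190]
1 u_anon_lock(0xe000, 0x0, 0xfffffc000e7623b8, 0x120000000, 0x8) ["../../../../src/kernel/vm/u_mape_anon.c":4174, 0xfffffc000062cf14]
"""

if __name__ == "__main__":
    same = frames(PS_TRACE) == frames(W_TRACE)
    print("same call path" if same else "call paths differ")
```

Matching call paths only show that the threads blocked at the same point in the kernel; they do not say what is holding the lock.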
Does this mean anything to anybody? Is there anything that can be done, or am I still left with a reboot as my only option? I probably can't reboot for a few days and the problem seems to be getting worse -- yesterday only 'ps' was hanging; today 'df' is hanging as well.
Thanks a lot!
Andy
ORIGINAL QUESTION
=================
Hi,
When we're logged in as root and issue a 'ps', nothing happens - it just hangs. I can't kill the hung processes -- they won't die. We're running 5.1A on an AS 800. Have I exhausted some process limit?
Thanks,
Andy
p.s. -- I don't see anything in /var/adm/messages or uerf.
Received on Tue Feb 10 2004 - 21:00:12 NZDT