Our Problem:
============
We are running Tru64 Unix V5.1B PK5. Our own application (a relational
database) is going into a "U" state and staying there permanently. We
cannot attach to it via ladebug; cannot kill it even with a -9. The
problem has happened 3 times in the last 5 days. The system is not
paging, not reporting hardware problems, etc.
Many thanks to those who responded. I got some great replies, and learned
a considerable amount. The responses are below, but I'll post the
solution here for those who just want to cut to the chase.
Solution:
=========
Upon close examination of the system (an ES40), we noticed one of our
HBA fibre cards had a solid green light on, but the amber light was NOT
flashing as it should. According to the HBA manual, when the green light
is solid and the amber light is not present, it means "failure while
functioning". The infuriating part is that we were getting no Tru64 Unix
errors (DECevent or otherwise) and no EVA SAN errors. (HP was definitely
not "phoning home to us" to let us know there was a problem.)
So this feels like perhaps it's a bug on their part. Interestingly, all
the diagnostic tests we ran (from chevron, Unix, and the EVA) indicated
the card was working. It would make sense, then, that Unix would issue
an I/O request and use that interface because it thought it was working.
But most likely the request would never return -- hence the "U" state.
We simply unplugged the cable from the HBA (redundant connection to the
EVA) and are waiting on a new card to replace it. I suppose it's
possible it could be a bad fibre cable, but in any case, ever since we
unplugged, we haven't had any more "U" processes!
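Since nothing was logged anywhere, one way to at least notice the hang
is to compare two ps snapshots taken some time apart and report any PID
that shows "U" in both. This is only a minimal sketch: the function
names u_pids and stuck_pids are made up for illustration, and it assumes
snapshot lines of the form "PID STATE COMMAND" (for example from
"ps -A -o pid,state,comm"; check ps(1) on your system).

```shell
#!/bin/sh
# Sketch: report PIDs that are in the "U" state in two ps snapshots.
# Assumption: each snapshot line looks like "PID STATE COMMAND".

# u_pids: read a snapshot on stdin, print PIDs whose state starts with U.
# Header lines are skipped because their first field is not numeric.
u_pids() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^U/ { print $1 }' | sort
}

# stuck_pids FILE1 FILE2: print PIDs that are in U state in both files.
# Writes and removes two scratch files next to the snapshots.
stuck_pids() {
    u_pids < "$1" > "$1.u"
    u_pids < "$2" > "$2.u"
    comm -12 "$1.u" "$2.u"
    rm -f "$1.u" "$2.u"
}

# Example use (as root, interval chosen arbitrarily):
#   ps -A -o pid,state,comm > /tmp/ps.1; sleep 60
#   ps -A -o pid,state,comm > /tmp/ps.2
#   stuck_pids /tmp/ps.1 /tmp/ps.2
```

A process that is briefly in "U" during normal I/O will usually drop out
of the second snapshot; one that appears in both is worth a closer look.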
Responses:
From Thierry FAIDHERBE:
=======================
An uninterruptible state means the process
performed an I/O-related system call (memory read, I/O to/from a socket,
I/O to disk, ...) that never completed, or that is waiting for a thread
to complete.
Not so easy to debug, I admit it.
I would suggest:
* Start by looking at the FS layout (local or network-attached sw).
* Monitor NIC errors/collisions and FC/SCSI-related errors from
binary.errlog.
* List what has changed in the past week (patches, ...).
You can also force a system crash when the problem occurs and have it
analyzed by your support org. (Refer to your support team for
guidelines.)
From HP Support (on how to force a crash):
==========================================
To get a "snapshot" of what the processes are doing in the kernel, you
can create crash dump files using the command "dumpsys". This keeps the
system up and running. If you want a real forced crash, you need to hit
the halt button (triangle in a circle) and then type "crash" at the >>>.
Remember to use the halt button, not the reset button. For what you
are looking at, i.e. a hung process, using dumpsys should be fine. The
files will be put in /var/adm/crash.
If you would like us to take a look, please run #sys_check -escalate
This will collect files from the system, as well as the latest
vmunix/vmzcore files (which are the crash files created).
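The collection steps described above can be strung together roughly as
follows. This is a sketch only: the run and collect_hang_data helpers
and the DRY_RUN variable are invented for illustration, while dumpsys
and sys_check -escalate are used exactly as named above, with no extra
flags assumed. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the data-collection steps for a hung process.
# DRY_RUN=1 (the default here) prints the commands instead of running
# them; set DRY_RUN=0 and run as root to actually execute them.
# CRASH_DIR is the location named in the text.
CRASH_DIR=${CRASH_DIR:-/var/adm/crash}

# run: execute a command, or just print it when in dry-run mode.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

collect_hang_data() {
    run dumpsys                # live snapshot; the system stays up
    run ls -l "$CRASH_DIR"     # confirm the new vmunix/vmzcore files
    run sys_check -escalate    # bundle system files for support
}

# Example use:
#   collect_hang_data              # dry run
#   DRY_RUN=0; collect_hang_data   # real run, as root
```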
From John Lanier:
=================
I know sometimes this can happen when:
1. The process's PPID has died.
2. The process is at an elevated IPL or, perhaps, an elevated nice
value when it fails (?).
Below is a procedure that I came across a while back that you may want
to try if kill doesn't work:
Problem:
========
How to kill processes in a "U" (Un-interruptible) state, per the "ps"
command?
Resolution:
===========
As root:
=========
#dbx -k /vmunix /dev/mem
dbx>kps (scroll through the PID list to find the applicable PID)
dbx>set $pid=PIDNUM
dbx>p $pid (verify the correct PID is selected)
dbx>p thread.interruptible
0 <---Uninterruptible
dbx>a thread.interruptible=1
1 <---Interruptible, hence killable..
dbx>quit
........
Now one can kill the applicable pid(s) using the applicable "kill" signal
(see "kill -l" or "man kill" for details).
NOTE: USE WITH CAUTION! Take care that other threads in the target process
will not be affected by this.
One way to check:
==================
#ps -Amo pid,ppid,rssize,comm,state,cputime,pcpu,psr| \
grep -i $PID | grep -v grep
**************
-A Writes information for all processes.
-m [Tru64 UNIX] Prints all threads in a task, if the task has more than
one.
-o specifier[=header],...
Specifies a list of format specifiers to describe the output format.
Multiple -o options may be specified. The final output is a concatenation
of all options specified.
[Tru64 UNIX] If the -O option is used with one or more -o options, the
-O option must appear first on the command line.
(This is from "man ps")
From Bugs:
==========
To kill a process such as a "defunct" one, or a process that won't die,
key in the following exactly as shown, replacing the pid number.
kdbx -k /vmunix
(kdbx) set $pid=<the pid you want to stop>   ### for example: (kdbx) set $pid=777
(kdbx) p (*(struct super_task *)thread.task) .proc.p_stat
3 (stat will usually be a 3)
(kdbx) a (*(struct super_task *)thread.task) .proc.p_stat=0 0
(kdbx) u
(kdbx) exit (exit after each process that you kill)
Received on Wed Aug 23 2006 - 17:42:43 NZST