SUMMARY: kernel memory faults on Alphaserver 1200 4.0d PK3 from Chris Wigglesworth on 1999-04-24 (tru64-unix-managers)

From: Chris Wigglesworth <c.wigglesworth_at_open.ac.uk>
Date: Fri, 23 Apr 1999 17:19:09 +0100

Once again this list proves invaluable. Although my problem isn't solved, my
question has been answered. To put it very simply, it appears that the
problem is related to dtfile. I have discouraged the use of dtfile and the
machine has not crashed since, but then again it has only been a couple of
days. If the machine remains up for a reasonable length of time I will post
an update.

Many thanks to the following people who replied with help:

Harald Baumgartner
Tom Blinn
Doreen Fletcher
Alan Rollow
Bob Vickers

Their replies to my original message follow, with my comments in parentheses;
my original message comes after them:

Harald Baumgartner [hmb_at_xray.mpe.mpg.de]

A similar problem was a defective disk (AdvFS).

Try:

vncheck domain      (check all domains of this server)

for example:
vncheck data02b_dmn#data02b

If the machine crashes while checking one of the domains, or the check
reports errors, remove the disk and restore it from backup.

Dr. Tom Blinn [tpb_at_doctor.zk3.dec.com]

(Currently as noted above I have discouraged the use of dtfile. Many thanks
to Tom Blinn for the time he took to look at the crash data)

Anyway, I've looked through your crash data and I'll tell you what I see; it
is not a pretty picture:

> _host_version: Digital UNIX V4.0D (Rev. 878); Sat Apr 17 14:51:02 BST 1999
> _crash_version: Digital UNIX V4.0D (Rev. 878); Sat Apr 17 14:51:02 BST 1999

This means the kernel that was running when you rebooted is the same kernel
that was running when you crashed, which means that the crash data analysis
is as good as it's likely to get. If these two lines don't match, then the
"vmunix" file copied into /var/adm/crash doesn't really match what's in the
core file extracted from the swap area, and consequently the analysis might
not be correct.
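
The version comparison described above can be automated. What follows is a
minimal sketch, assuming the crash-data file is plain text with one
"_name: value" pair per line as in the excerpt; the function name
kernel_versions_match is made up for illustration and is not part of any
Digital UNIX tool.

```python
import re


def kernel_versions_match(crash_data: str) -> bool:
    """Return True if _host_version and _crash_version carry the same string.

    Assumes one "_name: value" pair per line, optionally prefixed by the
    "> " quoting that mail clients add.
    """
    values = {}
    for line in crash_data.splitlines():
        # Drop any leading mail-quote characters and surrounding whitespace.
        line = line.lstrip("> ").strip()
        m = re.match(r"(_host_version|_crash_version):\s*(.*)", line)
        if m:
            values[m.group(1)] = m.group(2).strip()
    # Both fields must be present, and they must be identical.
    return (
        "_host_version" in values
        and values.get("_host_version") == values.get("_crash_version")
    )


sample = """\
_host_version: Digital UNIX V4.0D (Rev. 878); Sat Apr 17 14:51:02 BST 1999
_crash_version: Digital UNIX V4.0D (Rev. 878); Sat Apr 17 14:51:02 BST 1999
"""
print(kernel_versions_match(sample))  # True
```

If this returns False, the analysis in the rest of the crash-data file
should be treated with suspicion, for the reason given above.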

> _ncpus: 1

Running a single CPU configuration, that always makes life easier in trying
to do the crash analysis.

> _panic_string: 0xfffffc00006d07d8 = "kernel memory fault"

This confirms it was a kernel memory fault that brought the system down.

The following are the messages that were displayed in the period just prior
to the panic; nothing out of the ordinary, as far as I can see (other than a
few warning messages out of AdvFS that are useless and can be ignored):

> SuperLAT. Copyright 1994 Meridian Technology Corp. All rights reserved.
> chk_bf_quota: user quota underflow for user 183 on fileset /
> chk_bf_quota: user quota underflow for user 183 on fileset /
> chk_bf_quota: user quota underflow for user 183 on fileset /
> NFS server: stale file handle fs(2132,3856) file 128531 gen 32777
> RFS3_ACCESS, client address = 137.108.72.21, errno 22
> chk_bf_quota: user quota underflow for user 180 on fileset /
> chk_bf_quota: group quota underflow for group 7 on fileset /

and here is the panic signature:

> trap: invalid memory ifetch access from kernel mode
>
> faulting virtual address: 0x0000000000000008
> pc of faulting instruction: 0x0000000000000008
> ra contents at time of fault: 0x0000000000000008
> sp contents at time of fault: 0xffffffff903c3a28
>
> panic (cpu 0): kernel memory fault
> syncing disks... 8 done
> device string for dump = SCSI 1 4 0 0 0 0 0.
> DUMP.prom: dev SCSI 1 4 0 0 0 0 0, block 262862
> device string for dump = SCSI 1 4 0 0 0 0 0.
> DUMP.prom: dev SCSI 1 4 0 0 0 0 0, block 262862

So, while something was running, you took a memory management trap because
of an invalid memory instruction fetch while running in the kernel, and the
reported address where you were running when this happened is 8, which is a
totally invalid address. To make matters worse, the reported PC where you
were running is also 8, and the RA is 8. The SP (stack pointer) might be a
useful piece of data, in conjunction with the full crash dump file, since a
trashed stack could have the garbaged data that's reported in the trap. In
any case, these are "impossible" values; if in fact you managed to branch to
address 8 or try an instruction fetch from address 8, that would cause this
panic.
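
The pattern in the panic signature (fault address, PC and RA all equal and
all tiny) can be checked mechanically. The sketch below is only an
illustration: the function names and the 0x1000 cut-off for a "low" address
are assumptions, not anything the kernel or crash tools define.

```python
# Labels as they appear in the panic signature, mapped to short names.
FIELDS = {
    "faulting virtual address": "va",
    "pc of faulting instruction": "pc",
    "ra contents at time of fault": "ra",
    "sp contents at time of fault": "sp",
}


def parse_trap(text):
    """Extract the four register values from a panic signature as integers."""
    regs = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()      # drop mail quoting
        label, sep, value = line.partition(":")
        key = FIELDS.get(label.strip())
        if key and sep:
            regs[key] = int(value.strip(), 16)
    return regs


def looks_like_wild_branch(regs, low_limit=0x1000):
    """Hypothetical heuristic: va == pc == ra and all below low_limit.

    low_limit is an arbitrary assumption chosen for this sketch.
    """
    return (
        all(k in regs for k in ("va", "pc", "ra"))
        and regs["va"] == regs["pc"] == regs["ra"]
        and regs["pc"] < low_limit
    )


signature = """\
> trap: invalid memory ifetch access from kernel mode
>
> faulting virtual address: 0x0000000000000008
> pc of faulting instruction: 0x0000000000000008
> ra contents at time of fault: 0x0000000000000008
> sp contents at time of fault: 0xffffffff903c3a28
"""
regs = parse_trap(signature)
print(looks_like_wild_branch(regs))  # True
```

A True result only flags the signature as consistent with a jump through a
bad pointer; it says nothing about which pointer was bad.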

The next useful item comes later in the file:

> _current_pid: 28413

The last user mode process running is reported to be PID 28413, and looking
in the table of _kernel_process_status that appears to be dtfile:

> _kernel_process_status_begin:
> PID COMM
> 28413 dtfile

but it's not clear whether that PID had been running a long time or was just
starting. Whoever was running the CDE session appears to have had "puzzle"
in their startup, since it had one of the next available PIDs.
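
The lookup described above (take _current_pid, then find it in the process
status table) could be scripted roughly as follows. This is a sketch under
assumptions: it treats the table as two columns (PID, COMM) as in the
excerpt, and it assumes a "_kernel_process_status_end" marker closes the
table, by analogy with "_proc_thread_list_end"; the real file may differ.

```python
def current_process(crash_data):
    """Return (pid, command) for the process named by _current_pid."""
    pid = None
    in_table = False
    commands = {}
    for raw in crash_data.splitlines():
        line = raw.lstrip("> ").strip()       # drop mail quoting
        if line.startswith("_current_pid:"):
            pid = int(line.split(":", 1)[1])
        elif line.startswith("_kernel_process_status_begin"):
            in_table = True
        elif line.startswith("_kernel_process_status_end"):  # assumed marker
            in_table = False
        elif in_table:
            parts = line.split(None, 1)
            # Skip the "PID COMM" header and anything not starting with a PID.
            if len(parts) == 2 and parts[0].isdigit():
                commands[int(parts[0])] = parts[1]
    return pid, commands.get(pid)


excerpt = """\
> _current_pid: 28413
> _kernel_process_status_begin:
> PID COMM
> 28413 dtfile
"""
print(current_process(excerpt))  # (28413, 'dtfile')
```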

There is information on the current thread, and that thread is probably part
of the kernel's support for "dtfile", but without wading into data structures
that aren't here, it's hard to tell for sure. But I would not be surprised
if "dtfile" was trying to do some kind of file operation when the kernel
died. This is TOTALLY a SWAG.

> _current_tid: 0xfffffc000931a000
> _proc_thread_list_begin:
> thread 0xfffffc000931a000 stopped at [boot:1890 ,0xfffffc00005364d8] Source not available
> _proc_thread_list_end:

The information from the stack trace follows, but it's really not informative;
at least, I can't make much sense out of it.

> _kernel_memory_fault_data_begin:
> struct {
> fault_va = 0x8
> fault_pc = 0x8
> fault_ra = 0x8
> fault_sp = 0xffffffff903c3a28
> access = 0xffffffffffffffff
> status = 0x0
> cpunum = 0x0
> count = 0x1
> pcb = 0xffffffff903c3a38
> thread = 0xfffffc000931a000
> task = 0xfffffc0005aca000
> proc = 0xfffffc0005aca220
> }

With more of the kernel data, it MIGHT be possible to figure out more of what
was running at the time of the fault; or maybe not.

> _uptime: 89.74 hours

Well, at least it doesn't happen right away (small consolation).

Anyway, there's not enough in the crash data to be sure, but it looks like in
the course of running a thread (in kernel context), something went totally
wrong and resulted in a branch to address 8. I can imagine this happening
if a pointer to an action routine got trashed (e.g., because the structure
in which it was located got freed, or something over-wrote good data with
bad, or something failed to fill in a pointer that should have been there)
and the branch was really supposed to be to an entry point plus 8; if the
entry point in the vector was never filled in, then 8 offset from zero is
just 8, and that's where you were trying to run. By examining instructions
in the context of the thread that was running (which may be in the crash dump
file itself), it MIGHT be possible to spot a branch to 8 off of some element
in a data structure, and then examining the data structure might reveal what
was missing.
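
The "entry point plus 8" arithmetic in the paragraph above can be made
concrete. The sketch below is purely illustrative: the vector contents, the
slot names and the non-zero address are invented, and the offset of 8 is
simply the value hypothesised in the text.

```python
ENTRY_OFFSET = 8  # the "plus 8" from the hypothesis above


def suspect_slots(vector, fault_pc, offset=ENTRY_OFFSET):
    """Return the names of slots whose entry + offset equals the fault PC."""
    return [name for name, entry in vector.items()
            if entry + offset == fault_pc]


# Hypothetical vector of action routines; 0 stands for a slot that was
# never filled in (or was overwritten with zero).
vector = {
    "open_handler": 0xfffffc0000400000,   # made-up, valid-looking address
    "error_handler": 0x0,                 # never filled in
}

fault_pc = 0x8  # fault_pc from the crash data
print(suspect_slots(vector, fault_pc))  # ['error_handler']
```

In a real dump the vector would have to be read out of kernel memory with a
debugger; the point here is only that a zeroed entry plus the offset lands
exactly on the reported PC.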

In other words, whatever made this system panic probably is a logic bug in
some routine that ran long before the panic occurred; it's probably something
like an error handling routine that's supposed to be plugged into something
like an AdvFS file handling structure, and it's not filled in correctly. But
this is, again, a SWAG (silly wild assed guess), not definitive -- you'd need
more in depth crash dump analysis to pin this down. Since you are using AdvFS
and since AdvFS has obscure bugs, and since it looks like dtfile was running
as the last user process before the panic, I'm just guessing here, but there
are enough bits of data to suggest that MIGHT be what happened.

If you keep seeing these random crashes, then you need to really pressure the
support center to get someone skilled in crash dump analysis to really take
a look at the crash dumps to try to figure out what triggered the panic, as
it is NOT trivially obvious; the condition that occurred should NEVER have
happened. Just applying patches that aren't specific to the root cause of
this problem won't fix it.

Doreen Fletcher [FLETCHED_at_odc.edu]

(I have not currently made these changes, but will try them if staying away
from dtfile does not have the desired effect)

I was having this problem as well, so it was recommended I increase shmmax
from 8388608 to 67108864. I also increased SHMSEG from 32 to 50 and MSGMNI
from 16 to 41.

Alan Rollow [alan_at_nabeth.cxo.dec.com]

(This was not my problem as we had fallen into this trap when we first got
the system and installed the correct MRU kit version after a few similar
kernel memory fault crashes)

Unfortunately, your crash dump listing doesn't include the call
stack of the upper half of the kernel where all the interesting
information is. However, since you have a media changer, I think
I know what to look at... What version of the CLC driver did you
install? If it was the highest one off the MRU kit, that's
probably what is causing the crash. You want the one on the
Associated Products CDROM.

Bob Vickers [bobv_at_dcs.rhbnc.ac.uk]

(It turns out that the problems were not related)

I had crashes with a similar crash-data file when I installed 4.0B patch
kit 8. So I reverted to patch kit 7, and Compaq said the problem was fixed
in patch kit 9. I installed patch kit 9 yesterday and the crashes happened
again, so I reverted to patch kit 7 again.

I have forwarded your message to the lady at Compaq support investigating
my incident.

My guess is that it was a bug introduced by 4.0B PK8 and 4.0D PK3, but I'm
waiting for a more authoritative pronouncement from Compaq.

Original Message: (excluding crash data)

We have an AlphaServer 1200 5/533 with 256 MB of memory, running Digital
UNIX 4.0D (rev 878) with patch kit 3 (2/8/99).

Recently it has been crashing with kernel memory faults. In consultation
with Compaq service we successfully installed patch kit 3 (2/8/99) to try
and solve the problem, but again today it has crashed with a kernel memory
fault.

I would really appreciate it if someone is able to give me some idea of what
the problem is so that I can have a go at fixing it. I have included the
DECevent log entry, and the crash-data file as an attachment. If there is
any other information that would be helpful please inform me and I will
update my question.

******************************** ENTRY 3 ********************************

Logging OS                    2. Digital UNIX
System Architecture           2. Alpha
Event sequence number         4.
Timestamp of occurrence       21-APR-1999 08:38:53
Host name                     titan

System type register          x00000016  Alpha 4000/1200 Series
Number of CPUs (mpnum)        x00000001
CPU logging event (mperr)     x00000000

Event validity                1. O/S claims event is valid
Event severity                1. Severe Priority
Entry type                    302. ASCII Panic Message Type

SWI Minor class               9. ASCII Message
SWI Minor sub class           1. Panic

ASCII Message                 panic (cpu 0): kernel memory fault

----------------------------------------

Chris Wigglesworth
Physics Department, The Open University,
Walton Hall, Milton Keynes, MK7 6AA.
Tel: +44 (0) 1908 652127
Fax: +44 (0) 1908 654192
e-mail: c.wigglesworth_at_open.ac.uk
----------------------------------------
Received on Fri Apr 23 1999 - 16:20:17 NZST
