SUMMARY: crash evaluation

From: Ray Stell <stellr_at_smyrna.cc.vt.edu>
Date: Wed, 12 Apr 1995 14:16:06 -0400 (EDT)

Original Question:
==================
I would like some pointers on how to evaluate crash data,
docs to purchase, etc. My crashdc data is below; how does
one look at it?
... crashdc deleted, available on request ...


Answer:
=======
Call DEC at 1-800-354-9000.
Check the Kernel Debugging manual AA-PS2TC-TE

(editor: from the Crash Analysis Examples of the Kernel debugging doc)

     Finding problems in a crash dump file is a task that
     takes practice and experience to do well. Exactly how
     you determine what caused a crash varies depending on
     how the system crashed. The causes of some crashes
     are relatively easy to determine, while finding the cause
     of other crashes can be difficult and time-consuming.
                               -----------------------------

                                (editor: well, forget that 8^)


Responses:
==========
Jon Reeves <reeves_at_zk3.dec.com>
-------------------------------
Well, in this case, there really isn't too much there of value: All those
warnings are really telling you that most of the normal information isn't
valid. Given that this was a machine check, that's not real surprising.

You probably need to look at your uerf data. Either that, or find a
hardware expert; if you call the support line, they may have some other
tricks.

But to answer the original question: the manual you want is the Kernel
Debugging manual, which lives on the System & Network Administration
bookshelf.

alan_at_nabeth.cxo.dec.com
-----------------------
        Machine checks are nearly always a hardware failure of
        some sort, but I don't know of any documents that
        adequately explain that. The System Administration
        Guide of the base documentation set would be the first
        place to look though. The guide to Kernel Programming
        and Debugging may also be helpful, though I haven't
        looked at it (AA-PS2TC-TE).

        Some versions of ULTRIX would include a kernel error
        message manual that would explain many of the kernel
        messages including machine checks and traps. A Reader's
        Comment letter to our documentation suggesting one for
        Digital UNIX may start some work in that direction.

        The hardware manual for the particular system may also
        offer some clue as to what machine checks are possible.

        As for the crash dump, it looks like parts of it didn't
        get written, which could either be part of the cause or
        a symptom of the machine check.


small_at_gidday.enet.dec.com
-------------------------
The best place to start is the Guide to Kernel Debugging. You will find a copy
(in bookreader format) on the distribution CD.

Your crash-data (machine check) indicates a hardware-detected error. You need
to place a support call on your hardware.

I generally find the preserved message buffer (dbx> p *pmsgbuf) is a good place
to start. This is the information usually logged to the messages file. Events
prior to a hang or panic often point to the cause of the problem.

walter_at_decum.enet.dec.com
-------------------------
your system has a "Hardware error" ...............

replace simm module 6

     (editor: I asked how this was derived)

i know this from the info of your crash ......

the crash is reporting a "Hardware error" and the info from:

BIU_STAT Register
FILL_ADDR Register
SYNDROME Register
BCache Tag Register


reports an error in simm 6 ..........


     (editor: I insert those registers here)

        biu_stat = 0000000000002440
        biu_addr = 0000000003681f10
        biu_ctl = 0000000e10006335
        fill_syndrome = 0000000000000001
        fill_addr = 0000000003681f10
        va = 00000000001081e8
        bc_tag = 0000000000400652
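
The register values above can at least be read mechanically, even without
the internal tool. The following is a minimal sketch, not the method the
Digital analysis tool uses: it parses the dump into integers, checks that
biu_addr and fill_addr report the same address, and converts that address
to an offset in megabytes. Treating fill_addr as a physical memory address
is an assumption here, and the function name parse_registers is made up
for illustration.

```python
# Register dump copied verbatim from the crash data in this post.
DUMP = """
biu_stat = 0000000000002440
biu_addr = 0000000003681f10
biu_ctl = 0000000e10006335
fill_syndrome = 0000000000000001
fill_addr = 0000000003681f10
va = 00000000001081e8
bc_tag = 0000000000400652
"""


def parse_registers(text):
    """Return {name: int} for every 'name = hexvalue' line in text."""
    regs = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or unrelated lines
        name, value = line.split("=", 1)
        regs[name.strip()] = int(value.strip(), 16)
    return regs


regs = parse_registers(DUMP)

# Both address registers hold the same value in this dump.
same_addr = regs["biu_addr"] == regs["fill_addr"]

# ASSUMPTION: fill_addr is a physical address, so dividing by 1 MB
# gives a rough position in physical memory.
addr_mb = regs["fill_addr"] / (1024 * 1024)

print("biu_addr == fill_addr:", same_addr)
print("fill_addr = 0x%x (about %.1f MB)" % (regs["fill_addr"], addr_mb))
```

Under that assumption the reported address sits roughly 54.5 MB into
memory. Which SIMM that corresponds to still depends on how the memory
boards are populated, which this sketch does not know.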

     (editor: how did you know this?)

my info is from an analyse tool, where i need the register contents .....

I think that there is a book available (for digital internal use only)
but i don't have a pointer to it
I have books about the registers from the sable systems, but the info is
only for digital internal use .... and i don't want to lose my job .

I am very sorry that i can give you only such an unsatisfying answer.

       (editor: I appreciate your effort and constraints,
                Is the "analyse tool" private, also?)

You are right ....... the analyse tool is only for digital internal use .
....

But when you have a hardware problem you can call your digital
service center, and a specialist from digital can help you replace
your memory or board.


  (editor: I called the support center and this is what I have from them)

digital service center:
=======================
sandy_at_jerry.alf.dec.com
-----------------------
You do have a hardware problem with memory. Yes, we have some tools here
that we use. Your best bet is to log a hardware call with the serial number
of the machine and they will let you know if it is still under warranty
or what the status is. I cannot tell you which simm unless I know how many
memory boards you have and how many MB are on each. Also, is uerf giving any
memory errors?


I will still need to know how many boards you have, and how much memory is
on each, to tell you which simm.
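
Why the board count and sizes matter can be shown with a small sketch.
It assumes the boards are mapped back to back from physical address 0
with no interleaving, which may well not be true of the real machine,
and it uses made-up board layouts. The function name locate_board is
hypothetical, and this is not the service center's tool. The address is
the fill_addr value from the crash data.

```python
def locate_board(addr, board_sizes_mb):
    """Return (board_index, offset_in_board) for a physical address.

    ASSUMPTION: boards are laid out contiguously from address 0 in the
    order given, with no interleaving. Returns None if the address lies
    beyond the total installed memory.
    """
    base = 0
    for index, size_mb in enumerate(board_sizes_mb):
        size = size_mb * 1024 * 1024
        if base <= addr < base + size:
            return index, addr - base
        base += size
    return None


FILL_ADDR = 0x3681f10  # from the crash data in this post

# HYPOTHETICAL layouts; the real configuration is not known.
for layout in ([32, 32], [64], [16, 16]):
    print(layout, "->", locate_board(FILL_ADDR, layout))
```

With two hypothetical 32 MB boards the address falls on the second board
(index 1) at offset 0x1681f10; with a single 64 MB board it falls on
board 0; with two 16 MB boards it is outside installed memory entirely.
The same address points at different hardware depending on the layout,
which is why the question above cannot be answered from the address alone.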

     (editor: I asked if there was a way to tell from the os the
               answer to this question and got no reply, Sandy,
               are you ok? )

===============================================================
Ray Stell stellr_at_vt.edu (703) 231-4109 KE4TJC

the IRS has been recognised as a leader among government agencies
in customer service. - Margaret Milner Richardson, IRS commissioner
Received on Wed Apr 12 1995 - 14:16:40 NZST
