SUMMARY of CPU Machine Exception Crashes on Alpha EB164 from Ronald D. Bowman on 1997-11-15 (tru64-unix-managers)

From: Ronald D. Bowman <rdbowma_at_tsi.clemson.edu>
Date: Fri, 14 Nov 1997 18:42:17 -0500

Dear Managers:

        I would like to thank all the managers who responded to my
        posting about my crash problem. The original post was on 10/28.

Thanks to the following people:
Gary Menna: G.Menna_at_isu.usyd.edu.au
Dave Cherkus: cherkus_at_homerun.unimaster.com
alan_at_nabeth.cxo.dec.com
Dr. Tom Blinn: tpb_at_zk3.dec.com

The following are their responses and what transpired from them. My original
posting is at the end.

FINAL CONCLUSION: The most probable cause is overheating of the Bcache SIMMs.
Work is currently being done to resolve the overheating problem.

SOME INFO FROM THE CRASH DATA FILES: What you may see if you have this problem:
1) a line in the crash data file that has: EI_STAT reg = fffffff014ffffff
2) a panic string of: "Processor Machine Check"
3) lines in succession with the following:
        Machine Check Processor Fatal Abort
        Machine Check Code = 98
        Machine Check Code = 98
        Processor detected hard error
4) From the UERF file information you will probably see the following:

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 1.
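
For anyone who wants to check a pile of saved crash files against the
symptoms listed above, here is a minimal Python sketch. The strings it
looks for are the ones quoted in this summary; everything else (the
function names, and the assumption that the crash data has already been
read in as plain text) is hypothetical and only for illustration.

```python
import re

# Signatures quoted in this summary. The surrounding file layout is an
# assumption; only the quoted strings themselves come from the crash data.
SIGNATURES = {
    "panic_string": re.compile(r'"Processor Machine Check"'),
    "fatal_abort": re.compile(r"Machine Check Processor Fatal Abort"),
    "check_code_98": re.compile(r"Machine Check Code\s*=\s*98\b"),
    "hard_error": re.compile(r"Processor detected hard error"),
    "ei_stat": re.compile(r"EI_STAT reg\s*=\s*fffffff014ffffff", re.IGNORECASE),
}


def find_signatures(text):
    """Return the sorted names of the signatures present in the text."""
    return sorted(name for name, pat in SIGNATURES.items() if pat.search(text))


def matches_all(text):
    """True only if every signature listed above appears in the text."""
    return len(find_signatures(text)) == len(SIGNATURES)
```

To use it, read each crash-data file into a string and pass it to
find_signatures(); a file that reports only some of the names is worth a
closer look by hand rather than being ruled in or out automatically.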

On a side note, during troubleshooting we were able to determine that
disabling the Bcache on the EB164 results in an unstable platform: with the
Bcache disabled, the machine tended to have reboot problems after crashes.

Sincerely,
Ron Bowman
Techno-Sciences, Inc.
864-646-4028
864-646-4001(fax)
rdbowma_at_ces.clemson.edu
[rdbowma_at_tsi.clemson.edu invalid after 11/17 until the machine comes back]

-------------------------------------------------------------------
From Gary Menna:
----------------
>Hi,
>Did you check what process was running during these crashes?
>Look in your /var/adm/crash/crash-data.n file and look at the line:

>_current_pid: 6435

>Match that up to the list of processes above it and see if it was the
>same process each time. This may narrow things down a bit.

>Good luck,
>Gary Menna

I had already looked into this, and it was a different process for each crash.
As it turns out the problem was hardware related, so this test probably does not
help much with hardware problems.
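
For software problems the check Gary describes can still be worth
automating across several crash files. Below is a minimal Python sketch.
The _current_pid line is as quoted above; the layout of the process list
is not shown in this post, so the sketch only assumes that each process
line contains the pid as a standalone number, and the function names are
hypothetical.

```python
import re

# Matches the "_current_pid: 6435" line, with or without a leading ">".
PID_LINE = re.compile(r"^[ \t>]*_current_pid:\s*(\d+)", re.MULTILINE)


def current_pid(text):
    """Return the pid from the _current_pid line, or None if it is absent."""
    match = PID_LINE.search(text)
    return int(match.group(1)) if match else None


def process_lines(text, pid):
    """Return the lines above the _current_pid line that mention the pid.

    Assumption: the process list precedes the _current_pid line and each
    entry contains the pid as a whole token (not part of a longer number).
    """
    token = re.compile(r"(?<!\w)%d(?!\w)" % pid)
    found = []
    for line in text.splitlines():
        if line.lstrip(" \t>").startswith("_current_pid:"):
            break
        if token.search(line):
            found.append(line.strip())
    return found
```

Running current_pid() and then process_lines() on each crash-data.n file
gives one short list per crash; if the same program shows up every time,
that points toward software, and if it differs each time (as it did
here), it does not.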
------------------------------------------------------------------------

From alan_at_nabeth.cxo.dec.com:
-------------------------------
>Generally speaking "machine checks" are hardware errors. You
>need to use the error log formatter (dia(8) or uerf(8)) to
>format the corresponding error log entry and then find an
>expert to interpret the result.

        This is the path that was taken (see below).
--------------------------------------------------------------------------

From Dave Cherkus:
------------------
>Hi,

>The crash dump is really designed to help debug software problems.
>What you are seeing is a pure hardware error. The best tool to
>debug these issues is DECevent. To see if it is installed,
>try running the 'dia' command. If not, check the associated
>products cd. If this doesn't work, the fallback is the 'uerf'
>command, which comes with the base-os but isn't very intelligent.
>In any regard, the outcome will be that some part of your system
>will probably need to be replaced.

The information I obtained from the crash data files and uerf eventually
pointed to the hardware problem, though I could only make sense of it with
help from Dr. Blinn and his colleagues at DEC. I would not have been able
to decipher that information on my own.
----------------------------------------------------------------------------

From Dr. Blinn:
---------------
This is the first message. Several e-mail conversations with Dr. Blinn
led to the conclusion mentioned at the top.

Ron,

When you get this combination of events:

> _panic_string: 0xfffffc00004dd890 = "Processor Machine Check"

and

> Machine Check Processor Fatal Abort
> Machine Check Code = 98
> Machine Check Code = 98
> Processor detected hard error

what you're seeing is a hardware error. The 21164 processor detected a
problem that was reported to the system software as a machine check. In
general (this is not unique to the EB164 system design), there is no way
for the system software to recover from a machine check, so rather than
let the system keep running (and probably do nasty things like corrupt
your file systems), the kernel shuts the system down immediately. While
a crash dump file is generated, it usually has no useful information.

I would have to dig out my EB164 reference materials to try to figure out
just what a "Machine Check Code = 98" probably means; there is software that
is closely linked to the hardware that generates that code and leaves it in
an error register, from which it is extracted and reported. It's possible
that the error code provides a useful clue to what's failing, that might be
something you can fix. More likely it's only useful to someone who knows
how to remanufacture the board, since it could be a failing CPU chip or it
could be a failing component that's soldered to the mother board.

Figuring out whether the board is repairable is normally part of the added
value that is provided by whoever sold the board to your organization. As
far as I know, Digital never routinely sold those boards directly to end-
user organizations, but I could be mistaken about that. In any case, that
board is, as far as I know, no longer manufactured. Spares probably can be
had through your reseller.

Hope this helps you at least a little bit. I'll pass a copy of your message
as well as this reply to my contacts in our Digital Semiconductor group who
are involved with the board design and UNIX software support; they may get
in touch with you directly through their support team.

Tom
--------------------------------------------------------------------------------

Original Posting of Crash problems:
-----------------------------------

Our system, a DEC Alpha 21164 evaluation board, had been crashing
sporadically during the first 2 months we had it. After a few changes
to the SCSI bus and the video card, it appeared our problems were over.
We ran successfully for 19 days without a crash and then, in the last
48 hours, have crashed 5 times.

More detailed examination of the crash data files resulted in some
information that may help us in solving our problems. We are looking
for some help from anyone who may provide some insight into our
problems.

SYSTEM:
        The system is running the DEC OSF/1 operating system, release 4.0,
        version 564 (4.0B). We have a 333 MHz processor on a 21164
        evaluation board. Memory is 256 MB. The jumbo patch #4 from
        April/May has been installed.

_system_string: 0xffffffffff800798 = "Alpha 21164 Evaluation Board 333 MHz"

INFORMATION FROM Crash-Data file: This information is common to all 14 crash
files that we have saved, not just the 5 most recent crashes.

We finally realized that all of our crashes appear to be caused by the same
phenomenon. The panic string, shown below, is the same in all the crash files:

_panic_string: 0xfffffc00004dd890 = "Processor Machine Check"

Reading through the information on dbx finally led to more insight
into how similar our crashes were, once we looked at the trace
information in all the crash files.

All of the crash files had the following information in them:

Machine Check Processor Fatal Abort
Machine Check Code = 98
Machine Check Code = 98
Processor detected hard error

We also saw that the dumps looked the same in every crash file, as shown
below. What may be of interest is that the lines of code involved in the
panic are always the same. There appear to be two cases (I do not know
much here). The first is entries 2 and 3, which are always lines 1925
and 3820 of sched_prim.c.

The other case (the one I feel is more likely the cause of the problems)
is entry 6: line 1859 of the file eb164.c, which, when executed under
certain conditions, results in the panic and the crash of the system.
Unfortunately, there is no way (that I know of) to find out what is actually
being attempted at this line of code. If we knew that, then maybe we could
determine what is causing our crashes.

_dump_begin: (this same information appears twice more in our crash files)

0 boot(0x400000000, 0xfffffc00004bdd90, 0xfffffc00004bdd90,
        0xfffffc000e6b42e0, 0xfffffc000027a1b4)
["../../../../src/kernel/arch/alpha/machdep.c":2484, 0xfffffc00003c87dc]

1 panic(s = 0xfffffc00004bffa0 = "thread_block: interrupt level call")
    ["../../../../src/kernel/bsd/subr_prf.c":707, 0xfffffc000027b79c]
    pcpu = 0xfffffc00005218c0
    i = 4980640
    mycpu = 0
    spl = 5

2 thread_block() ["../../../../src/kernel/kern/sched_prim.c":1925,
    0xfffffc00002a9e90]
    thread = 0xfffffc0001954dc0
    new_thread = 0xfffffc00004fce58
    mycpu = 0
    myprocessor = 0xfffffc000011c100
    s = 5
    pset = 0xfffffc00004f3730

3 thread_preempt(thread = 0x26, processor = 0xfffffc000011c100)
    ["../../../../src/kernel/kern/sched_prim.c":3820, 0xfffffc00002aca24]
    s = 2
    pset = 0x1

4 boot(0x0, 0xfffffc0001954dc0, 0x2c0000002c, 0x37, 0x1)
    ["../../../../src/kernel/arch/alpha/machdep.c":2431,
    0xfffffc00003c86b8]

5 panic(s = 0xfffffc00004dd890 = "Processor Machine Check")
    ["../../../../src/kernel/bsd/subr_prf.c":791,
    0xfffffc000027b93c]
    pcpu = 0xfffffc00005218c0
    i = 5204704
    mycpu = 0
    spl = 7

6 machcheck(0x2, 0x0, 0x670, 0x20000001a,
            0xffffffff9040b678)
    ["../../../../src/kernel/arch/alpha/hal/eb164.c":1859,
    0xfffffc00003f60dc]
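
Comparing traces like the one above by eye across 14 files is tedious,
so here is a minimal Python sketch that pulls the frame number, function
name, source file, and line number out of each entry. It assumes the
trace text looks exactly like the listing above (a numbered frame
followed by a ["path":line, address] location); the function names are
hypothetical.

```python
import re

# A frame starts a line with its number and the function name, e.g. "6 machcheck(".
FRAME = re.compile(r"^[ \t]*(\d+)[ \t]+(\w+)\(", re.MULTILINE)
# The source location looks like ["../../src/kernel/kern/sched_prim.c":1925, ...
LOCATION = re.compile(r'\["([^"]+)":(\d+),')


def parse_frames(trace):
    """Return a list of (frame, function, file name, line) tuples."""
    headers = list(FRAME.finditer(trace))
    frames = []
    for i, head in enumerate(headers):
        # Only look for the location between this frame and the next one.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(trace)
        loc = LOCATION.search(trace, head.end(), end)
        if loc:
            file_name = loc.group(1).rsplit("/", 1)[-1]
            frames.append((int(head.group(1)), head.group(2),
                           file_name, int(loc.group(2))))
    return frames


def same_locations(traces):
    """True if every trace yields the same sequence of frames."""
    parsed = [parse_frames(t) for t in traces]
    return all(p == parsed[0] for p in parsed[1:])
```

Feeding the trace section of each crash file to same_locations() gives a
quick yes/no on whether the crashes share a call path. A match is only a
hint that they share a cause; for a machine check it says nothing about
which component is failing.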

________________________________________________________________

Any help with explaining what we are experiencing would be greatly
appreciated.

Sincerely,
Ron Bowman
Techno-Sciences Inc.
rdbowma_at_tsi.clemson.edu
Received on Sat Nov 15 1997 - 00:57:06 NZDT