SUMMARY: CPU EXCEPTION OS EVENT TYPE 100 from Mr. Jolt Cola on 1996-07-11 (tru64-unix-managers)

From: Mr. Jolt Cola <msmith_at_quix.robins.af.mil>
Date: Wed, 10 Jul 1996 14:42:05 -0400 (EDT)

Concerning my problem with uerf logging a CPU EXCEPTION
every second under load: I received a reply, and I also had
DEC come in and diagnose the machine.

Apparently an exception or two per day is acceptable, but
20,000 per day points to a hardware problem. DEC says
I have a bad memory module.

To diagnose this yourself, DEC says to run:

uerf -r 100 -Z | more

Excerpt:

EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 1.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Tue Jul 9 08:48:50 1996
OCCURRED ON SYSTEM quix
SYSTEM ID x00060009 CPU TYPE: DEC 2100
SYSTYPE x00000000

----- UNIT INFORMATION -----

UNIT CLASS CPU

RECORD ENTRY DUMP:
RECORD BODY

Look at address 0338

0338: E2000008 00233210 20200503 002000CF *.....2#... .. .*
0348: 800150A0 800150A0 02420894 0C140D02 *.P...P....B.....*
0358: 0000000D 00000526 20000000 20000000 *....&...... ... *

At address 0338 the long-word E2000008 means a memory problem, or
at least that is how I understood the technician. He also said
that 40001 is the error code, but I don't see a 40001.
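
For anyone who wants to pull that long-word out of a saved dump
without paging through it by hand, here is a minimal sketch. It
assumes the dump lines look exactly like the three shown above (a
hex offset, a colon, 8-digit hex long-words, then the ASCII column
between asterisks). The function name longwords_at is my own and is
not part of uerf, and the sketch only extracts the value; it does not
decode what the value means.

```python
import re

# Matches dump lines such as:
#   0338: E2000008 00233210 20200503 002000CF *.....2#... .. .*
# Group 1 is the hex offset, group 2 the run of 8-digit long-words.
DUMP_LINE = re.compile(r"^\s*([0-9A-Fa-f]{4,}):((?:\s+[0-9A-Fa-f]{8})+)")


def longwords_at(dump_text, offset):
    """Return the long-words on the dump line labelled `offset`, or None."""
    for line in dump_text.splitlines():
        m = DUMP_LINE.match(line)
        if m and int(m.group(1), 16) == offset:
            return m.group(2).split()
    return None


# Sample taken from the excerpt above.
sample = """\
0338: E2000008 00233210 20200503 002000CF *.....2#... .. .*
0348: 800150A0 800150A0 02420894 0C140D02 *.P...P....B.....*
"""

words = longwords_at(sample, 0x0338)
print(words[0])
```

The first element of the returned list is the long-word the
technician pointed at; the ASCII column is ignored.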

Hope this helps someone.

Melvin Smith

> CPU EXCEPTIONS are reported for a variety of things. We see them
> with both of our 2100's on occasion and they are typically
> single bit correctable memory errors. Once in a while they are
> bcache errors (correctable). How to tell what they are is almost
> impossible with uerf (can be done if you use uerf -Z and somebody
> at Digital has told you the "secret" binary codes).
> dia (decevent, available with 4.0, I believe; we got an advance
> copy for v3.2d) does break them out somewhat, but you still have
> to know that '40001' error status for a memory module is a single
> bit correctable error.
>
> Since you had oodles of them one weekend, something was clearly broken
> (needed re-seating, at a minimum)... whether it was cpu board or
> memory depends on further breakdown. If you continue to get them
> in low volumes (say, fewer than a couple single bit correctable
> errors a day) you may be ok... or maybe they replaced the wrong thing.
>
> I'd suggest each time you get a CPU EXCEPTION you analyze what it is
> and insist Digital tell you what it means (and how to analyze for
> yourself... so you don't call in every single bit correctable).
> We've gone as far as to automate nightly summary reports from uerf and
> decevent so we know in the morning what happened the day before.
> Automated human readable uerf reports would be a nice feature for
> Digital to add (though we already did it for ourselves, so we don't
> care about them doing it anymore).
>
> Hope that helps, Kurt Carlson, U of Alaska
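
A nightly summary along the lines Kurt describes could start from
something like the sketch below. It is not his script. It assumes the
report text was saved from uerf -r 100 and that every event carries an
"OCCURRED/LOGGED ON" line in the format shown in the excerpt above.
The names exceptions_per_day and noisy_days are made up for this
example, and the default limit of 2 per day is only my reading of
"an exception or two per day is acceptable".

```python
import re
from collections import Counter

# Matches e.g. "OCCURRED/LOGGED ON Tue Jul 9 08:48:50 1996"
# and captures month, day of month, and year.
STAMP = re.compile(
    r"OCCURRED/LOGGED ON\s+\w{3}\s+(\w{3})\s+(\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+(\d{4})"
)


def exceptions_per_day(report_text):
    """Count logged events per calendar day, keyed like 'Jul 9 1996'."""
    counts = Counter()
    for month, day, year in STAMP.findall(report_text):
        counts[f"{month} {int(day)} {year}"] += 1
    return counts


def noisy_days(counts, limit=2):
    """Return the days whose event count exceeds `limit` (assumed threshold)."""
    return sorted(day for day, n in counts.items() if n > limit)


# Made-up sample: three events on one day, one on the next.
report = """\
OCCURRED/LOGGED ON Tue Jul 9 08:48:50 1996
OCCURRED/LOGGED ON Tue Jul 9 08:48:51 1996
OCCURRED/LOGGED ON Tue Jul 9 08:48:52 1996
OCCURRED/LOGGED ON Wed Jul 10 02:15:00 1996
"""

counts = exceptions_per_day(report)
for day, n in counts.items():
    print(day, n)
print("over limit:", noisy_days(counts))
```

This only counts events; it does not tell a memory error from a
bcache error, which still needs the record body or decevent.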
Received on Wed Jul 10 1996 - 21:25:03 NZST
