Concerning my problem with uerf logging a CPU EXCEPTION
every second under load: I received a reply, and I also had
DEC come in and diagnose the machine.
Apparently an exception or two per day is acceptable, but
20,000 per day points to hardware problems. DEC says
I have a bad memory module.
To diagnose this yourself, DEC says to run:
uerf -r 100 -Z | more
Excerpt:
EVENT CLASS                             ERROR EVENT 
OS EVENT TYPE                  100.     CPU EXCEPTION 
SEQUENCE NUMBER                  1.
OPERATING SYSTEM                        DEC OSF/1 
OCCURRED/LOGGED ON                      Tue Jul  9 08:48:50 1996
OCCURRED ON SYSTEM                      quix 
SYSTEM ID                 x00060009     CPU TYPE:  DEC 2100 
SYSTYPE                   x00000000
----- UNIT INFORMATION -----
UNIT CLASS                              CPU 
RECORD ENTRY DUMP:
  RECORD BODY
Look at address 0338:
0338:   E2000008  00233210  20200503  002000CF        *.....2#...  .. .*
0348:   800150A0  800150A0  02420894  0C140D02        *.P...P....B.....*
0358:   0000000D  00000526  20000000  20000000        *....&...... ... *
At address 0338 the long-word E2000008 means a memory problem, or
at least that is how I understood the technician. He also said
that 40001 is the error code, but I don't see a 40001.
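
For anyone who wants to pull that longword out of a saved
uerf -Z report without reading the dump by eye, here is a
minimal sketch in Python. It only assumes the dump line layout
shown in the excerpt above (a four-digit hex address, a colon,
8-digit hex longwords, then an ASCII column). The function names
are made up for illustration, and the script does not interpret
the value: what E2000008 means is still the technician's word.

```python
import re

# Layout assumed from the excerpt: "0338:   E2000008  00233210 ...   *ascii*"
DUMP_LINE = re.compile(r"^\s*([0-9A-Fa-f]{4}):\s+(.*)$")
LONGWORD = re.compile(r"^[0-9A-Fa-f]{8}$")


def parse_dump(text):
    """Map each dump address to the list of longwords printed on its line."""
    table = {}
    for line in text.splitlines():
        m = DUMP_LINE.match(line)
        if not m:
            continue
        words = []
        for tok in m.group(2).split():
            if not LONGWORD.match(tok):
                break  # reached the ASCII column
            words.append(int(tok, 16))
        if words:
            table[int(m.group(1), 16)] = words
    return table


def first_longword(text, address):
    """Return the first longword at the given address, or None if absent."""
    words = parse_dump(text).get(address)
    return words[0] if words else None


SAMPLE = """\
0338:   E2000008  00233210  20200503  002000CF        *.....2#...  .. .*
0348:   800150A0  800150A0  02420894  0C140D02        *.P...P....B.....*
"""

if __name__ == "__main__":
    value = first_longword(SAMPLE, 0x0338)
    print("longword at 0338: %08X" % value)
```

Feeding it the full output of the uerf command above instead of
SAMPLE should work the same way, as long as the dump lines keep
this layout.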
Hope this helps someone.
        Melvin Smith
> CPU EXCEPTIONS are reported for a variety of things.  We see them
> with both of our 2100's on occasion and they are typically 
> single bit correctable memory errors.  Once in a while they are
> bcache errors (correctable).  How to tell what they are is almost 
> impossible  with uerf (can be done if you use uerf -Z and somebody 
> at Digital has told you the "secret" binary codes).  
> dia (decevent, available with 4.0, I believe; we got an advance 
> copy for v3.2d) does break them out somewhat, but you still have 
> to know that '40001' error status for a memory module is a single 
> bit correctable error.
> 
> Since you had oodles of them one weekend, something was clearly broken
> (needed re-seating, at a minimum)... whether it was cpu board or
> memory depends on further breakdown.  If you continue to get them
> in low volumes (say, fewer than a couple of single bit correctable
> errors a day) you may be ok... or maybe they replaced the wrong thing.
> 
> I'd suggest each time you get a CPU EXCEPTION you analyze what it is
> and insist Digital tell you what it means (and how to analyze for 
> yourself... so you don't call in every single bit correctable).
> We've gone as far as to automate nightly summary reports from uerf and
> decevent so we know in the morning what happened the day before.
> Automated human readable uerf reports would be a nice feature for
> Digital to add (though we already did it for ourselves, so we don't
> care about them doing it anymore).
> 
> Hope that helps,  Kurt Carlson, U of Alaska
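
Kurt's own scripts were not posted, so the following is only a
guess at what a nightly summary could look like: a small Python
sketch that counts events per day and per type from a saved uerf
report. It assumes every entry prints its "OS EVENT TYPE" line
before its "OCCURRED/LOGGED ON" line, as in the excerpt above,
and that timestamps look like "Tue Jul  9 08:48:50 1996". The
function name is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Field layouts assumed from the uerf excerpt in this message.
TYPE_RE = re.compile(r"^OS EVENT TYPE\s+(\d+)\.\s+(.*\S)\s*$")
WHEN_RE = re.compile(r"^OCCURRED/LOGGED ON\s+(.*\S)\s*$")


def count_events_per_day(report):
    """Count log entries keyed by (ISO date, event type name)."""
    counts = Counter()
    event_type = None
    for line in report.splitlines():
        m = TYPE_RE.match(line)
        if m:
            event_type = m.group(2)  # remember type until its timestamp shows up
            continue
        m = WHEN_RE.match(line)
        if m and event_type is not None:
            # Collapse the double space uerf prints before single-digit days.
            text = " ".join(m.group(1).split())
            stamp = datetime.strptime(text, "%a %b %d %H:%M:%S %Y")
            counts[(stamp.date().isoformat(), event_type)] += 1
            event_type = None
    return counts


if __name__ == "__main__":
    import sys
    for (day, name), n in sorted(count_events_per_day(sys.stdin.read()).items()):
        print("%s  %-20s %6d" % (day, name, n))
```

Saved as a script, it reads a report on standard input and prints
one line per day and event type, which makes it easy to tell an
exception or two per day from thousands.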
Received on Wed Jul 10 1996 - 21:25:03 NZST