CPU Machine Check Errors

From: Bart Rousseau <roussea_at_uia.ua.ac.be>
Date: Tue, 09 Mar 1999 15:51:42 +0100

Hi

One of our AlphaStations (a 600 5/255 with DU3.2) is generating a
lot (about one per minute) of entries in /var/adm/messages
saying 'CPU Machine Check Error'. We have had this problem
already three times before with the same machine (and
once on another AlphaStation). On those occasions Digital
replaced the board+CPU, the cache memory, etc.
Now the problem is back again. It is hard to believe that the
hardware would fail this often, so maybe it is not a
hardware error at all? (By the way, the machine has never crashed.)
Searching the archives, I found the following about
this problem:

  - the error is related to faulty memory and not to the cpu
  - use dia instead of uerf
  - this error is not a problem unless it's a recurrent event.

Since in our case it's definitely a recurrent event (it occurs about
once every minute or so), it seems we (again) have a problem.
On the other hand, just about every part of the machine has already
been replaced at least once... Or maybe they replaced the wrong
parts... So is it a hardware error or not? Which part (CPU, memory)
is causing the problem?

Any ideas would be very much appreciated

Thanks
Bart.
PS: this error has occurred about 28000 times since October 1998! After
the last time Digital did maintenance on this machine, the error
disappeared for several weeks, so we never checked it again. Now we have
at least learned to check the error logs more often, even if everything
seems to be working just fine ;-)
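
A minimal sketch of how the routine log check mentioned above could be
automated. It counts lines containing 'CPU Machine Check Error' and groups
them by hour. The function name count_per_hour is hypothetical, and the
classic syslog timestamp prefix ("Mar  9 14:20:44 ...") is an assumption
about the layout of /var/adm/messages, not something shown in this post.

```python
import re
from collections import Counter

# Assumption: each log line starts with a syslog-style timestamp,
# e.g. "Mar  9 14:20:44". Lines without one are counted under ("?", 0, 0).
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}")

def count_per_hour(lines, needle="CPU Machine Check Error"):
    """Count lines containing `needle`, keyed by (month, day, hour)."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        m = STAMP.match(line)
        if m:
            key = (m.group(1), int(m.group(2)), int(m.group(3)))
        else:
            key = ("?", 0, 0)
        counts[key] += 1
    return counts
```

Passing an open file object for /var/adm/messages as `lines` would give one
count per hour in log order. A sudden jump from zero to roughly sixty per
hour would match the "about one per minute" rate described here.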


Error as reported by dia:


******************************** ENTRY 1 ********************************


Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 28237.
Timestamp of occurrence 09-MAR-1999 14:20:44
Host name ...

System type register x0000000F Alcor
Number of CPUs (mpnum) x00000001
CPU logging event (mperr) x00000000

Event validity 1. O/S claims event is valid
Event severity 1. Severe Priority
Entry type 100. CPU Machine Check Errors

CPU Minor class 3. Bcache error (630 entry)

Flags: x80000000 Retryable Error
Mchk Error Code x0000000000000086
                                     EV5 Detected Corr ECC Error
EI ADDR xFFFFFF000ED8B95F
FILL SYNDROME x0000000000000015
EI STATUS xFFFFFFF0C4FFFFFF
                                     Error occurred during D-ref fill
ISR x0000000100000000
                                     Correctable ECC errors (IPL31)
                                     AST requests 3 - 0 x0000000000000000
CIA Syndrome x0000000000000000
                                     ECC Syndrome x0000000000000000
MEM ERR0 x0000000000000000
                                     Memory Port Address x0000000000000000
MEM ERR1 x0000000000000000
                                     Bits <33:32> of Memory Po x0000000000000000
                                     Bit <39> of Memory Port x0000000000000000
                                     Memory Command x0000000000000000
                                     Mask When Err Occurred x0000000000000000
                                     Mem Seq State Idle
                                     Encoded Set Sel: Set 0 Selected
CIA ERR STAT x0000000000000000
                                     Memory Cycle Source is PCI
                                     IO Cmnd/Addr Queue Vld Bi x0000000000000000
                                     CPU Cmnd/Addr Queue Vld B x0000000000000000
                                     DM State: Idle
                                     EV5 Resp. for DMA: No Response
CIA ERR x0000000000000000
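
With some 28000 entries logged, reading them one by one is impractical. The
sketch below pulls the top-level register values (for example EI ADDR and
FILL SYNDROME) out of one entry so they can be compared across entries. The
function name parse_registers is hypothetical. The line layout it expects
("<label> x<hex>" starting in column 0, with indented lines being decoded
annotations) is an assumption based only on the single entry shown above.

```python
import re

# Assumption: a register line starts in column 0 and ends with "x<hex digits>".
# Indented annotation lines and bare hex values are skipped.
REGISTER = re.compile(r"^(\S.*?)\s+x([0-9A-Fa-f]+)\s*$")

def parse_registers(entry_text):
    """Return {label: integer value} for the register lines of one entry."""
    registers = {}
    for line in entry_text.splitlines():
        m = REGISTER.match(line)
        if m:
            registers[m.group(1)] = int(m.group(2), 16)
    return registers
```

Tallying the (EI ADDR, FILL SYNDROME) pairs over many entries would show
whether the same values keep recurring, which may be worth bringing up with
the service engineer.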


--
Bart Rousseau, Ph.D. student
University of Antwerp - Dep. of Chemistry
Structural Chemistry  - Quantum Chemistry
http://sch-www.uia.ac.be/struct/quantum/
Received on Tue Mar 09 1999 - 15:06:31 NZDT
