Hi
One of our AlphaStations (a 600 5/255 with DU3.2) is logging a
lot of entries (about one per minute) in /var/adm/messages
saying 'CPU Machine Check Error'. We have had this problem
already three times before with the same machine (and
once on another AlphaStation). On those occasions Digital
replaced the board+cpu, the cache memory, etc.
Now the problem is back again. It is hard to believe that the
hardware would fail this often, so maybe it is not a
hardware error at all? (btw the machine never crashed)
Looking through the archives I found the following about
this problem:
- the error is related to faulty memory and not to the cpu
- use dia instead of uerf
- this error is not a problem unless it's a recurrent event.
Since in our case it's definitely a recurrent event (it occurs about
once every minute or so), it seems we (again) have a problem.
On the other hand, just about every part of the machine has already
been replaced at least once... Or maybe they have replaced the wrong
parts... So is it a hardware error or not? Which part (cpu, memory)
is causing the problem?
Any ideas would be very much appreciated
Thanks
Bart.
PS: this error has occurred about 28000 times since October 1998! After
the last time Digital did maintenance on this machine, the error
disappeared for several weeks, so we never checked it again. Now we have
at least learned to check the error logs more often, even if everything
seems to be working just fine ;-)
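
For checking the log more regularly, a minimal Python sketch along these lines could count the messages per hour. It assumes (this is not confirmed by anything above) that each line in /var/adm/messages starts with a syslog-style timestamp such as 'Mar  9 14:20:44' and contains the text 'CPU Machine Check Error'. The function name count_per_hour is made up for illustration.

```python
from collections import Counter

NEEDLE = "CPU Machine Check Error"  # message text as quoted in this post


def count_per_hour(lines, needle=NEEDLE):
    """Count lines containing `needle`, bucketed by month, day and hour.

    Assumption: each line starts with a syslog-style timestamp such as
    'Mar  9 14:20:44'. Matching lines that do not fit that shape are
    counted under the key 'unknown'.
    """
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        parts = line.split()
        # parts[2] should look like HH:MM:SS if the assumption holds
        if len(parts) >= 3 and parts[2].count(":") == 2:
            bucket = f"{parts[0]} {parts[1]} {parts[2][:2]}h"
        else:
            bucket = "unknown"
        counts[bucket] += 1
    return counts
```

Feeding it an open file, e.g. count_per_hour(open("/var/adm/messages", errors="replace")), would give one count per hour, which makes it easier to see whether the errors really arrive steadily or in bursts.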
Error as reported by dia:
******************************** ENTRY 1 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 28237.
Timestamp of occurrence 09-MAR-1999 14:20:44
Host name ...
System type register x0000000F Alcor
Number of CPUs (mpnum) x00000001
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 1. Severe Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 3. Bcache error (630 entry)
Flags: x80000000 Retryable Error
Mchk Error Code x0000000000000086
EV5 Detected Corr ECC Error
EI ADDR xFFFFFF000ED8B95F
FILL SYNDROME x0000000000000015
EI STATUS xFFFFFFF0C4FFFFFF
Error occurred during D-ref fill
ISR x0000000100000000
Correctable ECC errors (IPL31)
AST requests 3 - 0
x0000000000000000
CIA Syndrome x0000000000000000
ECC Syndrome x0000000000000000
MEM ERR0 x0000000000000000
Memory Port Address
x0000000000000000
MEM ERR1 x0000000000000000
Bits <33:32> of Memory Po
x0000000000000000
Bit <39> of Memory Port
x0000000000000000
Memory Command x0000000000000000
Mask When Err Occurred
x0000000000000000
Mem Seq State Idle
Encoded Set Sel: Set 0 Selected
CIA ERR STAT x0000000000000000
Memory Cycle Source is PCI
IO Cmnd/Addr Queue Vld Bi
x0000000000000000
CPU Cmnd/Addr Queue Vld B
x0000000000000000
DM State: Idle
EV5 Resp. for DMA: No Response
CIA ERR x0000000000000000
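
With this many entries, it may help to tally a field such as FILL SYNDROME or EI ADDR across all of them: if the same value keeps coming back, that may hint at one failing location rather than random errors. The Python sketch below is only an illustration and assumes the dia text output has been saved to a file and looks like the entry above ('LABEL xHEX' lines, entries separated by '**** ENTRY n' banners). The function names are hypothetical and nothing here is part of dia itself.

```python
import re
from collections import Counter

# Matches lines such as 'FILL SYNDROME x0000000000000015' from the entry above.
FIELD_RE = re.compile(r"^\s*(.+?)\s+x([0-9A-Fa-f]+)\b")


def hex_fields(entry_text):
    """Return {label: int} for every 'LABEL xHEX' line in one entry."""
    fields = {}
    for line in entry_text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            fields[m.group(1)] = int(m.group(2), 16)
    return fields


def split_entries(report_text):
    """Split a saved report into entries on the '**** ENTRY n' banner lines."""
    chunks = re.split(r"^\*+\s*ENTRY\s+\d+.*$", report_text, flags=re.M)
    # chunks[0] is whatever precedes the first banner, so skip it
    return [c for c in chunks[1:] if c.strip()]


def tally(report_text, label="FILL SYNDROME"):
    """Count how often each value of `label` occurs across all entries."""
    return Counter(hex_fields(e).get(label) for e in split_entries(report_text))
```

Calling tally(text) on the saved report returns a Counter keyed by syndrome value; tally(text, "EI ADDR") does the same for the reported address.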
--
Bart Rousseau, Ph.D. student
University of Antwerp - Dep. of Chemistry
Structural Chemistry - Quantum Chemistry
http://sch-www.uia.ac.be/struct/quantum/
Received on Tue Mar 09 1999 - 15:06:31 NZDT