DIA: cpu or memory error?

From: Xu, Ying <Ying.Xu_at_telecheck.com>
Date: Thu, 17 Feb 2000 16:06:02 -0600

Hi, Dear managers,

We are getting system correctable errors on several CPUs of our Alpha 8400
(Digital UNIX 4.0E, patch kit 1). DECevent reports them as low-priority CPU
machine check errors, but it also points to memory errors. I suspect this is
a memory error rather than CPU errors because of the number of CPUs
involved. My questions are:

1. What does the error really mean? How serious is it?
2. What's the fix for that?

I have attached the error log from /var/adm/messages and the DIA output below.

Thank you very much for your help.

Ying Xu
ITSS - System Management
email:ying.xu_at_telecheck.com
phone: 713-331-6503

------------------------------------------------
/var/adm/messages

Feb 9 22:36:49 dware01 vmunix: Reporting suspended for 5 min
Feb 9 22:36:55 dware01 vmunix: System correctable error count on cpu 4 exceeds threshold
Feb 9 22:36:55 dware01 vmunix: Reporting suspended for 5 min
Feb 9 22:36:58 dware01 vmunix: System correctable error count on cpu 1 exceeds threshold
Feb 9 22:36:58 dware01 vmunix: Reporting suspended for 5 min
Feb 9 22:37:23 dware01 vmunix: System correctable error count on cpu 2 exceeds threshold
Feb 9 22:37:23 dware01 vmunix: Reporting suspended for 5 min
Feb 9 22:37:29 dware01 vmunix: System correctable error count on cpu 6 exceeds threshold
Feb 9 22:37:29 dware01 vmunix: Reporting suspended for 5 min
Feb 9 22:37:52 dware01 vmunix: System correctable error count on cpu 7 exceeds threshold
Feb 9 22:37:52 dware01 vmunix: Reporting suspended for 5 min
Feb 9 22:40:34 dware01 vmunix: System correctable error count on cpu 5 exceeds threshold
Feb 9 22:40:34 dware01 vmunix: Reporting suspended for 5 min
Feb 9 22:41:19 dware01 vmunix: System correctable error count on cpu 3 exceeds threshold
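
The "number of CPUs involved" argument can be checked by tallying which CPUs show up in the messages file. What follows is only a minimal Python sketch: the regular expression is taken from the sample lines above and may not match other kernel versions, and the function name cpus_over_threshold is made up for illustration. It matches only the first part of the message, so it works whether or not the lines have been wrapped by a mail client.

```python
import re
from collections import Counter

# Pattern taken from the sample messages above; other kernel versions may
# word this message differently (assumption).
THRESHOLD_RE = re.compile(r"System correctable error count on cpu (\d+)")


def cpus_over_threshold(lines):
    """Return a Counter mapping CPU number -> number of threshold messages."""
    counts = Counter()
    for line in lines:
        match = THRESHOLD_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "Feb 9 22:36:55 dware01 vmunix: System correctable error count on cpu 4",
    "Feb 9 22:36:55 dware01 vmunix: Reporting suspended for 5 min",
    "Feb 9 22:36:58 dware01 vmunix: System correctable error count on cpu 1",
    "Feb 9 22:37:23 dware01 vmunix: System correctable error count on cpu 2",
]

counts = cpus_over_threshold(sample)
print(sorted(counts))  # CPU numbers that crossed the threshold
```

To run it against the real log, pass an open file instead of the sample list, for example `cpus_over_threshold(open("/var/adm/messages"))`. In the excerpt above the message appears for CPUs 1 through 7.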


"dia -o full -R -i cpu"

DECevent V2.9


******************************** ENTRY 8 ********************************



Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 2473.
Timestamp of occurrence 10-FEB-2000 13:49:19
Host name dware01

System type register x0000000C AlphaServer 8x00
Number of CPUs (mpnum) x00000008
CPU logging event (mperr) x00000007

Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. CPU Machine Check Errors

CPU Minor class 4. System Correctable Error (620)

--TLaser 620 Corr Error--
Software Flags x00000001 TLSB Error Log Snapshot Packet Present
Active CPUs x000000FF
Hardware Rev x00000000
System Serial Number NI815AB979
Module Serial Number AY74733840
System Revision x00000000
MCHK Reason Mask x00000086
MCHK Frame Rev x00000001
EI STAT xFFFFFFF0C5FFFFFF
                                     DATA SOURCE IS MEMORY OR SYSTEM
                                     CORRECTABLE ECC ERROR
                                     D-ref fill
                                     EV5 Chip Rev 5
EI ADDRESS xFFFFFF0129B2A05F
FILL SYNDROME x00000000000000AD
                                     Data Bit = 045
ISR x0000000100000000
                                     Correctable ECC errors (IPL31)
                                        AST requests 3 - 0 x0000000000000000
WHAMI x00 TLSB NODE ID 0.
                                     CPU0
MISCR xD5 B-Cache Size 4 Mbyte Bcache
                                     Two Processors
                                     TLSB RUN Signal
                                     CPU0 Running console
                                     CPU1 Running console
TLDEV x76008014 -- Device Type: Dual EV5/6 Proc,
                                                        625Mhz, 4meg Bcache
TLBER x00440000 CORRECTABLE READ DATA ERROR
                                     DATA SYNDROME 2
TLESR0 x004000B0
TLESR1 x00400C0C
TLESR2 x00A0AD00 ECC Syndrome 0 x00000000
                                     ECC Syndrome 1 x000000AD
                                     CORRECTABLE READ ECC ERROR

  Error Syndrome 0 x00 No Error
  Error Syndrome 1 xAD Data Bit = 173

TLESR3 x00409090
Palcode Revision x0000000700000502
                                     Palcode Rev: 5.2-7

TLSB Base Adr x0000000000000000

*TLaser CPU Registers*
TLSB Node Number 0.
TLDEV x76008014 -- Device Type: Dual EV5/6 Proc,
                                                        625Mhz, 4meg Bcache

TLBER x00440000 CORRECTABLE READ DATA ERROR
                                     DATA SYNDROME 2
TLCNR x00000200
TLVID x00000010
TLESR0 x004000B0
TLESR1 x00400C0C
TLESR2 x00A0AD00 ECC Syndrome 0 x00000000
                                     ECC Syndrome 1 x000000AD
TLESR3 x00409090
TLEPAERR x00600000 Second ADG Design: Rev A
MODCONFIG x00E088C4 Bcache Size: 4 MB
                                     Bcache Idle Cycles Before 3.
                                     Max Command Queue Entries 2.
                                     Max Bus Queue Entries 4.
TLEPMERR x00000000
TLEPDERR x00000000
TLEP Interrupt Mask 0 x000000FE IPL 14 Interrupt Enable
                                     IPL 15 Interrupt Enable
                                     IPL 16 Interrupt Enable
                                     IPL 17 Interrupt Enable
                                     Interprocessor Interrupt Enable
                                     Interval Timer Interrupt Enable
                                     CPU Halt Enable
TLEP Interrupt Summary 0 x00000040 Interval Timer Interrupt Outstanding
TLEP Interrupt Mask 1 x00000000
TLEP Interrupt Summary 1 x00000000


*TLaser CPU Registers*
TLSB Node Number 1.
TLDEV x76008014 -- Device Type: Dual EV5/6 Proc,
                                                        625Mhz, 4meg Bcache

TLBER x00400000
TLCNR x00000210
TLVID x00000032
TLESR0 x004000B0
TLESR1 x00000303
TLESR2 x0080AD00
TLESR3 x00000303
TLEPAERR x00600000 Second ADG Design: Rev A
MODCONFIG x00E088C4 Bcache Size: 4 MB
                                     Bcache Idle Cycles Before 3.
                                     Max Command Queue Entries 2.
                                     Max Bus Queue Entries 4.
TLEPMERR x00000000
TLEPDERR x00000000
TLEP Interrupt Mask 0 x000000FE IPL 14 Interrupt Enable
                                     IPL 15 Interrupt Enable
                                     IPL 16 Interrupt Enable
                                     IPL 17 Interrupt Enable
                                     Interprocessor Interrupt Enable
                                     Interval Timer Interrupt Enable
                                     CPU Halt Enable
TLEP Interrupt Summary 0 x00000000
TLEP Interrupt Mask 1 x00000000
TLEP Interrupt Summary 1 x00000000
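
As a cross-check on the registers above, the ECC syndrome bytes can be pulled out of the raw TLESR2 values for both TLSB nodes and compared with the FILL SYNDROME. This is a rough Python sketch, not a reference decoder: the bit layout (syndrome 0 in bits 7:0, syndrome 1 in bits 15:8) is an assumption inferred only from how DECevent decoded x00A0AD00 above, and the function name ecc_syndromes is made up for illustration.

```python
def ecc_syndromes(tlesr2):
    """Split a TLESR2 value into (syndrome 0, syndrome 1).

    Assumed layout, inferred only from the decoded output above:
    syndrome 0 in bits 7:0, syndrome 1 in bits 15:8.
    """
    return tlesr2 & 0xFF, (tlesr2 >> 8) & 0xFF


node0 = ecc_syndromes(0x00A0AD00)  # TLESR2 logged for TLSB node 0
node1 = ecc_syndromes(0x0080AD00)  # TLESR2 logged for TLSB node 1
fill_syndrome = 0xAD               # low byte of FILL SYNDROME above

# Do both nodes carry the same syndrome, and does it match the fill syndrome?
print(node0 == node1, node0[1] == fill_syndrome)
```

Under that assumed layout both CPU modules carry the same syndrome 1 value, xAD, which also matches the FILL SYNDROME. That is consistent with a common source such as memory, in line with the "DATA SOURCE IS MEMORY OR SYSTEM" decode, but it does not prove it.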


*
Received on Thu Feb 17 2000 - 22:06:59 NZDT
