QUESTION: ECC errors

From: Sean O'Connell <sto_at_stat.Duke.EDU>
Date: Thu, 27 Mar 1997 17:05:53 -0500 (EST)

Greetings-

I have seen several recent postings dealing with the problem of
CPU exception logging due to ECC errors. I have three
AlphaStation 500/266s which have logged a large number of these
sorts of errors.

The following is an example of such an error message:
> dia -R -icpu -o full

DECevent V2.1

******************************** ENTRY 1 ********************************

Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 4.
Timestamp of occurrence 27-MAR-1997 11:48:43
Host name boninsegna

System type register x0000000F Alcor
Number of CPUs (mpnum) x00000001
CPU logging event (mperr) x00000000

Event validity 1. O/S claims event is valid
Event severity 1. Severe Priority
Entry type 100. CPU Machine Check Errors

CPU Minor class 3. Bcache error (630 entry)

Flags: x80000000 Retryable Error
Mchk Error Code x0000000000000086
                                EV5 Detected Corr ECC Error
EI ADDR xFFFFFF0004BC0CEF
FILL SYNDROME x0000000000000068
EI STATUS xFFFFFFF0C4FFFFFF
                                Error occurred during D-ref fill
ISR x0000000100000000
                                Correctable ECC errors (IPL31)
                                AST requests 3 - 0 x0000000000000000
CIA Syndrome x0000000000000000
                                ECC Syndrome x0000000000000000
MEM ERR0 x0000000000000000
                                Memory Port Address x0000000000000000
MEM ERR1 x0000000000000000
                                Bits <33:32> of Memory Po x0000000000000000


                                Bit <39> of Memory Port x0000000000000000

                                Memory Command x0000000000000000

                                Mask When Err Occurred x0000000000000000

                                Mem Seq State Idle
                                EV5 Resp. for DMA: No Response
CIA ERR x0000000000000000

Taking the advice of a previous posting, I brought the machine
down to the console level (having upgraded the firmware to
as500_v6_4.exe, v3_9 cdrom) and did the following:

>>> set d_group field
>>> memory
>>> showit

Then reams of test output went by. This machine has 512MB of memory
(all Dataram). First, before I show a sample error message: how many
passes is this supposed to do? Does it run in an infinite loop, or
does it write and read some multiple of the main memory?

In between the test summaries, I was treated to a blur of
messages similar to the following (this is a hand-scribbled note;
there were a few different EI_ADDR values):

Processor correctable error through vector 00000063

EI_STAT: FFFFFFF0C4FFFFFF EI_ADDR: FFFFFF0004Bxxxx
FILL_SYN: 0000000000000068 ISR: 0000000100000000 MCES 4
 databit 59 J26 bank0
page# 2561 base 9e
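
Since these messages scroll by too fast to read, one way to check
whether they keep naming the same data bit and location is to capture
the console output (for example over a serial line) and tally it. The
following is only a minimal Python sketch under that assumption: the
function name and regular expressions are hypothetical, the line
shapes are taken from the hand-copied note above rather than from a
verified console capture, and the sample text simply repeats that note.

```python
import re
from collections import Counter

# Assumed line shapes, copied from the hand-written note above.
# These patterns are illustrative, not a documented console format.
LOCATION_RE = re.compile(r"databit\s+(\d+)\s+(J\d+)\s+(bank\d+)")
SYNDROME_RE = re.compile(r"FILL_SYN:\s*([0-9A-Fa-f]+)")


def tally_errors(text):
    """Count (databit, location, bank) triples and FILL_SYN values."""
    locations = Counter()
    syndromes = Counter()
    for line in text.splitlines():
        match = LOCATION_RE.search(line)
        if match:
            databit, location, bank = match.groups()
            locations[(int(databit), location, bank)] += 1
        match = SYNDROME_RE.search(line)
        if match:
            syndromes[match.group(1).upper()] += 1
    return locations, syndromes


# Sample input: the note from this posting, not real captured output.
sample = """\
EI_STAT: FFFFFFF0C4FFFFFF EI_ADDR: FFFFFF0004Bxxxx
FILL_SYN: 0000000000000068 ISR: 0000000100000000 MCES 4
 databit 59 J26 bank0
page# 2561 base 9e
"""

locations, syndromes = tally_errors(sample)
for key, count in locations.most_common():
    print(key, count)
for value, count in syndromes.most_common():
    print(value, count)
```

If nearly every entry in a real capture named the same data bit and
the same location, that would be consistent with a single faulty part,
although the tally alone cannot say whether that part is a DIMM or the
B-cache.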

HELP! Is this indicative of a bad B-cache (I am assuming that is
the tertiary 2MB cache), or is one of the DIMMs bad? I need to know
whether this is a DEC matter or a Dataram matter, so that I can resolve
this without finger-pointing back and forth. I imagine that all of
this logging interferes with the speed of the system and eats up
disk space.

As a further question, when I bring these machines up after they have
been powered off, show config says that the tested memory is only 33MB.
Is this common? Is it correct?

Thanks so much. Any insight on this matter would be greatly appreciated.

Sean

*************************************************************************
* Sean O'Connell *
* Computer Projects Manager *
* Duke University Institute of Statistics and Decision Sciences *
*************************************************************************
* Phone: (919) 684-5419 *
* Fax: (919) 684-8594 *
* Email: sto_at_stat.Duke.EDU *
* Mail: 220 Old Chemistry Building *
* P.O. Box 90251 *
* Durham NC 27708-0251 *
*************************************************************************
Received on Thu Mar 27 1997 - 23:20:40 NZST
