ECC errors on AlphaStations

From: Sean O'Connell <sto_at_stat.Duke.EDU>
Date: Wed, 25 Jun 1997 09:08:39 -0400 (EDT)

Greetings-

I have put together a summary of how to interpret ECC memory
events. I am posting it so the alpha community can scrutinize it,
i.e., if I have made any horrible assumptions, please tell me. In
particular, are the assumptions about the memory allocation way
off base? They do seem to place the error on the DIMM reported by
the SRM, and they account for the case where the DIMM was moved
from J26 to J27.

These events are logged as CPU exceptions in /var/adm/binary.errlog.
uerf shows them only as CPU exceptions; dia does a much better job
of pointing you in the right direction. Depending on your syslog
setup, they also go to /var/adm/syslog.dated/<date>/kernel.log and
/var/adm/messages.

They look like:
Machine Check error corrected by processor
Physical address of error ffffff0014be08ef Corrected ECC Error in\
     Memory during D-Cache fill
Fill Syndrome = 0000000000000068
Single Bit error in Quadword 0 at bit<59> in a Data bit
EI Address = ffffff0014be08ef
EI Status = fffffff0c4ffffff
Interrupt Status Reg = 0000000100000000
ECC Syndrome = 0000000000000000
Memory Port 0 Status Reg = 0000000000000000
Memory Port 1 Status Reg = 0000000000000000
CIA Error Status = 0000000000000000
CIA Error Reg = 0000000000000000
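
If you want to pull these fields out of kernel.log automatically, a
minimal sketch follows. The patterns are based only on the sample
message above, so other OS or firmware versions may word the lines
differently; the function name parse_ecc_event is made up for this
illustration.

```python
import re

# Patterns derived from the single sample message shown above (an assumption:
# other versions may format these lines differently).
ADDR_RE = re.compile(r"Physical address of error\s+([0-9a-fA-F]{16})")
BIT_RE = re.compile(r"Single Bit error in Quadword (\d+) at bit<(\d+)>")


def parse_ecc_event(text):
    """Return the address, quadword and bit from one logged event, or None."""
    addr = ADDR_RE.search(text)
    if addr is None:
        return None  # no physical address line, so not an event we recognise
    bit = BIT_RE.search(text)
    return {
        "address": addr.group(1).lower(),
        "quadword": int(bit.group(1)) if bit else None,
        "bit": int(bit.group(2)) if bit else None,
    }


sample = """Machine Check error corrected by processor
Physical address of error ffffff0014be08ef Corrected ECC Error in\\
     Memory during D-Cache fill
Fill Syndrome = 0000000000000068
Single Bit error in Quadword 0 at bit<59> in a Data bit
"""
print(parse_ecc_event(sample))
```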

------------------------------------------------------------
ECC Error Checks
------------------------------------------------------------
SRM TEST (I know this works):
To figure out which DIMM is misbehaving from the SRM, run
the memory test, courtesy of posting to alpha-osf-managers
list by Knut Hellebro <Knut.Hellebo_at_nho.hydro.com>:

>>> set d_group field
>>> memory
>>> showit

F1 will freeze the output (it is a toggle) if a lot of stuff
goes by.

The error, if there is one, will refer to a jumper location,
e.g. <J26>. This corresponds to one of the 8 memory sockets
on the motherboard.

From left to right, looking in from the front of the
computer:

JUMPER#  J25  J22  J26  J27  J28  J29  J30  J23
BANK      A    B    A    B    A    B    A    B
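
The same lookup can be written down as a small sketch, useful if
you are scripting around the SRM output. The socket order is copied
from the table above; the helper name describe_socket is made up for
this illustration.

```python
# Socket order copied from the table above, left to right from the front.
SOCKETS = ["J25", "J22", "J26", "J27", "J28", "J29", "J30", "J23"]


def describe_socket(label):
    """Return (position from the left, bank) for a label such as '<J26>'."""
    index = SOCKETS.index(label.strip("<>").upper())  # ValueError if unknown
    bank = "A" if index % 2 == 0 else "B"  # banks alternate A, B in the table
    return index + 1, bank


print(describe_socket("<J26>"))  # (3, 'A')
```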
------------------------------------------------------------
MAP TEST (not as sure here):
You can also convert the Physical location of the error to a
jumper location:

Physical address of error ffffff0003235fff
                                  --------
Take the rightmost 8 hex digits and convert them from hexadecimal
to decimal to get the byte location of the error:

0x16^7+3x16^6+2x16^5+3x16^4+5x16^3+15x16^2+15x16^1+15x16^0

= 52649983 = 50.21MB (divide by 1024x1024)
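
The arithmetic above can be checked with a short sketch. It assumes,
as the hand calculation does, that only the rightmost 8 hex digits
of the logged address matter; the function name error_offset is made
up for this illustration.

```python
def error_offset(address_hex):
    """Return (bytes, MB) for the rightmost 8 hex digits of a logged address."""
    low = address_hex.strip().lower()[-8:]  # e.g. "03235fff"
    offset = int(low, 16)                   # hexadecimal to decimal
    return offset, offset / (1024 * 1024)   # bytes, and bytes divided by 1024x1024


offset, mb = error_offset("ffffff0003235fff")
print(offset, round(mb, 2))  # 52649983 50.21
```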

-------------------------------------------------------------------
Since these are DIMMs (dual in line), split each DIMM's capacity
in half (i and ii). Assuming 16MB DIMMs in both Bank A and Bank B,
the address ranges, in MB, would be:

        Populate Bank A (MB)             Populate Bank B (MB)
           i          ii                    i           ii
J30 ->   0-8   and  32-40        J23 ->  64-72  and  96-104
J28 ->   8-16  and  40-48        J29 ->  72-80  and 104-112
J26 ->  16-24  and  48-56        J27 ->  80-88  and 112-120
J25 ->  24-32  and  56-64        J22 ->  88-96  and 120-128

Therefore, since 50.21MB falls in the 48-56 range, the bad DIMM
would be the one in J26, the 3rd DIMM from the left.
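
The table lookup can be generalized to other DIMM sizes with a
sketch like the one below. It encodes only the layout guessed at in
this post (both banks fully populated with equal-size DIMMs, each
DIMM split into halves i and ii), so it inherits any mistake in that
guess; the function name socket_for_offset is made up for this
illustration.

```python
# Socket order within each bank, taken from the table above.
BANK_A = ["J30", "J28", "J26", "J25"]
BANK_B = ["J23", "J29", "J27", "J22"]


def socket_for_offset(offset_mb, dimm_mb):
    """Map an error offset in MB to a socket, under the assumed layout."""
    half = dimm_mb / 2        # each DIMM contributes two halves, i and ii
    bank_span = 4 * dimm_mb   # four DIMMs per bank
    if not 0 <= offset_mb < 2 * bank_span:
        raise ValueError("offset lies outside the two populated banks")
    bank = BANK_A if offset_mb < bank_span else BANK_B
    within_bank = offset_mb % bank_span
    # Halves i and ii each span 4 * half MB and repeat the same socket order.
    slot = int((within_bank % (4 * half)) // half)
    return bank[slot]


print(socket_for_offset(50.21, 16))  # J26
```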
-------------------------------------------------------------------
I have verified this in two cases:
<CASE 1> 16MB DIMMs (use above)
Error in J26 -> ffffff000303f19f -> 50590111 = 48.25MB (agrees)

error message >
Machine Check error corrected by processor
Physical address of error ffffff000303f19f Corrected ECC Error in\
    Memory during D-Cache fill
Fill Syndrome = 000000000000004a
Single Bit error in Quadword 0 at bit<33> in a Data bit
EI Address = ffffff000303f19f
EI Status = fffffff0c4ffffff
Interrupt Status Reg = 0000000100000000
ECC Syndrome = 0000000000000000
Memory Port 0 Status Reg = 0000000000000000
Memory Port 1 Status Reg = 0000000000000000
CIA Error Status = 0000000000000000
CIA Error Reg = 0000000000000000
-------------------------------------------------------------------
<CASE 2> 64MB DIMMs:

        Populate Bank A (MB)             Populate Bank B (MB)
           i           ii                   i            ii
J30 ->   0-32   and 128-160      J23 -> 256-288 and 384-416
J28 ->  32-64   and 160-192      J29 -> 288-320 and 416-448
J26 ->  64-96   and 192-224      J27 -> 320-352 and 448-480
J25 ->  96-128  and 224-256      J22 -> 352-384 and 480-512

Error in J26 -> ffffff0004bebcef -> 79609071 = 75.92MB (agrees)

I then moved the DIMM to J27:

Error in J27 -> ffffff0014bdecef -> 347991279 = 331.87MB (agrees)

error message >
Machine Check error corrected by processor
Physical address of error ffffff0014bce4ef Corrected ECC Error in\
    Memory during D-Cache fill
Fill Syndrome = 0000000000000068
Single Bit error in Quadword 0 at bit<59> in a Data bit
EI Address = ffffff0014bce4ef
EI Status = fffffff0c4ffffff
Interrupt Status Reg = 0000000100000000
ECC Syndrome = 0000000000000000
Memory Port 0 Status Reg = 0000000000000000
Memory Port 1 Status Reg = 0000000000000000
CIA Error Status = 0000000000000000
CIA Error Reg = 0000000000000000
-------------------------------------------------------------------

Hope this helps,
Sean

*************************************************************************
* Sean O'Connell *
* Computer Projects Manager *
* Duke University Institute of Statistics and Decision Sciences *
*************************************************************************
* Phone: (919) 684-5419 *
* Fax: (919) 684-8594 *
* Email: sto_at_stat.Duke.EDU *
* Mail: 220 Old Chemistry Building *
* P.O. Box 90251 *
* Durham NC 27708-0251 *
* *
* "torture your data long enough, and it will confess to anything" *
*************************************************************************
Received on Wed Jun 25 1997 - 15:53:54 NZST