SUMMARY: Alpha 2100 CPU exceptions

From: Robert Honore <robert_at_digi-data.com>
Date: Fri, 26 Jul 1996 08:48:04 -0400

Dear Managers,
        My humblest apologies for my rather unclear earlier summary.
A more complete version appears below.

        It WAS a memory problem. I called
the guys at Digital support and they gave me the info I needed to
find out which module was the culprit.
        The UERF utility does not parse the error entries
properly and mistakes memory errors for CPU errors. From the
information I received from the Digital guy I found out that the
machine was trying to report uncorrectable memory errors.
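
One way to look past the misleading CPU EXCEPTION label is to pull the
memory-specific frames out of the text report and list them by module.
The following is a minimal sketch, assuming the full text report format
quoted in the original question below. The helper name memory_frames is
hypothetical, and the code only collects raw register values. It does
not decode the register bits, since the bit layouts are not given here.

```python
import re

# A section header such as "----- DIGITAL 2100 A500 MEMORY SPECIFIC FRAME -----".
FRAME_RE = re.compile(r"^-+\s*(.+?)\s*-+$")
# A register line such as "MERR xE2C00008065B7280" (name, then x-prefixed hex).
REG_RE = re.compile(r"^([A-Z][A-Z0-9_ ]*?)\s+x([0-9A-Fa-f]+)\b")


def memory_frames(lines):
    """Return one dict of {register name: integer value} per memory frame."""
    frames = []
    current = None  # dict being filled, or None when outside a memory frame
    for raw in lines:
        line = raw.strip()
        header = FRAME_RE.match(line)
        if header:
            # Any section header ends the previous frame; only memory
            # frames start a new dict.
            if "MEMORY SPECIFIC FRAME" in header.group(1):
                current = {}
                frames.append(current)
            else:
                current = None
            continue
        if current is None:
            continue
        reg = REG_RE.match(line)
        if reg:
            current[reg.group(1)] = int(reg.group(2), 16)
    return frames
```

Printing the MODULE NUMBER and MERR values of each returned frame side
by side makes it easier to see which module's registers stand out, and
gives something concrete to read out to support.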
        Thanks to the following for pointing me in the correct
direction.

        Claude Soma soma_c_at_decus.fr
        Clifford Krieger ckrieger_at_psi.prc.com
        Isaac Oribioye I.O.Oribioye_at_herts.ac.uk
        Jim Skoog skoog_at_netcom.com
        Karl Marble kmarble_at_ultranet.com
        Kurt Carlson SXKAC_at_orca.alaska.edu
        Kurt Wild Kurt.Wild_at_ska.com
        Melvin Smith msmith_at_quix.robins.af.mil
        Nick Hill NMH1_at_axpr11.r1.ac.uk

No one was able to tell me where I might obtain documentation on
how to analyse that information for myself, though.

My original question was:

Hello managers,
        I am writing again to enlist your aid in solving a very
perplexing problem. I have an AlphaServer 2100 machine running
Digital Unix 3.0b with 512 MB of RAM and 8 RZ28 disk drives. The
machine is filling up its binary errorlog file
(/var/adm/binary.errlog) with entries of the form:

*********************** ENTRY 1. ************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 1.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Mon Jul 15 04:15:00 1996
OCCURRED ON SYSTEM orpheus
SYSTEM ID x00000009 CPU TYPE: DEC 2100
SYSTYPE x00000000

----- UNIT INFORMATION -----

UNIT CLASS CPU

----- LEP MACHINE CHECK STACK FRAME -----

PROCESSOR OFFSET x00000110
SYSTEM OFFSET x000001A0
PALTEMP1 x0000000140002058
PALTEMP2 x000782F800000004
PALTEMP3 x0000000000000001
PALTEMP4 x0000000000000000
PALTEMP5 x0000000000000000
PALTEMP6 x0000000000000240
PALTEMP7 x0000000000004200
PALTEMP8 x0000000000000400
PALTEMP9 x0000000000000000
PALTEMP10 xFFFFFC000044D630
PALTEMP11 x0000000000000000
PALTEMP12 xFFFFFC000044D9C0
PALTEMP13 xFFFFFC000044D9F0
PALTEMP14 xFFFFFC000044DA50
PALTEMP15 xFFFFFC000044D7D0
PALTEMP16 xFFFFFC000044D4F0
PALTEMP17 x00000000000192D0
PALTEMP18 x000000011FFFFB70
PALTEMP19 xFFFFFFFFB1F1BA58
PALTEMP20 xFFFFFC00005A39B0
PALTEMP21 x0000000000000000
PALTEMP22 x40424272727E7E7E
PALTEMP23 xFFCFDFFBBFFFBEE5
PALTEMP24 x0000000000000000
PALTEMP25 x0000000000010000
PALTEMP26 x0000000000000000
PALTEMP27 x0000000000000000
PALTEMP28 x00000000178A8000
PALTEMP29 xFFFFFFFC00000000
PALTEMP30 x0000000000000001
PALTEMP31 x00000000041AFA58
EXC_ADDR x00000000200049F0
                     EXCEPTING OR EXECUTING INSTRUCTION DID NOT COMPLETE PC IS x4800127C
EXC_SUM x0000000000000000
EXC_MSK x0000000000000000
ICCSR x0000000000000004
                                PC0 INT ENABLED AFTER 2**16 EVENTS
                                PC1 INT ENABLED AFTER 2**8 EVENTS
                                PC0 COUNTER INPUT TOTAL ISSUES DIVIDED BY 2
                                PC1 COUNTER INPUT DCACHE MISSES
                                FP INSTRUCTIONS CAUSE FEN EXCEPTIONS
                                ADDRESS SPACE NUMBER = x0
PAL_BASE x0000000000014000
                                BASE ADDRESS FOR PALCODE = x5
HIER x0000000000001CF0
                                CORRRECTABLE READ ERROR INTERRUPT ENABLED
                                CPU HARDWARE INTERRUPT ENABLED ON PIN 3
                                CPU HARDWARE INTERRUPT ENABLED ON PIN 4
                                CPU HARDWARE INTERRUPT ENABLED ON PIN 5
                                PC1 INTERRUPT DISABLED
                                PC0 INTERRUPT DISABLED
                                CPU HARDWARE INTERRUPT ENABLED ON PIN 0
                                CPU HARDWARE INTERRUPT ENABLED ON PIN 1
                                CPU HARDWARE INTERRUPT ENABLED ON PIN 2
HIRR x0000000000000000
MM_CSR x0000000000003640
                                INTEGER REGISTER USED IS R4.
DC_STAT x0000000000000007
                                DC_HIT LAST LOAD OR STORE MISSED _DCACHE
                                OPCODE RA FIELD - INTEGER REGISTER IS R0.
DC_ADDR x00000000FFFFFFFF SE O SECOND ERROR OCCURRED
ABOX_CTL x000000000000142E
                                FUNCTIONS ENABLED - MCHECK ENABLED FOR UNCORRECTABLE ERRORS
                                FUNCTIONS ENABLED - CRD CORRECTED READ _DATA INTERRUPT ENABLED
                                FUNCTIONS ENABLED - SINGLE ENTRY ICACHE _STREAM BUFFER ENABLED
                                FUNCTIONS ENABLED - DCACHE ENABLED
BIU_STAT x0000000000000240
                                BIU_CMD CYCLE CLASS IS READ_BLOCK
BIU_ADDR x00000000000192D0
                                PHYSICAL ADDRESS OF CACHE BLOCK WITH ERROR IS xC96
BIU_CTL x0000000030006477
                                EXTERNAL CACHE ENABLED
                                EXTERNAL CACHE ECC ENABLED
                                EXTERNAL CACHE READ/WRITE SPEED IN CPU CYCLES IS _3
                                EXTERNAL CACHE WRITE ENABLE TIMING BIT FIELD IS x4001
FILL_SYNDROME x0000000000000000 SINGLE BIT ERROR IS NO ERRORS
                                SINGLE BIT ERROR IS NO ERRORS
FILL_ADDRESS x0000000000006100
                                PHYSICAL ADDRESS OF QUADWORD WITH ERROR x308
VA x0000000000006170 D-STREAM FAULT OR DTB MISS - VIRTUAL ADDRESS IS x6170
BC_TAG x0000000024961248
                                S BIT - CACHE BLOCK SHARED TAG ADDRESS IS xB092

----- DIGITAL 2100 A500 CPU SPECIFIC FRAME -----

BCC_CSR0 x0000000000000220
                                FILL WRONG DUP TAG STORE PAR ENB B-CACHE COND I/O UPDATES
BCCE_CSR1 x000001A000000110
BCCEA_CSR2 x000000010000008A
BCUE_CSR3 x0000000040002058
                                UNCORRECTABLE ERROR
                                EDC SYNDROME 0 x0
                                EDC SYNDROME 2 x20
                                EDC SYNDROME 1 x0
                                EDC SYNDROME 3 x0
BCUEA_CSR4 x0000000000000004 B-CACHE MAP OFFSET x4
                                TAG VALUE x0
                                B-CACHE MAP OFFSET H x182F8
                                PREDICTED TAG PARITY H
                                TAG PARITY H
                                TAG VALUE H x0
DTER_CSR5 x0000000000000001 MISSED ERROR OCCURRED
                                DUP TAG STORE OFFSET x0
                                DUP TAG x0
                                DUP TAG STORE OFFSET H x0
                                DUP TAG H x0
CBCTL_CSR6 x0000000000000000
                                C/A WRONG PARITY x0
                                COMMANDER ID x0
                                ARB CONTROL MASK x0
                                C/A WRONG PARITY H x0
                                COMMANDER ID H x0
                                ARB CONTROL MASK H x0
CBE_CSR7 x0000000000000000
                                MISS COUNT x0
                                MISS COUNT H x0
CBEAL_CSR8 x0000000000000240
                                ADDRESS x90
                                ADDRESS H x0
CBEAH_CSR9 x0000000000004200
PMBX_CSR10 x0000000000000400
IPIR_CSR11 x0000000000000000
SIC_CSR13 xFFFFFC000044D630
ADLK_CSR13 x0000000000000000
MADRL_CSR14 xFFFFFC000044D9C0
CRREV4 xFFFFFC000044D9F0

----- DIGITAL 2100 A500P T2 SPECIFIC FRAME -----

IOCSR x0000000000000000
CERR1 xE3800010E3800010
CERR2 x0020004320200043
CERR3 x0000000000000000
PERR1 x000000064061A3C0
HAE0_1 x0000000000000000
HAE0_2 x00000000400807FF
WBASE1 x000000003FF00000
WMASK1 x0000000000000000
TBASE1 x00000000000C00FF
WBASE2 x000000000FF00000
WMASK2 x0000000000460000
TBASE2 x0000000000000000
TLBBR x0000002400000000
IVR x0000000000000000
HAE0_3 x0000000000000003
HAE0_4 x0000000000000000
TDR0 x0000002400000000
TDR1 x0000000000000000
TDR2 x0000000000000003
TDR3 x0000000000000000
TDR4 x0000000000000000
TDR5 x0000000000000000
TDR6 x0000000000000000
TDR7 x0000005800000008

----- DIGITAL 2100 A500 MEMORY SPECIFIC FRAME -----

MODULE NUMBER x0000000000000000
MERR xE2000008E2000008
MCMD1 x0020004320200043
MCMD2 x800150A0800150A0
MCONF x0EC4055F0EC90669
MEDC1 x000000170000000D
MEDC2 x2000000200000000
MEDCC x0000080000000800
MREF x0000000000000000
FILTER x0000005800000008

----- DIGITAL 2100 A 500 MEMORY SPECIFIC FRAME -----

MODULE NUMBER x0000000000000000
MERR xE2400008E2400008
MCMD1 x0020004320200043
MCMD2 x8201505182015051
MCONF x01CB06F10CB60A7B
MEDC1 x000000170000000D
MEDC2 x2000000020000000
MEDCC x0000080000000800
MSCTL x000001D8000001D8
MREF x0000000000000000
FILTER x0000005800000008

----- DIGITAL 2100 A500 MEMORY SPECIFIC FRAME -----

MODULE NUMBER x0000000000000000
MERR xE2400008E2800008
MCMD1 x0020004320200043
MCMD2 x8401505284015052
MCONF x01CB06F10CB61A7B
MEDC1 x000000170000000D
MEDC2 x2000000020000000
MEDCC x0000080000000800
MSCTL x000001D8000001D8
MREF x0000000000000000
FILTER x0000005800000008

----- DIGITAL 2100 A500 MEMORY SPECIFIC FRAME -----

MODULE NUMBER x0000000000000001
MERR xE2C00008065B7280
MCMD1 x00200043002000DF
MCMD2 x8601505386015053
MCONF x0C1405FB0C140F4E
MEDC1 x00000017000004DF
MEDC2 x2000000020000000
MEDCC x0000080000000800
MSCTL x000001D8000001D8
MREF x0000000000000000
FILTER x0000000000000000

I have the full error report from the system, produced by the
command uerf -R -o full -c err. I checked how many entries like
this there were in the file and found 3085 of them, all recorded
within a space of 5 minutes. The binary errorlog file has grown
to over 400 MB in size!
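
A count like this can be reproduced from the saved text report. The
sketch below is a minimal illustration, assuming the report comes from
uerf -R -o full -c err and that every entry carries the ENTRY banner
and the OCCURRED/LOGGED ON line shown above. The function name
summarize is hypothetical and is not part of uerf.

```python
import re
from datetime import datetime

# Banner line such as "*********************** ENTRY 1. ************************".
ENTRY_RE = re.compile(r"^\*+ ENTRY\s+(\d+)\.\s+\*+\s*$")
# Timestamp line such as "OCCURRED/LOGGED ON Mon Jul 15 04:15:00 1996".
STAMP_RE = re.compile(r"^OCCURRED/LOGGED ON\s+(.+?)\s*$")


def summarize(lines):
    """Return (entry_count, earliest_timestamp, latest_timestamp)."""
    count = 0
    stamps = []
    for raw in lines:
        line = raw.strip()
        if ENTRY_RE.match(line):
            count += 1
            continue
        stamp = STAMP_RE.match(line)
        if stamp:
            # Assumes English day and month names, as in the report above.
            stamps.append(
                datetime.strptime(stamp.group(1), "%a %b %d %H:%M:%S %Y")
            )
    if not stamps:
        return count, None, None
    return count, min(stamps), max(stamps)
```

Calling summarize(open("uerf_full.txt")) on a saved report (the file
name is hypothetical) returns the number of entries together with the
earliest and latest timestamps, so the time window can be checked by
subtracting one from the other.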

I have replaced the CPU module and the problem still seems to
occur as of 15-Jul-1996 09:00:00. Any ideas?
        Also, can anyone suggest to me where I may obtain
documentation to allow me to analyse these error entries for
myself?


-- 
Yours sincerely,
Robert Honore
robert_at_digi-data.com
Phone: 623 6658 Fax: 623 0978
Snail Mail: Digi Data systems limited, 96 Wrightson Road,
Trinidad, W. I.
> If one didn't have to WORK for a living, WORK would be MUCH MORE FUN!
Received on Fri Jul 26 1996 - 15:24:59 NZST
