Hardware errors

From: <marco_at_gore.afep.cornell.edu>
Date: Wed, 16 Oct 96 10:16:19 -0400

Hello Managers,
        I am having trouble with my DEC 3000 500X. It crashes occasionally, complaining about Hardware Check
Errors. I called DEC about this and they said that it was the cache memory and that the whole system board
(motherboard) would have to be replaced. They couldn't tell which level of cache it was. They also said
that the processor and the L2 cache chips could not simply be replaced because they can't be removed (they are
in sockets, however). I can't justify spending $5000 for an infrequent problem which they diagnosed by
deciphering the registers in my uerf output. On their recommendation I blew out all the dust from inside the
box. That didn't seem to help. Now the problem is more frequent and I need help before I commit to the $5000.
        In the archives there was a mention of CAM SCSI errors derived from motherboard faults. I am also
getting these errors. Another symptom is that my cron backup routine reports bad reads. These
errors do not occur all the time. Sometimes there are a lot of them (after 16 the dump is aborted), sometimes a few,
sometimes none. These errors could be from bad files.
They occur on only one file system. These three symptoms could be related.
        Attached is my uerf output (uerf -o full -R). I would really appreciate any insight. Thank you in
advance.



Uerf output:


********************************* ENTRY 1. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 1.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Mon Oct 14 18:34:37 1996
OCCURRED ON SYSTEM gore
SYSTEM ID x00020004 CPU TYPE: DEC 3000
SYSTYPE x00000000

----- UNIT INFORMATION -----

UNIT CLASS CPU

----- KN15AA CPU 630/620 STACK FRAME -----

PROCESSOR OFFSET x00000018
SYSTEM OFFSET x00000048
BIU_STAT x0000000000000340
                                        BIU_CMD CYCLE CLASS IS READ_BLOCK
                                        FILL_ECC PRI. CACHE FILL FROM EXT.
                                         _CACHE HAD ECC ERROR
BIU_ADDR x0000000000108018
                                        PHYSICAL ADDRESS OF CACHE BLOCK WITH ERROR IS x8400
DC_STAT x0000000000000007
                                        DC_HIT LAST LOAD OR STORE MISSED
                                         _DCACHE
                                        OPCODE RA FIELD - INTEGER REGISTER IS R 0.
FILL_SYNDROME x0000000000002C00 SINGLE BIT ERROR IS NO ERRORS
                                        SINGLE BIT ERROR IS DATA BIT 05
FILL_ADDR x00000000042E1548
                                        PHYSICAL ADDRESS OF QUADWORD WITH ERROR x2170AA
BC_TAG x0000000000404295 EXTERNAL CACHE TAG CONTROL BITS
                                         _EXTERNAL CACHE HIT
                                        D BIT - CACHE BLOCK DIRTY
                                        V BIT - CACHE BLOCK VALID
                                        TAG ADDRESS IS x214
                                        EXTERNAL CACHE TAG CONTROL BITS TAG
                                         _ADDRESS PARITY BIT
INT_EXC_IDENT x0000000000000000
                                        INTERRUPT OR EXCEPTION IS NONE

********************************* ENTRY 2. *********************************

----- EVENT INFORMATION -----

EVENT CLASS OPERATIONAL EVENT
OS EVENT TYPE 300. SYSTEM STARTUP
SEQUENCE NUMBER 0.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Mon Oct 14 12:01:14 1996
OCCURRED ON SYSTEM gore
SYSTEM ID x00020004 CPU TYPE: DEC 3000
SYSTYPE x00000000
MESSAGE LK401 keyboard, language English
                                         _(American)

{ cropped}

********************************* ENTRY 3. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 302. PANIC
SEQUENCE NUMBER 2.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Mon Oct 14 11:58:24 1996
OCCURRED ON SYSTEM gore
SYSTEM ID x00020004 CPU TYPE: DEC 3000
SYSTYPE x00000000
MESSAGE panic (cpu 0): Machine check -
                                         _Hardware error


********************************* ENTRY 4. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 1.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Mon Oct 14 11:58:24 1996
OCCURRED ON SYSTEM gore
SYSTEM ID x00020004 CPU TYPE: DEC 3000
SYSTYPE x00000000

----- UNIT INFORMATION -----

UNIT CLASS CPU

----- LEP MACHINE CHECK STACK FRAME -----

PROCESSOR OFFSET x00000110
SYSTEM OFFSET x000001A0
PALTEMP1 x0000000000000000
PALTEMP2 x000C06F800000004
PALTEMP3 x0000000000000000
PALTEMP4 x000000000000000F
PALTEMP5 x0000000000000000
PALTEMP6 x000003FF80017E48
PALTEMP7 x0000000000104000
PALTEMP8 x0000000000000000
PALTEMP9 x0000000000000008
PALTEMP10 xFFFFFC00003C3050
PALTEMP11 x0000000000000000
PALTEMP12 xFFFFFC00003C33F0
PALTEMP13 xFFFFFC00003C3420
PALTEMP14 xFFFFFC00003C3480
PALTEMP15 xFFFFFC00003C31F0
PALTEMP16 xFFFFFC00003C2F00
PALTEMP17 x0000000000000000
PALTEMP18 x000000011FFFF520
PALTEMP19 xFFFFFFFF88B5FA58
PALTEMP20 xFFFFFC00004CDC10
PALTEMP21 x0000000000000000
PALTEMP22 x6068686C7C7C7C7C
PALTEMP23 x00000062000007F9
PALTEMP24 x0000000000000000
PALTEMP25 x0000000000010000
PALTEMP26 x0000000000000000
PALTEMP27 x0000000000000000
PALTEMP28 x000000000191A000
PALTEMP29 xFFFFFFFC00000000
PALTEMP30 x0000000000000001
PALTEMP31 x00000000048CBA58
EXC_ADDR x0000000080017E86
                                        EXCEPTING OR EXECUTING INSTRUCTION DID NOT COMPLETE PC IS xE0005FA1
EXC_SUM x0000000000000000
EXC_MSK x0000000000000000
ICCSR x0000000000000000
                                        PC0 INT ENABLED AFTER 2**16 EVENTS
                                        PC1 INT ENABLED AFTER 2**12 EVENTS
                                        PC0 COUNTER INPUT TOTAL ISSUES DIVIDED
                                         _BY 2
                                        PC1 COUNTER INPUT DCACHE MISSES
                                        FP INSTRUCTIONS CAUSE FEN EXCEPTIONS
                                        ADDRESS SPACE NUMBER = x0
PAL_BASE x0000000000060000
                                        BASE ADDRESS FOR PALCODE = x18
HIER x00000000000018F0
                                        CORRECTABLE READ ERROR INTERRUPT
                                         _ENABLED
                                        CPU HARDWARE INTERRUPT ENABLED ON PIN
                                         _3
                                        CPU HARDWARE INTERRUPT ENABLED ON PIN
                                         _4
                                        CPU HARDWARE INTERRUPT ENABLED ON PIN
                                         _5
                                        PC1 INTERRUPT DISABLED
                                        PC0 INTERRUPT DISABLED
                                        CPU HARDWARE INTERRUPT ENABLED ON PIN
                                         _1
                                        CPU HARDWARE INTERRUPT ENABLED ON PIN
                                         _2
HIRR x0000000000000000
MM_CSR x0000000000003640
                                        INTEGER REGISTER USED IS R 4.
DC_STAT x0000000000000007
                                        DC_HIT LAST LOAD OR STORE MISSED
                                         _DCACHE
                                        OPCODE RA FIELD - INTEGER REGISTER IS R 0.
DC_ADDR x00000000FFFFFFFF SEO SECOND ERROR OCCURRED
ABOX_CTL x000000000000042E
                                        FUNCTIONS ENABLED - MCHECK ENABLED FOR
                                         _UNCORRECTABLE ERRORS
                                        FUNCTIONS ENABLED - CRD CORRECTED READ
                                         _DATA INTERRUPT ENABLED
                                        FUNCTIONS ENABLED - SINGLE ENTRY ICACHE
                                         _STREAM BUFFER ENABLED
                                        FUNCTIONS ENABLED - DCACHE ENABLED
BIU_STAT x0000000000000140
                                        BIU_CMD CYCLE CLASS IS READ_BLOCK
                                        FILL_ECC PRI. CACHE FILL FROM EXT.
                                         _CACHE HAD ECC ERROR
BIU_ADDR x0000000002983520
                                        PHYSICAL ADDRESS OF CACHE BLOCK WITH ERROR IS x14C1A9
BIU_CTL x0000000020007447
                                        EXTERNAL CACHE ENABLED
                                        EXTERNAL CACHE ECC ENABLED
                                        EXTERNAL CACHE READ/WRITE SPEED IN CPU CYCLES IS
                                         _3
                                        EXTERNAL CACHE WRITE ENABLE TIMING BIT FIELD IS x1
FILL_SYNDROME x0000000000000900 SINGLE BIT ERROR IS NO ERRORS
FILL_ADDR x0000000002983520
                                        PHYSICAL ADDRESS OF QUADWORD WITH ERROR x14C1A9
VA x00000000001011F0 D-STREAM FAULT OR DTB MISS - VIRTUAL ADDRESS IS x1011F0
BC_TAG x0000000000002995 EXTERNAL CACHE TAG CONTROL BITS
                                         _EXTERNAL CACHE HIT
                                        D BIT - CACHE BLOCK DIRTY
                                        V BIT - CACHE BLOCK VALID
                                        TAG ADDRESS IS x14C

----- KN15AA CPU SPECIFIC STACK FRAME -----

INT_EXC_IDENT x0000000000000088
                                        INTERRUPT OR EXCEPTION IS NONE
MCR_STAT x0000000011808080 BANK 0 32 MBYTES
                                        BANK 1 32 MBYTES
                                        BANK 2 32 MBYTES
                                        BANK 4 32 MBYTES
IOSLOT x0000000000100000
                                        TURBOCHANEL OPTION SLOT 1 PARITY
                                         _DISABLED
                                        TURBOCHANEL OPTION SLOT 2 PARITY
                                         _DISABLED
                                        TURBOCHANEL OPTION SLOT 4 PARITY
                                         _DISABLED
                                        TURBOCHANEL OPTION SLOT 5 PARITY
                                         _DISABLED
                                        TURBOCHANEL OPTION SLOT 6 PARITY
                                         _DISABLED
                                        TC OPTION SCSI ADAPTER PARITY DISABLED
                                        TC OPTION CORE I/O PARITY DISABLED
                                        TC OPTION CXTURBO PARITY DISABLED
TC_CONFIG x0000000000000016 MAGIC # FOR DMA CONTROL IS x16
                                        PAGE SIZE IS 8KBYTES
IR x000000000007FE00
                                        SECOND ERROR OCCURED
                                        DMA BUFFER ERROR - UNDER/OVER FLOW
                                        CROSSED 2K BOUNDARY ON DMA
                                        TC RESET IN PROGRESS
                                        TC PARITY ERROR
                                        TAG ERROR DURING DMA
                                        SINGLE BIT ERROR ON I/O WRITE OR DMA
                                         _READ
                                        DOUBLE BIT ERROR ON I/O WRITE OR DMA
                                         _READ
                                        TC TIMEOUT ON I/O REQUEST

--------------
{another system startup}
--------------

********************************* ENTRY 6. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 302. PANIC
SEQUENCE NUMBER 5.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Sun Oct 13 21:17:08 1996
OCCURRED ON SYSTEM gore
SYSTEM ID x00020004 CPU TYPE: DEC 3000
SYSTYPE x00000000
MESSAGE panic (cpu 0): Machine check -
                                         _Hardware error
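
To check whether the three symptoms line up in time, the entries above can be tallied by event type and timestamp. The following is only a minimal sketch, assuming the uerf output has been saved as plain text in the layout shown above; the function names parse_uerf and summarize are hypothetical and not part of uerf or DEC OSF/1.

```python
import re
from collections import Counter

# Matches banner lines such as "****** ENTRY 1. ******" (layout assumed
# from the listing above).
ENTRY_RE = re.compile(r"^\*+ ENTRY\s+(\d+)\. \*+$")

# Header fields worth keeping for a quick timeline of events.
FIELDS = ("EVENT CLASS", "OS EVENT TYPE", "OCCURRED/LOGGED ON")


def parse_uerf(text):
    """Split saved uerf text into one dict per entry (hypothetical helper)."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = ENTRY_RE.match(line)
        if match:
            # A banner line starts a new entry.
            current = {"entry": int(match.group(1))}
            entries.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first banner
        for key in FIELDS:
            if line.startswith(key):
                current[key] = line[len(key):].strip()
    return entries


def summarize(entries):
    """Count entries per OS event type."""
    return Counter(e.get("OS EVENT TYPE", "unknown") for e in entries)
```

Comparing the OCCURRED/LOGGED ON values of the CPU exception and panic entries against the times of the CAM SCSI errors and the failed cron dumps would show whether the events cluster together.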
Received on Wed Oct 16 1996 - 21:02:09 NZDT
