---

These are caused by correctable bit errors in memory. AlphaStations use ECC (Error Correcting Code) memory, in which a few extra bits in each memory word provide enough parity information to allow repair of single-bit errors and detection of double-bit errors in a word of memory. If the messages are very intermittent, you probably have little to worry about; the messages above indicate that a single-bit error was successfully corrected, so there was no data corruption. If they are very common, or the messages indicate uncorrected errors, you may want to replace the memory in the machine. Unfortunately, we haven't found out how to trace ECC errors down to a particular memory module, so you may need to replace all the memory, or do several changeouts with different sets of modules, to eliminate a single bad module.

Steve VanDevender <stevev_at_hexadecimal.uoregon.edu>

---

Hi,

From the looks of it, an error occurred in your main memory but was corrected while the dynamic cache was being filled, i.e., while data was being buffered in it. If this is an isolated incident, it could be nothing serious. However, if it has happened before, it could be an indication that your memory chips are going bad. I'll ask around and get you more specific info on it.

Hope this helps.
Santosh

Santosh Krishnan x2815 <santosh_at_heplinux1.uta.edu>

---

Sounds like a single-bit memory error that the ECC (error-correcting) memory fixed. This is generally a problem with physical memory. I've seen these result from a misaligned memory SIMM and from broken SIMMs. If you have hardware support, call DEC and feed them the info you gave us.

Good luck.

Ed Jones        Internet email: EJONES16_at_ford.com
CAD/CAM/PIM     Internal email: ejones16_at_cadcam.pms.ford.com
313-845-6068 B220 Suite 100 ALPHA

---

Hi,

This is a non-fatal memory error (radio-electrical perturbation, etc.) detected and corrected by the CPU.
It is a "normal" error message unless you get a lot of them; in that case, a bank of memory may be about to fail.

Patrice.

/############################################################################\
#                               Patrice LEGOUX                               #
#------------------------------------+---------------------------------------#
# ADP/Gsi Mini Services              | Decnet : GSUV09::LEGOUX               #
# 4 Rue Sentou                       +---------------------------------------#
# 92150 SURESNES                     | Tel : +33(0)14625.5054                #
# France                             | Fax : +33(0)147.72.04.99              #
#                                    +---------------------------------------#
# E-Mail : Patrice.Legoux_at_gsi.fr                                          #
# X400 : C=FR; ADMD=ATLAS; PRMD=GSI; O=GSI; S=LEGOUX; G=PATRICE              #
# Memo : LEGOUX                                                              #
\############################################################################/

---

It looks like you may have some memory that is going bad. If you have a support contract, talk to DEC and see if you can get them to give you a PAK for DECEvent (it should be free if you are on support). DECEvent will let you trace failing memory down to the SIMM level.

I am not sure whether your error is coming from system memory or the L2 cache. In any event, DECEvent should help you trace it down. If you are already using DECEvent (your error log looks pretty detailed for uerf), look at the full listings; they should give you something like this:

----- snip ----- snip ----- snip ----- snip ----- snip -----

******************************** ENTRY 113 ********************************

Logging OS                  2.  Digital UNIX
System Architecture         2.  Alpha
Event sequence number       5.
Timestamp of occurrence     01-OCT-1996 20:42:27
Host name                   ssdeng

System type register        x0000000C  AlphaServer 8x00
Number of CPUs (mpnum)      x00000004
CPU logging event (mperr)   x00000000

Event validity              1.  O/S claims event is valid
Event severity              5.  Low Priority
Entry type                  100.  CPU Machine Check Errors

CPU Minor class             4.
                              620 System Correctable Error

--TLaser 620 Corr Error--

Software Flags              x00000001  TLSB Error Log Snapshot Packet Present
Active CPUs                 x0000000F
Hardware Rev                x00000000
System Serial Number        ni54600czq
Module Serial Number        AY55025362
System Revision             x00000000
MCHK Reason Mask            x00000086
MCHK Frame Rev              x00000000

EI STAT                     xFFFFFFF0C4FFFFFF
                              DATA SOURCE IS MEMORY OR SYSTEM
                              CORRECTABLE ECC ERROR
                              D-ref fill
                              EV5 Chip Rev 4
EI ADDRESS                  xFFFFFF003300CF7F
FILL SYNDROME               x0000000000000029  Data Bit = 011
ISR                         x0000000100100000
                              Ext. HW interrupt at IPL20
                              Correctable ECC errors (IPL31)
                              AST requests 3 - 0 x0000000000000000
WHAMI                       x00
                              TLSB NODE ID 0.
                              CPU0
MISCR                       x55
                              B-Cache Size 4 Mbyte Bcache
                              Two Processors
                              TLSB RUN Signal
                              CPU0 Running console
TLDEV                       x51008014
                              Device Type Turbo-Laser Dual CPU, 4meg Bcache
                              Device Rev x00005100
TLBER                       x00440000
                              CORRECTABLE READ DATA ERROR
                              DATA SYNDROME 2
TLESR0                      x00405400
TLESR1                      x00400C0C
TLESR2                      x00602900
                              ECC Syndrome 0 x00000000
                              ECC Syndrome 1 x00000029
                              CORRECTABLE READ ECC ERROR
                              Error Syndrome 0 x00 No Error
                              Error Syndrome 1 x29 Data Bit = 139
TLESR3                      x00409090
Palcode Revision            x0000000600000301  Palcode Rev: 3.1-1

*TLaser CPU Registers*

TLSB Node Number            0.
TLDEV                       x8014  Turbo-Laser Dual CPU, 4meg Bcache
TLBER                       x00440000
                              CORRECTABLE READ DATA ERROR
                              DATA SYNDROME 2
TLCNR                       x00000200
TLVID                       x00000010
TLESR0                      x00405400
TLESR1                      x00400C0C
TLESR2                      x00602900
                              ECC Syndrome 0 x00000000
                              ECC Syndrome 1 x00000029
                              CORRECTABLE READ ECC ERROR
TLESR3                      x00409090
TLEPAERR                    x00000000
MODCONFIG                   x00098AD4
                              Lockout Enable
                              Command Piping To EV5 Disabled
                              Bcache Size: 4 MB
                              Bcache Idle Cycles Before 11.
                              Max Command Queue Entries 2.
                              Max Bus Queue Entries 4.
TLEPMERR                    x00000000
TLEPDERR                    x00000000
TLEP Interrupt Mask 0       x000000FE
                              IPL 14 Interrupt Enable
                              IPL 15 Interrupt Enable
                              IPL 16 Interrupt Enable
                              IPL 17 Interrupt Enable
                              Interprocessor Interrupt Enable
                              Interval Timer Interrupt Enable
                              CPU Halt Enable
TLEP Interrupt Summary 0    x00000040  Interval Timer Interrupt Outstanding
TLEP Interrupt Mask 1       x00000000
TLEP Interrupt Summary 1    x00000000

*TLaser CPU Registers*

TLSB Node Number            1.
TLDEV                       x8014  Turbo-Laser Dual CPU, 4meg Bcache
TLBER                       x00800000
TLCNR                       x00000210
TLVID                       x00000032
TLESR0                      x00000303
TLESR1                      x00000303
TLESR2                      x00000303
TLESR3                      x00000303
TLEPAERR                    x00000000
MODCONFIG                   x00098AD4
                              Lockout Enable
                              Command Piping To EV5 Disabled
                              Bcache Size: 4 MB
                              Bcache Idle Cycles Before 11.
                              Max Command Queue Entries 2.
                              Max Bus Queue Entries 4.
TLEPMERR                    x00000000
TLEPDERR                    x00000000
TLEP Interrupt Mask 0       x000000FE
                              IPL 14 Interrupt Enable
                              IPL 15 Interrupt Enable
                              IPL 16 Interrupt Enable
                              IPL 17 Interrupt Enable
                              Interprocessor Interrupt Enable
                              Interval Timer Interrupt Enable
                              CPU Halt Enable
TLEP Interrupt Summary 0    x00000000
TLEP Interrupt Mask 1       x00000000
TLEP Interrupt Summary 1    x00000000

* TLaser Memory Regs *

TLSB Node Number            7.
TLDEV                       x5000  Turbo-Laser Memory Module
TLBER                       x01440000
                              CORRECTABLE READ DATA ERROR
                              DATA SYNDROME 2
                              DATA TRANSMITTER DURING ERROR
TLCNR                       x000FC270
TLVID                       x00000080
FADR                        x078200003300CF40
FADR 1                      x07820000
                              Failing Command: Read
                              Failing Bank = Bank 8
TLESR0                      x00005400
TLESR1                      x00000C0C
TLESR2                      x00212900
                              ECC Syndrome 0 x00000000
                              ECC Syndrome 1 x00000029
                              TRANSMITTER DURING ERROR
                              CORRECTABLE READ ECC ERROR
                              ECC Code x00
                              Second ECC Code x29
                              Failing SIMM Number = J17
TLESR3                      x00009090
TMIR                        x80000001  Interleave x00000001
TMCR                        x0000023D
                              2GB Module (E2036-AA)
                              16 MB 70ns DRAM
                              Strings Installed = 8
                              DRAM timing: Bus Spd = 13.0-15.0; Refresh Cnt = 1008
TMER                        x00000005  Failing String = x00000005
TMDRA                       x00000000  Refresh Rate 1X
TDDR0                       x00000000
TDDR1                       x00000000
TDDR2                       x00000000
TDDR3                       x00000000

* TLaser I/O Registers *

TLSB Node Number            8.
TLDEV                       x2020  Turbo-Laser Integrated I/O Module
TLBER                       x00000000
FADR 0                      x0000000000000000
FADR 1                      x00000000
TLESR0                      x00000000
TLESR1                      x00000000
TLESR2                      x00000000
TLESR3                      x00000000
CPU Interrupt Mask          x00000001  Cpu Interrupt Mask = x00000001
ICCMSR                      x00000000
                              Arbitration Control Minimum Latency Mode
                              Supress Control Suppress after 16 Transations
ICCNSE                      x80000000  Interrupt Enable on NSES Set
ICCMTR                      x00000000
IDPNSE-0                    x00000006
                              Hose Power OK
                              Hose Cable OK
IDPNSE-1                    x00000006
                              Hose Power OK
                              Hose Cable OK
IDPNSE-2                    x00000000
IDPNSE-3                    x00000000
IDPVR                       x00000800
ICCWTR                      x00000000
TLMBPR                      x0000000000000000
IDPDR0                      x20000000
IDPDR1                      x20000000
IDPDR2                      x00000000
IDPDR3                      x00000000

----- snip ----- snip ----- snip ----- snip ----- snip -----

As you can see from the "TLaser Memory Regs" section, the listing identifies the failing memory down to the module, bank, and SIMM.

Some other things to keep in mind are:

(1) There will be a certain noise level of ECC-recovered errors, even with perfectly good memory. The causes include cosmic rays, naturally occurring isotopes in the memory emitting particles at high speed, EMF, power spikes, and other perfectly normal reasons why bits may get twiddled when they shouldn't. That is why we pay the extra $$ for ECC RAM. I wouldn't get worried unless you start to see a lot of errors from one or two SIMMs.

(2) DECEvent (and uerf) seem to believe that these are not memory errors but CPU errors, and you must extract them as such. Here is the command line I used to pull out the entry above:

    dia -icpu -R -o full > /tmp/delme

Please note I haven't said anything about using DECEvent's auto-diagnostic features. That's because it managed to miss a failing SIMM on our 8400 that had a couple of hundred errors logged on it. DEC is still trying to figure out why it didn't recognize it.
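As an illustration of watching for "a lot of errors from one or two SIMMs", here is a minimal Python sketch that tallies the decoded "Failing SIMM Number = ..." lines in a saved full report. It assumes the report is plain text and that every memory entry carries a line of that form, as the sample entry above does. The function names and the threshold are hypothetical, not part of DECEvent.

```python
import re
from collections import Counter

# Matches decoded lines such as "Failing SIMM Number = J17" (format taken
# from the sample entry; other systems may word this differently).
SIMM_RE = re.compile(r"Failing SIMM Number\s*=\s*(\S+)")


def count_failing_simms(report_text):
    """Return a Counter mapping SIMM label -> number of logged errors."""
    return Counter(SIMM_RE.findall(report_text))


def suspect_simms(counts, threshold):
    """Return SIMM labels logged at least `threshold` times (arbitrary cutoff)."""
    return sorted(simm for simm, n in counts.items() if n >= threshold)


# Small inline sample standing in for the contents of a saved report.
sample_report = """
Failing SIMM Number = J17
Failing SIMM Number = J17
Failing SIMM Number = J3
"""

counts = count_failing_simms(sample_report)
for simm, n in counts.most_common():
    print(f"{simm}: {n} correctable error(s)")
```

To run it against a real report, read the saved file (for example the /tmp/delme file produced by the command above) and pass its text to count_failing_simms.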
Hope this helps,
Tom
--
+--------------------------------+------------------------------+
| Tom Webster                    | "Funny, I've never seen it   |
| SysAdmin MDA-SSD ISS-IS-HB-S&O | do THAT before...."          |
| webster_at_ssdpdc.mdc.com      | - Any user support person    |
+--------------------------------+------------------------------+
| Unless clearly stated otherwise all opinions are my own.      |
+---------------------------------------------------------------+

---

You've got a self-corrected ECC error on one (or more) of your memory chips. I suppose you got that output from uerf. It's not dangerous, as these chips are equipped with single-bit error correction hardware. This could get worse if you have 2 bad bits in a memory cell; in that case, the processor will not be able to correct the error and the system would (probably) crash.

If that kind of error persists, contact DEC and have them replace the faulty memory board if it's under guarantee (those beasts are NOT CHEAP).

We once filled the binary.errlog file with that kind of message. The binary.errlog file grew to 72 MB in 10 minutes! Result: file system full, system brought to its knees. We cleared the log and had no problems for a month or so. We checked the uerf log regularly, and when the problem resurfaced a month later, we had DEC replace the memory board.

Also, uerf sometimes (incorrectly) reports that kind of error as a CPU EXCEPTION, due to a bug in uerf's binary.errlog parsing. This should be fixed in revisions later than 3.2D-1.

HTH

Guy Dallaire
dallaire_at_total.net
"God only knows if god exists"

---

Hi,

I received the exact same error on the exact same machine type. DIGITAL said it was a correctable SIMM error, and to watch for repeats and only then worry ("replace the SIMM"). I only see two in uerf.

Scot

Scot <scot_at_engrs.infi.net>

---

It means you've got a bad memory chip that the ECC circuits are compensating for. You need to get the board or chip replaced.
regards,
Ross
--
Ross Alexander, ve6pdq -- (403) 675 6311 -- rwa_at_cs.athabascau.ca

---

It was a message about a corrected memory error. If you got one, it's probably no problem. If you get tens a day, you should look into it. ECC memory corrects single-bit failures and detects double-bit failures, but cannot fix the latter. So if you get lots of these messages, you should probably replace your memory.

Harald Lundberg <hl_at_tekla.fi>;Tekla Oy,Koronakatu 1,FIN-02210,ESPOO,FINLAND
tel +358-{9-8879449work,9-8039489fax,9-8026752,19-2418013res,50-5578303mob)

---

--
Nicci Roth
OTA Limited Partnership
1 Manhattanville Rd.
Purchase, NY 10577
phone: 914/694 5800
fax: 914/694 5831
email: nicci_at_ox.com

If you can't find any other meaning in everything that's happening,
try to consider it as entertainment.

Received on Tue Dec 03 1996 - 19:55:13 NZDT
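Several replies above describe ECC memory as correcting single-bit errors and detecting, but not correcting, double-bit errors. The Python sketch below illustrates that behaviour with a textbook extended Hamming (8,4) code: 4 data bits, 3 check bits, and 1 overall parity bit. This is an assumption chosen for brevity; it is not the code used by the Alpha hardware, which protects much wider memory words. The function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    w = [0] * 8                                 # w[0] is the overall parity bit
    w[3], w[5], w[6], w[7] = d                  # data bits sit at positions 3, 5, 6, 7
    w[1] = w[3] ^ w[5] ^ w[7]                   # check bit covering positions 1, 3, 5, 7
    w[2] = w[3] ^ w[6] ^ w[7]                   # check bit covering positions 2, 3, 6, 7
    w[4] = w[5] ^ w[6] ^ w[7]                   # check bit covering positions 4, 5, 6, 7
    w[0] = sum(w[1:]) % 2                       # make the parity of the whole word even
    return w


def decode(word):
    """Return (status, nibble): status is 'ok', 'corrected' or 'uncorrectable'."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):                     # XOR together positions of all set bits
        if w[pos]:
            syndrome ^= pos
    parity_odd = sum(w) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        w[syndrome] ^= 1                        # single-bit error: syndrome is its position
        status = "corrected"                    # (syndrome 0 means the parity bit itself)
    else:
        return "uncorrectable", None            # even parity, non-zero syndrome: two flips
    nibble = w[3] | (w[5] << 1) | (w[6] << 2) | (w[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                                    # simulate one flipped bit
print(decode(word))
word[2] ^= 1                                    # simulate a second flipped bit
print(decode(word))
```

The extra overall parity bit is what separates the two cases: an odd number of flipped bits is treated as a single, correctable error, while a non-zero syndrome with even parity means two bits changed and the word can only be flagged.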