Memory errors

From: Vladas Lapinskas <lapinskas_at_mail.iae.lt>
Date: Thu, 01 Apr 1999 14:17:12 +0300 (EEST)

        Dear managers!

        I administer several Alphas (five model 3000 workstations and two
2100 servers). It seems that I have memory errors on one workstation and
one server. I need your help deciding whether to order new memory chips
and, if so, which ones should be replaced. I will try to explain in detail
what information I have, so please excuse my long letter (and my ignorance
of hardware).

On one workstation I see the following messages in /var/adm/messages:

--------------------------------------------------------------------------
Dec 16 17:15:01 ignosc3 vmunix: Memory error corrected by processor
Dec 16 17:15:01 ignosc3 vmunix: biu_stat = 0000000000001b40
Dec 16 17:15:01 ignosc3 vmunix: biu_addr = 00000001d4000018
Dec 16 17:15:01 ignosc3 vmunix: dc_stat = 0000000000000003
Dec 16 17:15:01 ignosc3 vmunix: fill_syndrome = 0000000000000015
Dec 16 17:15:01 ignosc3 vmunix: fill_addr = 0000000002b39740
Dec 16 17:15:01 ignosc3 vmunix: bc_tag = 0000000000402c12
Dec 16 17:15:01 ignosc3 vmunix: ident = 0
--------------------------------------------------------------------------

        This occurs about once every two months. What does it mean: a bad
memory chip, or something else? I have run test mem from the boot monitor a
number of times, but it did not find any memory problem. Should I worry
about this? Should I replace the chip (and how could I find which one)?
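
One way to check whether the corrected errors keep hitting the same location is to collect every such report from the log and compare the register values. The following is only a minimal sketch: the function name parse_corrected_errors is made up for illustration, and it assumes every report has the same layout as the excerpt above (a "Memory error corrected by processor" line followed by "name = hex" lines). It does not interpret the registers, it only groups them.

```python
import re

# First line of a report, e.g.
# "Dec 16 17:15:01 ignosc3 vmunix: Memory error corrected by processor"
HEADER = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+[\d:]{8})\s+(?P<host>\S+)\s+vmunix:\s+"
    r"Memory error corrected by processor\s*$"
)
# Follow-up lines, e.g. "... vmunix: fill_addr = 0000000002b39740"
FIELD = re.compile(r"vmunix:\s+(?P<name>\w+)\s*=\s*(?P<value>[0-9a-fA-F]+)\s*$")


def parse_corrected_errors(lines):
    """Return one dict per corrected-error report found in `lines`.

    Assumption: "name = hex" lines that follow a report header belong to
    that report. Unrelated kernel lines of the same shape would be
    attached to it as well.
    """
    events = []
    current = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            current = {"stamp": header.group("stamp"),
                       "host": header.group("host")}
            events.append(current)
            continue
        field = FIELD.search(line)
        if field and current is not None:
            current[field.group("name")] = field.group("value")
    return events
```

Calling parse_corrected_errors(open("/var/adm/messages")) and comparing the fill_addr and fill_syndrome values across reports would show whether the same address recurs, which is the kind of evidence that helps separate a marginal chip from a one-off event.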



        On one server I get the following message during the boot test:

--------------------------------------------------------------------------
Testing Memory bank 0

***Error - Memory Board 2 ***
Failing address: 005c0820
Bank Number: 0
ASIC ID: 0
Error Type: 0
Error Syndrome: 000006c7

Configuring Memory Modules
....
Memory Testing and Configuration Status
Module   Size    Base Addr   Intlv Mode   Intlv Unit   Status
------   -----   ---------   ----------   ----------   ------
  2      128MB   00000000    2-Way        0            Passed
  3      128MB   00000000    2-Way        1            Passed
Total Bad Pages 1
--------------------------------------------------------------------------


        This error does not occur on every reboot, only about half the
time (we do not reboot the server often, but I have experimented a little).
At the boot monitor, the show error command gives me the following:

--------------------------------------------------------------------------
MEM2 Module EEROM Event Log

Test Directed Errors

No Entries Found

Symptom Directed Errors

Entry   Fail Address   Bits/Syndrome   Bank #   ASIC #   Source   Event Type
  00    005c0220       06c7            0        0        1        00
  01    005c0820       06c7            0        0        1        00
  02    005c0620       06c7            0        0        1        00
--------------------------------------------------------------------------
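
The three logged fail addresses share one syndrome and look very close together. A small sketch can confirm how close. It assumes that the logged values are plain byte addresses and that the page size is 8 KB; both are assumptions, not something the console output states, and the helper names are made up for illustration.

```python
PAGE_SIZE = 8192  # assumption: 8 KB pages

# Fail addresses copied from the MEM2 event log above.
fail_addresses = [0x005C0220, 0x005C0820, 0x005C0620]


def page_numbers(addresses, page_size=PAGE_SIZE):
    """Return the sorted set of page numbers the addresses fall into."""
    return sorted({addr // page_size for addr in addresses})


def differing_bits(addresses):
    """Return a bit mask of the positions where the addresses disagree."""
    mask = 0
    for addr in addresses[1:]:
        mask |= addresses[0] ^ addr
    return mask


print(page_numbers(fail_addresses))         # [736]
print(hex(differing_bits(fail_addresses)))  # 0xe00
```

Under those assumptions all three entries fall in a single page and differ only in address bits 9 to 11, which is consistent with the "Total Bad Pages 1" line in the boot test output.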

        The show fru command gives me the following:

--------------------------------------------------------------------------
                         Rev               Events logged
 Slot  Option  Part#     Hw  Sw  Serial#     SDD  TDD
   0   IO      B2110-AA  K3   0  AY52803390  00   00
   1   CPU2    B2040-AB  B1  37  AY63504963  00   00
   2   CPU0    B2040-AB  B1  37  AY60916498  00   00
   3   CPU1    B2040-AB  B1  37  AY62702853  00   00
   5   CPU3    B2040-AB  B1   0  AY62121147  00   00
   6   MEM2    B2021-CA  B1   0  AY45013015  03   00
   7   MEM3    B2022-DA  B1   0  AY53407756  00   00

 Slot Option Hose 0, Bus 0, PCI on Standard I/O
   6 DECchip 21040-AA PCI Option Slot 0
   8 DEC PCI FDDI PCI Option Slot 2

 Slot Option Hose 0, Bus 1, EISA on Standard I/O
   2 CPQ3111

--------------------------------------------------------------------------

        So I think I should replace the memory board in slot 6 of the
server, and probably replace memory in the workstation. Am I correct? How
can I tell which chip to replace in the workstation? Thank you in advance!

---
Vladas Lapinskas, mailto:lapinskas_at_mail.iae.lt
Received on Thu Apr 01 1999 - 12:21:09 NZST
