missing CPUs

From: <bsramsey_at_CCGATE.HAC.COM>
Date: Tue, 11 Nov 1997 15:11:06 +0000

     We have 4 DEC Alpha dual-processor 2100As running Digital UNIX (DU)
     4.0B. Three of them have 'lost' one of their CPUs without any notice
     to the console or the error log.
     
     Has anyone seen this behavior, or does anyone have an idea what would cause it?
     
     I apologize for the length of this message in advance!
     
     BACKGROUND:
     We have four rack-mounted Alpha 2100A 5/300 systems. They are
     identically configured:
     
        2x 21164 300 MHz CPUs
        128 MB RAM
        1 RZ29B
        KZPAA SCSI controller
        1 DECchip 21040-AA Ethernet controller
        DWPVC-BA PCI-VME host bus adapter
        1 TLZ09 4 mm DAT tape drive
     
        Third-party cards include:
        
        1 SCRAMNet+ PCI reflective memory card (one machine only)
        1 DCC5-P (clock card) - all machines
        1 Bit-3 PCI card - all machines
        1 VMIC VMIPCI-5588 Reflective Memory card - all machines
     
     The kernels on these systems are configured to communicate with a VME
     chassis and have several third party device drivers installed.
     
     PROBLEM:
     While troubleshooting a piece of code that required multiple CPUs, I
     discovered that only one CPU was showing up on 2 of the 4 systems.
       
     I shut one system down, and from the console prompt I could see both
     CPU cards, but one of them had question marks in the Module ID and
     both cards had an "F" under Status.
     
     P00>>> show config
     
     Component  Status  Module ID
     CPU 0      F       B2040-BA  DECchip(tm) 21164-5
     CPU 3      F       B2040-BA  DECchip(tm) ?????-?
     
     P00>>> show fru
                                   Rev        Events logged
        Option  Part#      Hw  Sw  Serial#     SDD  TDD

        CPU0    B2040-BA   B1  34  KA708TYVWA  00   01
        CPU3    B2040-BA   B1  34  KA708TYVVJ  00   01
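
     As an illustration only, here is a minimal Python sketch (not a tool
     from DEC or hardware support) that scans captured "show config"
     output for the two symptoms described above: an "F" in the Status
     column (both affected CPUs showed one) and a Module ID containing
     question marks. It assumes the console output has been saved as text,
     for example from a serial console log, and that the column layout
     matches the transcript. The function name suspicious_cpus is
     hypothetical.

```python
import re

# Sample SRM "show config" output, taken from the transcript in this post.
SHOW_CONFIG = """\
Component  Status  Module ID
CPU 0      F       B2040-BA  DECchip(tm) 21164-5
CPU 3      F       B2040-BA  DECchip(tm) ?????-?
"""

# Assumed layout: "CPU <n> <status> <module> <chip id ...>"
CPU_LINE = re.compile(r"^CPU\s+(\d+)\s+(\S+)\s+(\S+)\s+(.*)$")


def suspicious_cpus(text):
    """Return (cpu_number, reason) pairs for CPU lines that look unhealthy."""
    findings = []
    for line in text.splitlines():
        m = CPU_LINE.match(line.strip())
        if not m:
            continue  # header or non-CPU component line
        cpu = int(m.group(1))
        status, chip = m.group(2), m.group(4)
        # Assumption: "F" is the bad status, as seen on both affected CPUs.
        if status == "F":
            findings.append((cpu, f"status {status}"))
        # Question marks mean the console could not read the module ID.
        if "?" in chip:
            findings.append((cpu, "unreadable module ID"))
    return findings


if __name__ == "__main__":
    print(suspicious_cpus(SHOW_CONFIG))
    # prints [(0, 'status F'), (3, 'status F'), (3, 'unreadable module ID')]
```

     Run against the transcript above, it reports both CPUs with status F
     and flags CPU 3's unreadable module ID.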
     
     
     RESOLUTION:
     I called hardware support, and they recommended the following action.
     It worked, but since then the problem has occurred on a THIRD system!
     
     P00>>> clear_error all
     P00>>> init
     P00>>> show fru (write down the info)
     P00>>> set mode diag
     P00>>> build cpu3 b2040-ba ka01234567 b1 34
     P00>>> init
     
     If the question marks are still there, cycle the power and boot.
     
     This seems to fix the problem, but it is likely to happen again. I
     would like to understand how to prevent it before these systems go
     into production!
     
     Thanks in advance for your help!
     
     Brenda Ramsey
     Hughes Training, Inc.
     Arlington, Texas
     
Received on Tue Nov 11 1997 - 22:55:27 NZDT
