SUMMARY and UPDATE: Hardware errors.

From: Tom Fenn <tom_at_spirit.gcrc.upenn.edu>
Date: Thu, 05 Mar 1998 13:01:56 -0500

Here's my original message:


>One of our Alpha 255's (dunix 4.0b) will intermittently panic.
>So far this has only occurred while some heavy disk I/O was
>going on (moving >100MB to/from tape or CD),
>but it doesn't always occur when heavy disk I/O is occurring.
>
>Sometimes I can back up /usr/users, and sometimes I can't. So
>far, it's never panicked while I was backing up the partitions on
>rz0. This leads me to suspect that the rz1 disk is going bad.
>
>However, a couple of times that this has happened, the system
>had a hard time finding the boot record on the first drive. This makes
>me suspicious of rz0.
>
>The fact that I'm maybe seeing trouble with both disks makes me suspect
>that the drive controller is the culprit.
>
>So, like the hero of a bad crime drama, I have many suspicions, and
>no evidence. Can anyone give me any suggestions on how to narrow
>this down further?
>
>All of our systems are only about a year old. I'm very surprised that
>I'm having any trouble of this sort already.

I had a few messages pointing out additional possible causes for the
errors:

    overheating             Ronald Bowman (rdbowma_at_tsi.clemson.edu)
    bad SCSI termination    Eivind Olsen (eolsen_at_dscc.dk)
    bad motherboard         Olle Eriksson (olle_at_cb.uu.se)

Alan Rollow (nabeth.cxo.dec.com) pointed me to the uerf command
and the system log files for more information. This has changed
the picture somewhat, as no disk or disk controller error messages have
been logged. I'm including below the relevant information that I have obtained.
If I'm reading the data right, the problem seems to be the CPU (bad motherboard?).

From Kurt Carlson (snkac_at_java.sois.alaska.edu) comes information on a set of
utilities for making the output from uerf more useful. I haven't tried them out yet, but
here are the URLs:
        ftp://raven.alaska.edu/pub/sois/README.uaio
  kit: ftp://raven.alaska.edu/pub/sois/uaio-v1.9.tar.Z
He also pointed out that OS bugs could be causing the trouble, and that the OS
should have all the patches applied. (Unlikely in my case: I have 5 identical
machines, all at the same revision levels, and only one has this behavior.)

Anyway, for those with the time and inclination, here is the relevant output from
uerf for my system:

#
# These are the messages from the system startup following the panic
#

                                                  uerf version 4.2-011 (122)



********************************* ENTRY 1. *********************************

----- EVENT INFORMATION -----

EVENT CLASS OPERATIONAL EVENT
OS EVENT TYPE 300. SYSTEM STARTUP
SEQUENCE NUMBER 0.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Wed Mar 4 17:04:29 1998
OCCURRED ON SYSTEM alpha1
SYSTEM ID x0006000D CPU TYPE: DEC 7000
SYSTYPE x00000000
MESSAGE Alpha boot: available memory from
                                         _0x860000 to 0x7ffe000
                                        Digital UNIX V4.0B (Rev. 564); Wed
                                         _Mar 4 13:53:06 EST 1998
                                        physical memory = 128.00 megabytes.
                                        available memory = 119.64 megabytes.
                                        using 484 buffers containing 3.78
                                         _megabytes of memory
                                        AlphaStation 255/300 system
                                        DECchip 21071
                                        82378IB (SIO) PCI/ISA Bridge
                                        Firmware revision: 6.4
                                        PALcode: OSF version 1.46
                                        pci0 at nexus
                                        psiop0 at pci0 slot 6
                                        Loading SIOP: script 801900, reg
                                         _82008000, data 405cd880
                                        scsi0 at psiop0 slot 0
                                        rz0 at scsi0 target 0 lun 0 (LID=0)
                                         _(DEC RZ26F (C) DEC 630J)
                                        rz1 at scsi0 target 1 lun 0 (LID=1)
                                         _(DEC RZ29B (C) DEC 0016)
                                        rz4 at scsi0 target 4 lun 0 (LID=2)
                                         _(DEC RRD45 (C) DEC 0436)
                                        tz5 at scsi0 target 5 lun 0 (LID=3)
                                         _(DEC TLZ09 (C)DEC 0165)
                                        isa0 at pci0
                                        gpc0 at isa0
                                        ace0 at isa0
                                        ace1 at isa0
                                        lp0 at isa0
                                        fdi0 at isa0
                                        le0 at isa0
                                        le0: DEC LeMAC Ethernet Interface,
                                         _hardware address: 00-00-F8-50-60-D1
                                        tga0 at pci0 slot 13
                                        tga0: depth 8, map size 2MB, 1280x1024
                                        tga0: ZLXp2-E, Revision: 34
                                        tu0: DECchip 21040-AA: Revision: 2.4
                                        tu0 at pci0 slot 14
                                        tu0: DEC TULIP Ethernet Interface,
                                         _hardware address: 00-00-F8-23-32-C1
                                        tu0: console mode: selecting 10BaseT
                                         _(UTP) port: half duplex
                                        kernel console: tga0
                                        dli: configured


#
# These are the messages produced by the panic
#


********************************* ENTRY 1. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 1.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Wed Mar 4 17:01:07 1998
OCCURRED ON SYSTEM alpha1
SYSTEM ID x0006000D CPU TYPE: DEC 7000
SYSTYPE x00000000

----- UNIT INFORMATION -----

UNIT CLASS CPU

                                                  uerf version 4.2-011 (122)


********************************* ENTRY 1. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 302. PANIC
SEQUENCE NUMBER 2.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Wed Mar 4 17:01:10 1998
OCCURRED ON SYSTEM alpha1
SYSTEM ID x0006000D CPU TYPE: DEC 7000
SYSTYPE x00000000
MESSAGE panic (cpu 0): Machine check -
                                         _Hardware error


#
# These are the startup messages from before the panic
#
# Happily, I don't see any difference between before and after.
#

********************************* ENTRY 2. *********************************

----- EVENT INFORMATION -----

EVENT CLASS OPERATIONAL EVENT
OS EVENT TYPE 300. SYSTEM STARTUP
SEQUENCE NUMBER 0.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Wed Mar 4 15:48:45 1998
OCCURRED ON SYSTEM alpha1
SYSTEM ID x0006000D CPU TYPE: DEC 7000
SYSTYPE x00000000
MESSAGE Alpha boot: available memory from
                                         _0x860000 to 0x7ffe000
                                        Digital UNIX V4.0B (Rev. 564); Wed
                                         _Mar 4 13:53:06 EST 1998
                                        physical memory = 128.00 megabytes.
                                        available memory = 119.64 megabytes.
                                        using 484 buffers containing 3.78
                                         _megabytes of memory
                                        AlphaStation 255/300 system
                                        DECchip 21071
                                        82378IB (SIO) PCI/ISA Bridge
                                        Firmware revision: 6.4
                                        PALcode: OSF version 1.46
                                        pci0 at nexus
                                        psiop0 at pci0 slot 6
                                        Loading SIOP: script 801900, reg
                                         _82008000, data 405cd880
                                        scsi0 at psiop0 slot 0
                                        rz0 at scsi0 target 0 lun 0 (LID=0)
                                         _(DEC RZ26F (C) DEC 630J)
                                        rz1 at scsi0 target 1 lun 0 (LID=1)
                                         _(DEC RZ29B (C) DEC 0016)
                                        rz4 at scsi0 target 4 lun 0 (LID=2)
                                         _(DEC RRD45 (C) DEC 0436)
                                        tz5 at scsi0 target 5 lun 0 (LID=3)
                                         _(DEC TLZ09 (C)DEC 0165)
                                        isa0 at pci0
                                        gpc0 at isa0
                                        ace0 at isa0
                                        ace1 at isa0
                                        lp0 at isa0
                                        fdi0 at isa0
                                        le0 at isa0
                                        le0: DEC LeMAC Ethernet Interface,
                                         _hardware address: 00-00-F8-50-60-D1
                                        tga0 at pci0 slot 13
                                        tga0: depth 8, map size 2MB, 1280x1024
                                        tga0: ZLXp2-E, Revision: 34
                                        tu0: DECchip 21040-AA: Revision: 2.4
                                        tu0 at pci0 slot 14
                                        tu0: DEC TULIP Ethernet Interface,
                                         _hardware address: 00-00-F8-23-32-C1
                                        tu0: console mode: selecting 10BaseT
                                         _(UTP) port: half duplex
                                        kernel console: tga0
                                        dli: configured
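
The before/after comparison of the two startup messages was done by eye. A
rough Python sketch of how it could be automated is below. The helper names
normalize and startup_diff are made up for illustration, and the sketch assumes
the MESSAGE text has been cut out of the uerf output by hand, with continuation
lines still marked by a leading underscore as shown above.

```python
import difflib

def normalize(message_block):
    """Strip uerf's indentation and rejoin its '_' continuation lines."""
    lines = []
    for raw in message_block.splitlines():
        text = raw.strip()
        if not text:
            continue
        if text.startswith("_") and lines:
            # Continuation of the previous line: glue it back on.
            lines[-1] += " " + text[1:]
        else:
            lines.append(text)
    return lines

def startup_diff(before, after):
    """Return unified-diff lines; an empty list means no difference."""
    return list(difflib.unified_diff(normalize(before), normalize(after),
                                     "before", "after", lineterm="", n=0))

# A fragment of the device probe messages from the output above.
before = """
    rz0 at scsi0 target 0 lun 0 (LID=0)
     _(DEC RZ26F (C) DEC 630J)
    rz1 at scsi0 target 1 lun 0 (LID=1)
     _(DEC RZ29B (C) DEC 0016)
"""
# Same text with different indentation: should compare equal.
after_same = before.replace("    rz", "rz")
# Hypothetical case where rz1 failed to probe after the panic.
after_missing = "\n".join(before.splitlines()[:3])

print(startup_diff(before, after_same))
print(startup_diff(before, after_missing))
```

An empty list for the first comparison means the two boots probed the same
devices; a disk that stopped answering would show up as a removed line.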

#
# Contents of /var/adm/syslog.dated/04-Mar-17:04/kern.log
#
# Edited out messages produced by the system restarting.
#

Mar 4 17:04:29 alpha1 vmunix: Alpha PC machine check type 0x670.
Mar 4 17:04:29 alpha1 vmunix: Machine check abort
Mar 4 17:04:29 alpha1 vmunix: retry = 0x0
Mar 4 17:04:29 alpha1 vmunix: mchk_code = 0x92
Mar 4 17:04:29 alpha1 vmunix: paltemp[1] = 0x1fb0
Mar 4 17:04:29 alpha1 vmunix: paltemp[2] = 0x4
Mar 4 17:04:29 alpha1 vmunix: paltemp[3] = 0x4907f628
Mar 4 17:04:29 alpha1 vmunix: paltemp[4] = 0x199
Mar 4 17:04:29 alpha1 vmunix: paltemp[5] = 0x0
Mar 4 17:04:29 alpha1 vmunix: paltemp[6] = 0x32bcc8
Mar 4 17:04:29 alpha1 vmunix: paltemp[7] = 0x4200
Mar 4 17:04:29 alpha1 vmunix: paltemp[8] = 0x400
Mar 4 17:04:29 alpha1 vmunix: paltemp[9] = 0x0
Mar 4 17:04:29 alpha1 vmunix: paltemp[10] = 0x441530
Mar 4 17:04:29 alpha1 vmunix: paltemp[11] = 0x0
Mar 4 17:04:29 alpha1 vmunix: paltemp[12] = 0x4418d0
Mar 4 17:04:29 alpha1 vmunix: paltemp[13] = 0x441900
Mar 4 17:04:29 alpha1 vmunix: paltemp[14] = 0x441960
Mar 4 17:04:29 alpha1 vmunix: paltemp[15] = 0x4416d0
Mar 4 17:04:29 alpha1 vmunix: paltemp[16] = 0x4413a0
Mar 4 17:04:30 alpha1 vmunix: paltemp[17] = 0x0
Mar 4 17:04:30 alpha1 vmunix: paltemp[18] = 0x1fffea40
Mar 4 17:04:30 alpha1 vmunix: paltemp[19] = 0x88c0ba38
Mar 4 17:04:30 alpha1 vmunix: paltemp[20] = 0x5a4ca0
Mar 4 17:04:30 alpha1 vmunix: paltemp[21] = 0x0
Mar 4 17:04:30 alpha1 vmunix: paltemp[22] = 0x727a7a7a
Mar 4 17:04:30 alpha1 vmunix: paltemp[23] = 0x1f2114
Mar 4 17:04:30 alpha1 vmunix: paltemp[24] = 0x0
Mar 4 17:04:30 alpha1 vmunix: paltemp[25] = 0x10000
Mar 4 17:04:30 alpha1 vmunix: paltemp[26] = 0x0
Mar 4 17:04:30 alpha1 vmunix: paltemp[27] = 0x0
Mar 4 17:04:30 alpha1 vmunix: paltemp[28] = 0x1d42000
Mar 4 17:04:30 alpha1 vmunix: paltemp[29] = 0x0
Mar 4 17:04:30 alpha1 vmunix: paltemp[30] = 0x59f508
Mar 4 17:04:30 alpha1 vmunix: paltemp[31] = 0x7089a38
Mar 4 17:04:30 alpha1 vmunix: exc_addr = 0x4408da
Mar 4 17:04:30 alpha1 vmunix: exc_sum = 0x0
Mar 4 17:04:30 alpha1 vmunix: msk = 0x0
Mar 4 17:04:30 alpha1 vmunix: pal_base = 0x14000
Mar 4 17:04:30 alpha1 vmunix: hirr = 0x0
Mar 4 17:04:30 alpha1 vmunix: hier = 0x14f0
Mar 4 17:04:30 alpha1 vmunix: mm_csr = 0x1630
Mar 4 17:04:30 alpha1 vmunix: va = 0x6170
Mar 4 17:04:31 alpha1 vmunix: biu_addr = 0x4374840
Mar 4 17:04:31 alpha1 vmunix: biu_stat = 0x2440
Mar 4 17:04:31 alpha1 vmunix: dc_addr = 0xffffffff
Mar 4 17:04:31 alpha1 vmunix: fill_adr = 0x4374840
Mar 4 17:04:31 alpha1 vmunix: dc_stat = 0x3
Mar 4 17:04:31 alpha1 vmunix: fill_syndrome = 0x80
Mar 4 17:04:31 alpha1 vmunix: bc_tag = 0x404300
Mar 4 17:04:31 alpha1 vmunix: coma_gcr = 0x7fb200b4
Mar 4 17:04:31 alpha1 vmunix: coma_edsr = 0x7fb2a1e0
Mar 4 17:04:31 alpha1 vmunix: coma_ter = 0x7fb23ff0
Mar 4 17:04:31 alpha1 vmunix: coma_elar = 0x7fb2ffff
Mar 4 17:04:31 alpha1 vmunix: coma_ehar = 0x7fb21ff7
Mar 4 17:04:31 alpha1 vmunix: coma_ldlr = 0x7fb2f937
Mar 4 17:04:31 alpha1 vmunix: coma_ldhr = 0x6fb10000

Mar 4 17:04:31 alpha1 vmunix: coma_base0 = 0x6fb10000
Mar 4 17:04:31 alpha1 vmunix: coma_base1 = 0x6fb10100
Mar 4 17:04:31 alpha1 vmunix: coma_base2 = 0x47ff0000
Mar 4 17:04:31 alpha1 vmunix: coma_cnfg0 = 0x47ff0049
Mar 4 17:04:31 alpha1 vmunix: coma_cnfg1 = 0x47ff0049
Mar 4 17:04:31 alpha1 vmunix: coma_cnfg2 = 0x47ff0000
Mar 4 17:04:31 alpha1 vmunix: epic_dcsr = 0x800e001d
Mar 4 17:04:31 alpha1 vmunix: epic_pear = 0x807e40
Mar 4 17:04:31 alpha1 vmunix: epic_sear = 0x173640
Mar 4 17:04:31 alpha1 vmunix: epic_tbr1 = 0x432000
Mar 4 17:04:31 alpha1 vmunix: epic_tbr2 = 0x0
Mar 4 17:04:31 alpha1 vmunix: epic_pbr1 = 0x8c0000
Mar 4 17:04:31 alpha1 vmunix: epic_pbr2 = 0x40080000
Mar 4 17:04:31 alpha1 vmunix: epic_pmr1 = 0x700000
Mar 4 17:04:31 alpha1 vmunix: epic_pmr2 = 0x3ff00000
Mar 4 17:04:31 alpha1 vmunix: epic_harx1 = 0x80000000
Mar 4 17:04:31 alpha1 vmunix: epic_harx2 = 0x0
Mar 4 17:04:31 alpha1 vmunix: epic_pmlt = 0xff
Mar 4 17:04:31 alpha1 vmunix: epic_tag0 = 0x802000
Mar 4 17:04:31 alpha1 vmunix: epic_tag1 = 0x800000
Mar 4 17:04:32 alpha1 vmunix: epic_tag2 = 0x806000
Mar 4 17:04:32 alpha1 vmunix: epic_tag3 = 0x812000
Mar 4 17:04:32 alpha1 vmunix: epic_tag4 = 0x814000
Mar 4 17:04:32 alpha1 vmunix: epic_tag5 = 0x803000
Mar 4 17:04:32 alpha1 vmunix: epic_tag6 = 0x801000
Mar 4 17:04:32 alpha1 vmunix: epic_tag7 = 0x807000
Mar 4 17:04:32 alpha1 vmunix: epic_data0 = 0x5c6
Mar 4 17:04:32 alpha1 vmunix: epic_data1 = 0x5c4
Mar 4 17:04:32 alpha1 vmunix: epic_data2 = 0x5ca
Mar 4 17:04:32 alpha1 vmunix: epic_data3 = 0x4374
Mar 4 17:04:32 alpha1 vmunix: epic_data4 = 0x617c
Mar 4 17:04:32 alpha1 vmunix: epic_data5 = 0x5c6
Mar 4 17:04:32 alpha1 vmunix: epic_data6 = 0x5c4
Mar 4 17:04:32 alpha1 vmunix: epic_data7 = 0x5ca
Mar 4 17:04:32 alpha1 vmunix: panic (cpu 0): Machine check - Hardware error
Mar 4 17:04:32 alpha1 vmunix: syncing disks... device string for dump = SCSI 0 6 0 0 0 0 0.
Mar 4 17:04:32 alpha1 vmunix: DUMP.prom: dev SCSI 0 6 0 0 0 0 0, block 131072
Mar 4 17:04:32 alpha1 vmunix: device string for dump = SCSI 0 6 0 0 0 0 0.
Mar 4 17:04:32 alpha1 vmunix: DUMP.prom: dev SCSI 0 6 0 0 0 0 0, block 131072
Mar 4 17:04:32 alpha1 vmunix:
# I killed the DUMP at this point. I was in a hurry, and didn't know whether it would
# be useful.
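
For anyone who wants to compare the register dump above against one from
another panic, the kern.log lines can be loaded into a dictionary. This is only
a sketch in Python; the function name parse_machine_check is made up for
illustration, and it assumes every register line has the "vmunix: name = 0x..."
shape seen above. It does not decode the registers; that would need the
hardware documentation for the CPU and chipset.

```python
import re

# Matches lines such as "... vmunix: paltemp[1] = 0x1fb0".
REGISTER_LINE = re.compile(
    r"vmunix:\s+(\w+(?:\[\d+\])?)\s*=\s*(0x[0-9a-fA-F]+)\s*$")

def parse_machine_check(log_text):
    """Return {register name: integer value} for each register line."""
    registers = {}
    for line in log_text.splitlines():
        match = REGISTER_LINE.search(line)
        if match:
            name, value = match.groups()
            registers[name] = int(value, 16)
    return registers

# A few lines copied from the kern.log excerpt above.
sample = """\
Mar 4 17:04:29 alpha1 vmunix: Alpha PC machine check type 0x670.
Mar 4 17:04:29 alpha1 vmunix: Machine check abort
Mar 4 17:04:29 alpha1 vmunix: mchk_code = 0x92
Mar 4 17:04:29 alpha1 vmunix: paltemp[1] = 0x1fb0
Mar 4 17:04:31 alpha1 vmunix: biu_addr = 0x4374840
Mar 4 17:04:31 alpha1 vmunix: fill_adr = 0x4374840
"""

registers = parse_machine_check(sample)
for name, value in registers.items():
    print(f"{name:12s} {value:#x}")
```

With two such dictionaries, registers that hold the same value across panics
(a repeated address, say) are easy to pick out. In the dump above, biu_addr and
fill_adr already hold the same address.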


--------------------------------------------------------------------------------
Tom Fenn (tom_at_spirit.gcrc.upenn.edu)
Mass Spectrometry Facility Engineer phone: 215/573-9878
Center for Cancer Pharmacology fax: 215/573-9889
University of Pennsylvania
Received on Thu Mar 05 1998 - 19:02:18 NZDT
