kernel memory fault - two clustered AS 4000s crashed

From: John Miller <ccjwm_at_jcu.edu.au>
Date: Wed, 6 Aug 1997 17:13:58 +1000 (EST)

Yesterday one of our AlphaServers crashed twice. This morning
both crashed within 30 seconds of each other, with the same error message.
The systems have been up and running all day since, but they are logging
CAM SCSI errors very frequently.

System Details are:

Digital UNIX v4.0B Rev. 564
TruCluster ASE v1.4A Rev. 95
Two AlphaServer 4000 5/466 4MB, each with 1 GB memory and dual CPUs
HSZ52 SCSI controller

We have logged a call with Digital Support, but thought someone else might
already know about this. I'll post the solution when we know it.


Some interesting-looking details in /var/adm/crash-data:
--------------------------------------------------------

_panic_string: 0xfffffc00006a5c60 = "kernel memory fault"

NFS3 RFS3_GETATTR failed for server ase-homep: RPC: Timed out
NFS3 RFS3_GETATTR failed for server ase-homeg: RPC: Timed out
NFS3 RFS3_GETATTR failed for server ase-homep: RPC: Timed out
NFS server: stale file handle fs(3522,460102) file 14755 gen 34070
 RFS3_GETATTR, client address = 137.219.16.117, errno 22
chk_blk_quota: user/group underflow

trap: invalid memory read access from kernel mode

    faulting virtual address: 0x0000000000000000
    pc of faulting instruction: 0xfffffc0000279060
    ra contents at time of fault: 0xfffffc0000279014
    sp contents at time of fault: 0xffffffffa90c3620

panic (cpu 1): kernel memory fault
device string for dump = SCSI 1 2000 0 0 0 0 0.
DUMP.prom: dev SCSI 1 2000 0 0 0 0 0, block 262144
device string for dump = SCSI 1 2000 0 0 0 0 0.
DUMP.prom: dev SCSI 1 2000 0 0 0 0 0, block 262144
        V51Z) (Wide16)
rzd18 at scsi2 target 2 lun 3 (LID=22) (DEC HSZ50-AX V51Z) (Wide16)
rz19 at scsi2 target 3 lun 0 (LID=27) (DEC HSZ50-AX V51Z) (Wide16)
rzb19 at scsi2 target 3 lun 1 (LID=28) (DEC HSZ50-AX V51Z) (Wide16)
processor at scsi2 target 7 lun 7 (LID=42) (DIGITAL) (Wide16)
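
A faulting virtual address of zero means the kernel tried to read through
a NULL pointer. The rough sketch below (Python, not anything supplied by
Digital) pulls the four trap values out of the excerpt above and checks
that, along with the distance between pc and ra. The names TRAP_TEXT and
parse_trap are made up for illustration; mapping the pc to a kernel
routine would still need a debugger run against the saved dump.

```python
import re

# Trap lines copied from the crash-data excerpt above.
TRAP_TEXT = """\
    faulting virtual address: 0x0000000000000000
    pc of faulting instruction: 0xfffffc0000279060
    ra contents at time of fault: 0xfffffc0000279014
    sp contents at time of fault: 0xffffffffa90c3620
"""


def parse_trap(text):
    """Return {label: integer value} for each 'label: 0x...' line."""
    fields = {}
    for match in re.finditer(r"^\s*(.+?):\s*(0x[0-9a-fA-F]+)\s*$", text, re.M):
        fields[match.group(1)] = int(match.group(2), 16)
    return fields


trap = parse_trap(TRAP_TEXT)

# A read from address 0 is a NULL pointer dereference.
null_read = trap["faulting virtual address"] == 0

# Distance in bytes between the faulting instruction and the return address.
pc_minus_ra = (trap["pc of faulting instruction"]
               - trap["ra contents at time of fault"])

print("NULL read:", null_read)
print("pc - ra:", hex(pc_minus_ra))
```

For the values in this dump, the two addresses are 0x4c bytes apart.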


From uerf :
-------------
********************************* ENTRY 936. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE               199.        CAM SCSI
SEQUENCE NUMBER             13.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Wed Aug 6 07:38:36 1997
OCCURRED ON SYSTEM                      barra
SYSTEM ID                   x00070016
SYSTYPE                     x00000000
PROCESSOR COUNT             2.
PROCESSOR WHO LOGGED        x00000001

----- UNIT INFORMATION -----

CLASS                       x001F       UNKNOWN
SUBSYSTEM                   x0000       DISK
BUS #                       x0000
                            x0000       LUN x0
                                        TARGET x0
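
To put a number on "very frequently", the uerf report can be saved as text
and its entries tallied by OS EVENT TYPE. The sketch below is only an
illustration: SAMPLE is a trimmed copy of the entry above, and
count_event_types is a hypothetical helper, not part of uerf. In practice
the string would be read from wherever the report text was saved.

```python
import re
from collections import Counter

# Trimmed copy of the uerf entry shown above, used as sample input.
SAMPLE = """\
********************************* ENTRY 936. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 199. CAM SCSI
SEQUENCE NUMBER 13.
OCCURRED/LOGGED ON Wed Aug 6 07:38:36 1997
OCCURRED ON SYSTEM barra
"""

# Matches e.g. "OS EVENT TYPE 199. CAM SCSI", whatever the column spacing.
EVENT_RE = re.compile(r"^OS EVENT TYPE\s+(\d+)\.\s+(.+?)\s*$", re.M)


def count_event_types(report_text):
    """Count report entries per (event type code, event type name)."""
    counts = Counter()
    for code, name in EVENT_RE.findall(report_text):
        counts[(int(code), name)] += 1
    return counts


counts = count_event_types(SAMPLE)
for (code, name), n in sorted(counts.items()):
    print(f"{code:>4} {name:<20} {n}")
```

Run over the full report, a high count for type 199 compared with the
other types would back up the impression that the CAM SCSI errors dominate.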



Regards
+---------------------------------------------------------------------+
| John Miller | Internet Mail - John.Miller_at_jcu.edu.au |
| Computer Centre |
| James Cook University of North Queensland | Phone: +61 77 815447 |
| Townsville, 4811, AUSTRALIA | Fax: +61 77 815230 |
+---------------------------------------------------------------------+
Received on Wed Aug 06 1997 - 09:36:10 NZST
