Yesterday one of our AlphaServers crashed twice. This morning
both crashed within 30 seconds of each other, with the same error message.
Both systems have been up and running all day since, but they are logging
CAM SCSI errors very frequently.
System Details are:
Digital UNIX v4.0B Rev. 564
TruCluster ASE v1.4A Rev. 95
Two AlphaServer 4000 5/466 4MB, each with 1 GB memory and dual CPUs
HSZ52 SCSI controller
We have logged a call with Digital Support, but thought someone else might
already know about this. I'll post the solution when we have one.
Some interesting-looking details from /var/adm/crash-data:
---------------------------------------------------------
_panic_string: 0xfffffc00006a5c60 = "kernel memory fault"
NFS3 RFS3_GETATTR failed for server ase-homep: RPC: Timed out
NFS3 RFS3_GETATTR failed for server ase-homeg: RPC: Timed out
NFS3 RFS3_GETATTR failed for server ase-homep: RPC: Timed out
NFS server: stale file handle fs(3522,460102) file 14755 gen 34070
RFS3_GETATTR, client address = 137.219.16.117, errno 22
chk_blk_quota: user/group underflow
trap: invalid memory read access from kernel mode
faulting virtual address: 0x0000000000000000
pc of faulting instruction: 0xfffffc0000279060
ra contents at time of fault: 0xfffffc0000279014
sp contents at time of fault: 0xffffffffa90c3620
panic (cpu 1): kernel memory fault
device string for dump = SCSI 1 2000 0 0 0 0 0.
DUMP.prom: dev SCSI 1 2000 0 0 0 0 0, block 262144
device string for dump = SCSI 1 2000 0 0 0 0 0.
DUMP.prom: dev SCSI 1 2000 0 0 0 0 0, block 262144
V51Z) (Wide16)
rzd18 at scsi2 target 2 lun 3 (LID=22) (DEC HSZ50-AX V51Z) (Wide16)
rz19 at scsi2 target 3 lun 0 (LID=27) (DEC HSZ50-AX V51Z) (Wide16)
rzb19 at scsi2 target 3 lun 1 (LID=28) (DEC HSZ50-AX V51Z) (Wide16)
processor at scsi2 target 7 lun 7 (LID=42) (DIGITAL) (Wide16)
From uerf:
-------------
********************************* ENTRY 936. *********************************
----- EVENT INFORMATION -----
EVENT CLASS ERROR EVENT
OS EVENT TYPE 199. CAM SCSI
SEQUENCE NUMBER 13.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Wed Aug 6 07:38:36 1997
OCCURRED ON SYSTEM barra
SYSTEM ID x00070016
SYSTYPE x00000000
PROCESSOR COUNT 2.
PROCESSOR WHO LOGGED x00000001
----- UNIT INFORMATION -----
CLASS x001F UNKNOWN
SUBSYSTEM x0000 DISK
BUS # x0000
x0000 LUN x0
TARGET x0
Regards
+---------------------------------------------------------------------+
| John Miller | Internet Mail - John.Miller_at_jcu.edu.au |
| Computer Centre |
| James Cook University of North Queensland | Phone: +61 77 815447 |
| Townsville, 4811, AUSTRALIA | Fax: +61 77 815230 |
+---------------------------------------------------------------------+
Received on Wed Aug 06 1997 - 09:36:10 NZST