I have a TruCluster 5.1A system connected to an HSG80 controller that
holds the root filesystem and several data raidsets. This weekend, the
system crashed with an error indicating a problem with a disk in one of my
raidsets. None of the disks on the HSG80 are flagged as having failed.
I had exported the filesets from one of the raidsets via NFS to a new
Linux server we're setting up, which has a large Exabyte tape library, and
was running a full backup of that raidset. The backup hung while reading one
of the filesets and the main server (a DS10) crashed. I booted the server
and it crashed again; I think the backup on the Linux server retried the
read when the DS10 came back up. At that point I shut down the Linux server,
rebooted the DS10 and unmounted the offending fileset. I then brought the
Linux server back up, mounted all but the offending fileset, and re-ran
my backup, which completed successfully. It looks like there is a hardware
problem on one of the disks in a raidset on the HSG80.
The crash log shows me this error:
_cpu: 57
_system_string: 0xffffffffffddc8b0 = "COMPAQ AlphaServer DS10 617 MHz"
_ncpus: 1
_avail_cpus: 1
_partial_dump: 1
_physmem(MBytes): 767
_panic_string: 0xfffffc0000a3a1a0 = "kernel memory fault"
_paniccpu: 0
_panic_thread: 0xfffffc002220e700
_preserved_message_buffer_begin:
Further on in the message log I see:
<3>drd_handle_eei: Device 68. errno 5Uninterpreted b_eei value 0x3400.
AdvFS I/O error:
Domain#Fileset: raid1#keck4
Mounted on: /keck4
Volume: /dev/disk/dsk6c
Tag: 0x00000255.8001
Page: 69061
Block: 99071168
Block count: 16
Type of operation: Read
Error: 5 (see /usr/include/errno.h)
EEI: 0x3400
AdvFS initiated retries: 0
Seconds from first I/O attempt to this failure: 15
Total AdvFS retries on this volume: 0
To obtain the name of the file on which
the error occurred, type the command:
/sbin/advfs/tag2name /keck4/.tags/597
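The 597 in that tag2name path appears to be the first field of the tag (0x00000255) written in decimal. A minimal Python sketch of that conversion follows, assuming the tag is always printed as a hex number, a dot, and a hex sequence field. The function name is hypothetical and only for illustration.

```python
def tag_to_tags_path(mount_point, tag):
    """Build the .tags path for an AdvFS tag printed as '0xNNNNNNNN.SSSS'.

    Assumption: the part before the dot is the tag number in hex and the
    part after the dot is a sequence field that the path does not use.
    """
    tag_num_hex, _, _sequence = tag.partition(".")
    tag_num = int(tag_num_hex, 16)  # int() accepts the 0x prefix with base 16
    return f"{mount_point}/.tags/{tag_num}"


print(tag_to_tags_path("/keck4", "0x00000255.8001"))
```

For the values in the log above this yields /keck4/.tags/597, which matches the path the kernel message suggests.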
I also got an E-mail from the Environmental monitoring system:
Formatted Message:
SCSI event
Event Data Items:
Event Name : sys.unix.binlog.hw.scsi._hwid.68
Priority : 700
PID : 524853
PPID : 524289
Event Id : 362940
Member Id : 1
Timestamp : 16-Apr-2006 10:51:27
Host IP address : 128.218.64.95
Cluster IP address: 128.218.64.31
Host Name : lehrer
Cluster Name : keckcenter
User Name : root
Format : SCSI event
Reference : cat:evmexp.cat:300
Variable Items:
_hwid (UINT64) = 68
subid_class (INT32) = 199
subid_num (INT32) = 4
subid_unit_num (INT32) = 277
subid_type (INT32) = 0
binlog_event (OPAQUE) = [OPAQUE VALUE: 1352 bytes]
============================ Translation =============================
Sequence number of error: 1471614601
Time of error entry: 16-Apr-2006 10:51:27
Host name: lehrer
SCSI CAM ERROR PACKET
SCSI device class: DISK
Bus Number: 4
Target number: 2
Lun Number: 5
Name of routine that logged the event: cdisk_complete
Event information: Status = CMP but resid not NULL
Software detected event: Possible Software Problem - Impossible Cond Detected
Event information: Hardware ID = 68
Device Name: DEC HSG80 V85F
Event information: Active CCB at time of error
Event information: CCB request completed w/out error
############### Entry End ###############
Event information: Error, exception, or abnormal condition
Event information: RECOVERED ERROR - Recovery action performed
############### Entry End ###############
======================================================================
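To make the host-side address in this event easier to compare against other listings, the bus, target and LUN can be pulled out of the translated text. This is only a sketch: the sample text is copied from the event above, the field labels are assumed to appear exactly as shown, and the function name is hypothetical.

```python
import re

# Excerpt copied from the translated EVM event above.
EVENT_TEXT = """\
SCSI device class: DISK
Bus Number: 4
Target number: 2
Lun Number: 5
Event information: Hardware ID = 68
"""

# Label text is an assumption based on this one event.
FIELD_PATTERNS = {
    "bus": r"^Bus Number:\s*(\d+)",
    "target": r"^Target number:\s*(\d+)",
    "lun": r"^Lun Number:\s*(\d+)",
    "hwid": r"Hardware ID\s*=\s*(\d+)",
}


def scsi_address(text):
    """Return the numeric fields found in a translated SCSI CAM error packet."""
    found = {}
    for key, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            found[key] = int(match.group(1))
    return found


print(scsi_address(EVENT_TEXT))
```

Note that these values describe how the host addresses the device it sees (reported here as a DEC HSG80), not necessarily an individual disk behind the controller.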
My question is: can I fail the offending disk and have the raidset
reconstruct onto a spare disk? Or is this problem more serious? I
gather that the file where the error occurred was put there years ago
and is likely never accessed. How do I identify the offending disk,
and how do I correlate the EVM SCSI error with the raidset disk definitions
on the HSG80 controller? It reports Target 2 Lun 5, but the definition
for the raidset on my HSG80 is:
HSG80> show keck1
Name          Storageset                     Uses             Used by
------------------------------------------------------------------------------
KECK1         raidset                        DISK10000        D21
                                             DISK10100
                                             DISK10200
                                             DISK20000
                                             DISK20100
                                             DISK20200
                                             DISK30000
                                             DISK30500
                                             DISK40000
                                             DISK40100
                                             DISK50000
                                             DISK50100
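One way to read the member names, assuming they follow the default HSG80 naming convention DISKpttll (device-side port p, target tt, LUN ll), is sketched below in Python. The convention is an assumption here, since disks can be renamed, and the function name is hypothetical.

```python
import re


def decode_hsg80_disk(name):
    """Split a name like 'DISK30500' into device-side port, target and LUN.

    Assumption: the name uses the DISKpttll pattern (one digit for the
    port, two for the target, two for the LUN).
    """
    match = re.fullmatch(r"DISK(\d)(\d{2})(\d{2})", name)
    if match is None:
        raise ValueError(f"not in DISKpttll form: {name!r}")
    port, target, lun = (int(group) for group in match.groups())
    return {"port": port, "target": target, "lun": lun}


members = [
    "DISK10000", "DISK10100", "DISK10200", "DISK20000",
    "DISK20100", "DISK20200", "DISK30000", "DISK30500",
    "DISK40000", "DISK40100", "DISK50000", "DISK50100",
]

for name in members:
    print(name, decode_hsg80_disk(name))
```

If that convention applies, these names describe positions on the controller's device-side buses, so they need not match the host-side Target 2 Lun 5 in the EVM event.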
Lastly, this morning I connected to my HSG80 controller console while
gathering the information for this e-mail, and at some point it produced
several error messages: mostly "Aborted Command" errors for several disks,
plus several "Medium Error" messages for one particular disk (which did not
report any "Aborted Command" messages). After that slew of messages,
nothing more has appeared for several minutes.
Thanks for any help.
Dirk
Received on Mon Apr 17 2006 - 15:51:02 NZST