Hi Managers,
The system in question is:
OSF1 V5.0 1094 alpha
I've just received a whole bunch of
"EVM ALERT [700]:SCSI event"
emails from the above system and a lot of messages in the
console log of the form:
"AdvFS I/O error"
The AdvFS errors all seem to be read errors, even though I know
a lot of writing is being done to the disks while these errors
are occurring (database indexing is under way, which is generating
about 5 GB of data spread over about 500 files).
Below are the details of one of the events (priority 700) from
the Event logger, but there are a lot of lower-priority events
occurring at the same time. My setup here is that I'm running AdvFS
domains on top of LSM, so I'm seeing AdvFS error messages, LSM
warnings and SCSI event messages. The LSM disk group is made up of
five 35 GB disks in a RAID0 configuration (stripe width 5).
My problem is that I don't know where to start in working
out where the fault is. I think the problem may be with one
of the disks, as I see that occasionally a single light (the top
one) stays lit on one of the disks. Is it just a matter of
confirming that this is the problem and then replacing the disk?
What is the best way to confirm a disk error?
Also, assuming that it is a single-disk problem, what does this
mean in terms of my filesystem configuration? To replace a disk,
do I need to fully delete the current configuration and then
rebuild it from the bottom up (i.e., set up the 5-disk LSM group
again, then re-create the AdvFS domains, etc.) and
finally restore from backup? Is this the only solution?
Many thanks,
- Michael
PS: My five-disk RAID0 configuration is made up of two 7500 rpm
disks and three 10000 rpm disks. I was told by the salesman that
this was OK. Is it true?
======================= Binary Error Log event =======================
EVM event name: sys.unix.binlog.hw.scsi
Binary error log events are posted through the binlogd daemon, and
stored in the binary error log file, /var/adm/binary.errlog. This
event is used to report all SCSI device errors, including disk,
tape, HSZ raid events, and adapter errors.
======================================================================
Formatted Message:
SCSI event
Event Data Items:
Event Name : sys.unix.binlog.hw.scsi
Priority : 700
Timestamp : 03-Aug-2000 12:51:56
Host IP address : 172.18.5.4
Host Name : raphael
Format : SCSI event
Reference : cat:evmexp.cat:300
Variable Items:
subid_class = 199
subid_num = 0
subid_unit_num = 32
subid_type = 34
binlog_event = [OPAQUE VALUE: 352 bytes]
============================ Translation =============================
Sequence number of error: 34472747
Time of error entry: 03-Aug-2000 12:51:56
Host name: raphael
SCSI CAM ERROR PACKET
Controller type: DISK
SCSI device class: DEC SIM
Bus Number: 0
Target number: 4
Lun Number: 0
Name of routine that logged the event: ss_abort_done
Event information: SCSI abort tag has been performed
======================================================================
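
One rough way to narrow this down, assuming the translated events can be
saved as plain text in the same layout as the packet above, would be to
count packets per bus/target/LUN and see whether one device dominates.
The Python below is only a sketch under that assumption:
tally_scsi_events is a made-up helper name, not a Tru64 tool, and the
field labels are simply copied from the translation shown here.

```python
import re
from collections import Counter

# Field labels copied from the translated "SCSI CAM ERROR PACKET" text.
# Assumption: every saved event uses exactly this layout.
FIELD_RE = re.compile(
    r"^\s*(Bus Number|Target number|Lun Number):\s*(\d+)\s*$",
    re.MULTILINE,
)

PACKET_HEADER = "SCSI CAM ERROR PACKET"


def tally_scsi_events(log_text):
    """Count translated SCSI CAM error packets per (bus, target, lun).

    Hypothetical helper: a field missing from a packet is recorded as None.
    """
    counts = Counter()
    # Split on the packet header so fields from different events never mix;
    # the text before the first header is not a packet and is skipped.
    for packet in log_text.split(PACKET_HEADER)[1:]:
        fields = {name.lower(): int(value)
                  for name, value in FIELD_RE.findall(packet)}
        key = (fields.get("bus number"),
               fields.get("target number"),
               fields.get("lun number"))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = """\
SCSI CAM ERROR PACKET
Controller type: DISK
Bus Number: 0
Target number: 4
Lun Number: 0
"""
    for device, n in tally_scsi_events(sample).most_common():
        print(device, n)
```

If nearly all packets point at a single target (target 4 on bus 0 in the
event above), that would at least suggest which disk to look at first,
although it would not by itself prove the disk is at fault.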
Received on Thu Aug 03 2000 - 11:38:59 NZST