identifying failing disks from dr john on 2007-03-21 (tru64-unix-managers)

From: dr john <john_at_frumious.unidec.co.uk>
Date: Wed, 21 Mar 2007 02:18:44 +0000 (GMT)

This is not a question I'd normally bring up here, but I'm a bit stuck as
a) the machine in question is in another country and b) the only access I've
got to it is by getting someone who knows nothing about Tru64 to type in
whatever I tell them and email me the output. The system is an AS800 with a
BA356 disk array (7 disks apparently), running 4.0B. This is what I've been
able to gather from the uerf output (relevant bits only posted):
MESSAGE Alpha boot: available memory from
                                         _0x19d8000 to 0x3ffce000
                                        Digital UNIX V4.0B (Rev. 564); Sun
                                         _Mar 19 14:55:20 WET 2000
                                       scsi0 at isp0 slot 0
                                        rz1 at scsi0 target 1 lun 0 (LID=0)
                                         _(Quantum XP39100W LXTA)
                                         _(Wide16)
                                        rz4 at scsi0 target 4 lun 0 (LID=1)
                                         _(DEC RRD46 (C) DEC 0557)
                                        Initializing xcr0. Please wait....
                                        xcr0 at pci0 slot 13
                                        re0 at xcr0 unit 0 (unit status =
                                         _ONLINE, raid level = 5)
                                         (WRITE BACK cache operation SUPPORTED
                                         _if battery backup enabled)
                                        re1 at xcr0 unit 1 (unit status =
                                         _CRITICAL, raid level = 5)
                                         (WRITE BACK cache operation SUPPORTED
                                         _if battery backup enabled)

...which worries me slightly. There then follow entries in the log at
roughly 10-second intervals, like this:
----- EVENT INFORMATION -----

EVENT CLASS                   ERROR EVENT
OS EVENT TYPE                 199.      CAM SCSI
SEQUENCE NUMBER               2.
OPERATING SYSTEM              DEC OSF/1
OCCURRED/LOGGED ON            Sat Nov 18 13:06:01 2006
OCCURRED ON SYSTEM            eik
SYSTEM ID                     x0007001B
SYSTYPE                       x00000000

----- UNIT INFORMATION -----

CLASS                         x0022     DEC SIM
SUBSYSTEM                     x0000     DISK
BUS #                         x0000
                              x0008     LUN x0
                                        TARGET x1

This is obviously reporting bad blocks on a disk, and those blocks are being
accessed very often. My question is: am I right in assuming that the disk with
the problem is part of the RAID array re1? (I'd assume that bus 8, lun 0,
target 1 would be the first disk in the cabinet.) Also, is there any
definitive way of working out which RAID controller is in the machine? Not
being within 1000 miles of it makes it a bit difficult to ascertain visually,
and I can't easily get the machine taken offline for someone to check. Any
pointers would be very much appreciated.
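
Since the uerf output reaches me as emailed text, picking the RAID unit states out of a long dump by eye is error-prone. Below is a minimal Python sketch of one way to do it, assuming the boot message is wrapped the way it is in the excerpt above, with continuation lines starting with "_". The function names and the regular expression are my own illustration, not part of uerf or any Tru64 tool, and the pattern is only checked against the two "re" lines quoted in this post.

```python
import re

# Matches lines such as:
#   re1 at xcr0 unit 1 (unit status = CRITICAL, raid level = 5)
# The pattern is an assumption based only on the excerpt in this post.
UNIT_RE = re.compile(
    r"(re\d+) at (\w+) unit (\d+) \(unit status =\s*(\w+), raid level = (\d+)\)"
)

# Excerpt copied from the boot message quoted above.
SAMPLE = """\
xcr0 at pci0 slot 13
re0 at xcr0 unit 0 (unit status =
 _ONLINE, raid level = 5)
 (WRITE BACK cache operation SUPPORTED
 _if battery backup enabled)
re1 at xcr0 unit 1 (unit status =
 _CRITICAL, raid level = 5)
 (WRITE BACK cache operation SUPPORTED
 _if battery backup enabled)
"""


def join_wrapped(text):
    """Rejoin lines that were wrapped with a leading '_' marker."""
    joined = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("_") and joined:
            # Continuation: append to the previous logical line.
            joined[-1] += " " + line[1:]
        else:
            joined.append(line)
    return joined


def raid_units(text):
    """Return (unit, controller, unit number, status, raid level) tuples."""
    units = []
    for line in join_wrapped(text):
        match = UNIT_RE.search(line)
        if match:
            name, ctrl, unit, status, level = match.groups()
            units.append((name, ctrl, int(unit), status, int(level)))
    return units


def degraded(text):
    """Names of units whose reported status is anything other than ONLINE."""
    return [u[0] for u in raid_units(text) if u[3] != "ONLINE"]


if __name__ == "__main__":
    for unit in raid_units(SAMPLE):
        print(unit)
    print("not ONLINE:", degraded(SAMPLE))
```

This only reports what the controller printed at boot time. It says nothing about which physical disk inside the array is at fault.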

Regards
John
Received on Wed Mar 21 2007 - 02:11:54 NZST
