SUMMARY: Error in HSZ40 from Bompreco on 1997-10-10 (tru64-unix-managers)

From: Bompreco <super3_at_svn.com.br>
Date: Thu, 09 Oct 1997 12:13:49 -0300

The problem:
>
> I have an HSZ40 cluster and I had a serious I/O problem with my
> database (Informix) in a defined chunk (the rza8f partition). I found
> this message in my /var/adm/syslog.dated:
>
> daemon.log:Sep 25 16:00:10 UXfinanceiro DECsafe: UXfinanceiro Agent
> ***ALERT: hard device error on /dev/rza8f from
> UXfinanceiro.supermar.com.br
>
> The message points to a hardware error in the volume rza8, which is a
> logical volume of 6 rz29b 4.3 GB disks grouped in a raidset (RAID
> 5), so I can't tell on which physical device the error occurred. The
> cluster only wrote the log message below, and didn't identify the
> device:
> Instance Code: 01010302
> Description: An unrecoverable hardware detected fault occurred.
> Reporting Component: 1.(01)
> Description:Executive Services
> Reporting component's event number: 1.(01)
> Event Threshold: 2.(02)
> Classification: HARD. Failure of a component that affects controller
> performance or precludes access to a device connected to the
> controller is indicated.
> Last Failure Code: 018800A0 (No Last Failure Parameters)
> Last Failure Code: 018800A0
> Description: A processor interrupt was generated with an indication
> that the program card was removed.
> My immediate workaround was to stop using the rza8f partition in the
> database. But I'm losing 2 GB (the size of rza8f) and I still can't
> identify the physical device with the problem.

I couldn't find any message that identified the physical device with the
hardware error. I tried all the logs, uerf, the HSZ40 FMU (Fault Manager
Utility) and the hszterm commands show disks, show failedsets, show
<everything>, and the HSZ40 didn't flag any device as failed (by
flashing the error LED).
I spoke with DEC Support (Hardware/Software) and we checked the firmware
of the cluster and the disks (rz29b-va), all my definitions and everything
else... with no results.

I had to use the DECevent software (translator module), with this
command:
   dia -t s:25-sep-1997:08:00:00 e:25-sep-1997:14:00:00
Output below:

RAIDSET State x00 NORMAL. All members present and
                                     reconstructed, IF LUN is
                                     configured as a RAIDSET.

Error Count 1.
Retry Count 0.
Most Recent ASC x80
Most Recent ASCQ x00
Next Most Recent ASC x00
Next Most Recent ASCQ x00
Device Locator x000003 Port = 3.
                                     Target = 0.
                                     LUN = 0.
Command Opcode x28 Read (10 byte)
Original CDB
--------------------------------------------------------------
With this information I ran:

CLI> locate ptl 3 0 0
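
For anyone who wants to script this step, here is a minimal sketch in Python
that pulls the Port, Target and LUN values out of the translated DECevent
text and builds the matching locate command. It assumes the fields appear
exactly as in the output quoted above ("Port = 3." and so on); other
DECevent versions may format them differently. The function names
ptl_from_decevent and locate_command are hypothetical helpers, not part of
any DEC tool.

```python
import re


def ptl_from_decevent(text):
    """Return (port, target, lun) parsed from translated DECevent output.

    Assumes the "Port = N.", "Target = N." and "LUN = N." layout shown
    in the output quoted in this message.
    """
    values = []
    for name in ("Port", "Target", "LUN"):
        match = re.search(rf"\b{name}\s*=\s*(\d+)", text)
        if match is None:
            raise ValueError(f"{name} not found in DECevent output")
        values.append(int(match.group(1)))
    return tuple(values)


def locate_command(port, target, lun):
    """Build the HSZ40 CLI command used above to locate the disk."""
    return f"locate ptl {port} {target} {lun}"


# Sample taken from the Device Locator lines of the output above.
sample = """\
Device Locator x000003 Port = 3.
                                     Target = 0.
                                     LUN = 0.
"""

print(locate_command(*ptl_from_decevent(sample)))
```

The script only prints the command; it still has to be typed at the
controller's CLI prompt by hand.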

So we replaced the disk and the problem disappeared.
We really don't know why the cluster didn't flag the device as
failed, but if I get any additional information I will summarize again.

Thanks to all who helped me out!
And sorry for my poor English.

Alberto Camardelli
camardel_at_svn.com.br
Received on Thu Oct 09 1997 - 16:56:33 NZDT
