MORE problems with RZ 58 from Lucio Chiappetti on 1997-11-18 (tru64-unix-managers)

From: Lucio Chiappetti <lucio_at_ifctr.mi.cnr.it>
Date: Mon, 17 Nov 1997 17:42:46 +0100 (MET)

This is a follow-up to my previous mail when ...

On Thu, 13 Nov 1997, Lucio Chiappetti wrote:
> We have an Alpha 200/100 on which we have attached some disks recycled from
> our previous Ultrix systems (four RZ58 in two cabinets). The CPU is under
> maintenance contract, but the old disks are not
  [because of excessive cost of maintenance fees for such disk model]

  [message occurring during reboot]

> cam_logger CAM_ERROR packet bus 0 target 2 lun 0
> ss_perform_timeout
> (repeated 3-4 times)
> Reached max abort count, scheduled bus reset

I left the BA42 cabinet with the two disks rz2 and rz3 (both RZ58) off for the
weekend since then. This morning I checked uerf -R and found a burst of "event
type 199 CAM SCSI" errors on the day when the problem occurred, and a couple
the day before. All of them were related to the 'rz2' disk.

This morning I tried to reboot the machine, and I verified again that :

   - a "show device" at ROM level lists the disks
   - the boot sequence lists the disks

Then I booted single user, and did a /sbin/bcheckrc. This time I obtained
errors ALSO ON THE OTHER DISK IN THE SAME CABINET.

     CAM error packet bus 0 target 3
     cdisk_check_sense
     Medium error - bad block 1373633
     Hard error detected
     DEC RZ58
     Active CCB at time of error (what does this mean ?)
     Medium error not recovered
     ...
     rrz3c cannot read blk .... run fsck manually

Followed by the usual sequence of errors on the other disk rz2.
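
To see at a glance which targets are reporting, the console messages can be tallied per bus and target. What follows is only a minimal sketch in Python, assuming the messages have been saved as plain text lines; the function name tally_cam_errors is hypothetical, and the pattern covers only the two spellings quoted in this message.

```python
import re
from collections import Counter

# Matches the two message spellings quoted in this mail:
#   "cam_logger CAM_ERROR packet bus 0 target 2 lun 0"
#   "CAM error packet bus 0 target 3"
# Any other console format is NOT covered (assumption).
CAM_LINE = re.compile(r"CAM[_ ]ERROR packet bus (\d+) target (\d+)", re.IGNORECASE)


def tally_cam_errors(lines):
    """Count CAM error packets per (bus, target) pair."""
    counts = Counter()
    for line in lines:
        match = CAM_LINE.search(line)
        if match:
            bus, target = int(match.group(1)), int(match.group(2))
            counts[(bus, target)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "> cam_logger CAM_ERROR packet bus 0 target 2 lun 0",
        "> ss_perform_timeout",
        "CAM error packet bus 0 target 3",
        "Medium error - bad block 1373633",
    ]
    for (bus, target), n in sorted(tally_cam_errors(sample).items()):
        print(f"bus {bus} target {target}: {n} error packet(s)")
```

Lines that are not CAM error packets (such as the timeout or bad block lines) are simply skipped.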

I tried a full fsck on disk rz3, and it came out full of other errors. I ran
it in -y mode to answer yes to all "repair" questions.
This did not help. At the next boot, disk rz3 again gave the same sort of
errors, and another fsck -y had no effect.

We checked very carefully all cablings inside and outside the cabinet,
replaced the external SCSI cable with a different one, removed one disk
at a time from the BA42 cabinet, tried them with different SCSI addresses,
even tried with a spare BA42.

Nothing helped. We are pretty sure we can exclude a problem in the SCSI
controller on the CPU or on the bus (we originally had a chain rz0 (internal)
--> BA42 with rz2+rz3 --> BA42 with rz1+rz5) and also in the cabinet (which,
on the other hand, is so simple: just a power supply and a bunch of cables).

It is extremely curious that two disks inside the same cabinet failed more or
less at the same time !!!

However the symptoms are different. When we opened the cabinet we found one
disk was warm, and the other one cool. We found out that "rz2" (the one which
gives repeated timeouts during boot) is the cool one; probably it does not
even spin up to operational speed. The warm one is "rz3", the one which
gives bad sectors, but is visible to the system.

How warm should an RZ58 disk be during normal operation ?

We also made further attempts :

  - run newfs on rz3. No errors during newfs, but a long sequence of
    sector errors in subsequent fsck.
    Does fsck do a full formatting ?

  - remove the controllers (or at least what we think are the controllers,
    the electronics card underneath the RZ58) and swap them (we thought
    rz2 had a motor problem and a good controller, and rz3 perhaps a bad
    controller)

  - rz3 with the new controller still gives sector errors, and a newfs
    behaves as above.

  - we finally tried to run "scu" and issue a "format" command to
    rz3 (in case newfs does not do a full formatting). We were
    unsure of what to specify, so we tried "format defects all", "format",
    "format defects primary" and "format defects none". In all cases
    it goes on for quite a while and terminates with

     format unit failed EIO (5) i/o error
     sense key 0x3 MEDIUM ERROR non recovered
     sense code/qualif 0x32,0 no defect spare location

  - a test selftest or test memory from scu is OK, but
    a test drive or test controller is unsupported (??) and gives a

      SCSI SEND_DIAGNOSTIC failed
      EIO (5) i/o error
      sense key 0x5 illegal request
      illegal request or CDB parameter
      sense code qualifier 0x24
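
As a reading aid for the two scu failures above, here is a minimal sketch in Python that maps the reported sense data to the standard SCSI-2 wording. The tables cover only the codes quoted in this message, the function name describe_sense is hypothetical, and the qualifier of the second error is assumed to be 0x00 because the output above does not state it separately.

```python
# Lookup tables limited to the codes quoted in this message.
# Wording follows the standard SCSI-2 sense key and ASC/ASCQ assignments.
SENSE_KEYS = {
    0x3: "MEDIUM ERROR",
    0x5: "ILLEGAL REQUEST",
}
ASC_ASCQ = {
    (0x32, 0x00): "NO DEFECT SPARE LOCATION AVAILABLE",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}


def describe_sense(key, asc, ascq=0x00):
    """Return a one-line description of a sense key plus ASC/ASCQ pair."""
    key_text = SENSE_KEYS.get(key, "unknown sense key")
    code_text = ASC_ASCQ.get((asc, ascq), "unknown additional sense code")
    return (
        f"key 0x{key:X} ({key_text}), "
        f"asc/ascq 0x{asc:02X}/0x{ascq:02X} ({code_text})"
    )


if __name__ == "__main__":
    # "format" failure on rz3
    print(describe_sense(0x3, 0x32, 0x00))
    # "test drive" / "test controller" failure (ASCQ 0x00 is an assumption)
    print(describe_sense(0x5, 0x24))
```

If this reading is right, the first result would suggest that rz3 has no spare locations left for reassigning defective blocks, which would be consistent with the format never completing. The second only says that the drive rejected a field in the diagnostic command, which fits the test being unsupported rather than indicating a further fault.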

Should we call it a day, and consider BOTH disks irrecoverable (and buy newer
ones at the price we did not pay for maintenance) ?
Or is there anything else we can do ?

----------------------------------------------------------------------------
Lucio Chiappetti - IFCTR/CNR - via Bassini 15 - I-20133 Milano (Italy)
----------------------------------------------------------------------------
Fuscim donca de Miragn E tornem a sta scio' in Bregn
Che i fachign e i cortesagn Magl' insema no stagn begn
Drizza la', compa' Tapogn (Rabisch, II 41, 96-99)
----------------------------------------------------------------------------
For more info : http://www.ifctr.mi.cnr.it/~lucio/personal.html
----------------------------------------------------------------------------
Received on Mon Nov 17 1997 - 18:08:21 NZDT
