SUMM: RAID 5 failures with KZPAC & RZ1CB

From: Joao Rochate <jrochate_at_ualg.pt>
Date: Wed, 17 Mar 1999 19:00:14 +0000

NOTICE: This message is about failing RAID disks.
It is a long message, but fully describes the problem, tests and solution.
        ^^^^^^^^^^^^

------8<--------8<------------8<-----------8<----------8<----------8<----
1. Description of the problem
1.1 Message sent to list
2. Possible reasons
3. Solution found
4. Thanks

1.
I bought a PCI RAID card and 2 UW blue disks from Compaq.
I already had a shelf and 3 UW disks, all green.
After a while, the disks started failing repeatedly, until the RAID itself
failed and I had to rebuild it.
1.1
>I've just bought a KZPAC-CB with battery backup (3-channel RAID) and 2
DS-RZ1CB-VW (4.3 GB UW, 40 MB/s - blue ones).
>I unmounted the BA356-SB (StorageWorks) from the 4100 and made a split bus.
>Already on the machine were 2 RZ29B-VW (4.3 GB UW, 20 MB/s - green ones) and
1 RZ1CB-VW (also blue).
>
>So I had a RAID card with 3 channels and five 4.3 GB HDs to play with. I set
up a RAID 5 Logical Drive 0 on Channel 0 and Channel 1.
>
>Re-installed the system and everything seemed ok!
>
>One day later, one of the blue disks failed. I hadn't installed the
SWXCR monitor yet. Rebuilt it.
>Installed the monitor, because I wasn't expecting it to fail so soon!!
>After 4 days, the same drive failed again. Rebuilt it. After 2 hours the same
drive failed. Again rebuilt...
>
>Some minutes later, another blue drive had misc errors.
>After 4 hours, that drive failed!
>
>And after some hours the blue drive that had the latest misc error also
failed.
>
>And so, the system went down with 2 drives from the RAID 5 failed!!
>
>Well, after some work I made the system optimal again. There is still one
drive that formats OK, but fails when rebuilding.

>


2.
There were several opinions:
* (mine) Speed of the new disks
* Bad jumper configuration
* Software/firmware issues
* Lack of blue 180 Watt power supplies
* Disk firmware
* Speed of the bus
* Shelf fan
* Shelf does not support UW disks

3.
AND THE SOLUTION WAS.......

I lowered the speed of the bus to 10 MHz and it has NEVER failed again!
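
To put the speed change in numbers, here is a rough sketch of the peak
transfer rate of a wide (16-bit) SCSI bus at the two clock rates. It assumes
the usual rule of thumb that peak rate is the bus clock times the bus width
in bytes; the function name peak_mb_per_s is made up for this illustration.
The results line up with the 40 MB/s (blue) and 20 MB/s (green) ratings
quoted above.

```python
# Illustrative arithmetic only: peak synchronous SCSI transfer rate is
# roughly (bus clock in MHz) x (bus width in bytes).
def peak_mb_per_s(clock_mhz, width_bits):
    """Return the approximate peak transfer rate in MB/s."""
    return clock_mhz * (width_bits // 8)

WIDE_BITS = 16  # the "W" in UW: a 16-bit wide bus

for label, mhz in (("20 MHz (Ultra)", 20), ("10 MHz (Fast)", 10)):
    print(f"{label}: {peak_mb_per_s(mhz, WIDE_BITS)} MB/s peak")
```

So lowering the bus to 10 MHz halves the peak rate of the channel, which is
the price paid for a stable array on this shelf.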

4. Thanks go to:
Especially: Eric <Eric.Rostetter_at_utoledo.edu>
By order of appearance:
Neil Dyce <Neil.Dyce_at_bristol.ac.uk>
Simon Greaves <Simon.Greaves_at_usp.ac.fj>
"Partin.Kevin" <KPartin_at_hou.mdc.com>
John Seel <john.seel_at_us.faulding.com>
"Holmberg, Viktor" <Viktor.Holmberg_at_abnamro.co.uk>
Michael Polnick <polnick_at_pdv-sachsen.net>

=========================================================================

Date: Mon, 08 Feb 1999 08:18:38 -0500 (EST)
From: Eric <Eric.Rostetter_at_utoledo.edu>
Subject: Re: RAID 5 failures with KZPAC & RZ1CB
To: jrochate_at_ualg.pt

I had the same problem. Almost exactly. The problem was the speed setting
in the raid configuration for the channels going to non-ultrascsi certified
drive bays. I reset the channels that went to non-ultrascsi certified
bays to 10 MHz, leaving those that went to ultrascsi certified bays at
"MAX" and everything has been good since then.

I would suspect that your ba-356 isn't ultrascsi compliant, *OR* that you
have too long a cable (almost impossible if this is really
ultrascsi compliant and you use the 68 pin ultrascsi cables).

Try setting the channel speed to 10 MHz, or buy a new ultrascsi cabinet
like the ba356-kf. My experience is the ba356-kf works great. The other
option is to get a "Scsi buddy" or other bus extender from DEC and see
if that helps. You might need one that goes between the controller and
the shelf (like the scsi buddy) rather than one that is an SBB since the
signal problem *probably* occurs before the shelf.


> DEC sent a person several times to fix the problem. After 15 days without
> success, DEC just put in 2 used RZ29B (green ones) and the system has NEVER
> had a problem since then..

That would do it, since it would not then run at more than 10 MHz.

> Maybe that's what I should do with mine?!

You'll probably get slightly better performance by keeping the newer drives
and setting the bus to 10 MHz. You'd get much better performance with the
new drives and a ba356-kf, or by getting a "scsi buddy" or other bus
extender.

> Please, if someone has had an experience like this one, e-mail me and I will
> SUMM it ..
>
> This is a very BAD thing from Compaq and we are very unhappy with it...
> The used drives are still here, and the 4100s are near-critical machines.

Yep. Compaq claims the new drives are direct replacements for the old ones.
But they forget about making sure the speeds of the drives are correct for
the controller and bus.

My advice: go into the rcu and see what the channel speeds are set to. If
they are "MAX" or above 10 MHz, slow them to 10 MHz. That means down time, of
course, as the rcu must be run with the system down, but the whole thing
(on one of my systems) takes about 15 minutes (shutdown, boot rcu, make
change, save changes, reboot).

--------->8------------------>8------------------>8--------------->8---------

Cheers to all,
                                Joao Rochate

-------------------------------------------------------
Joao Pedro Rochate | EMail: jrochate_at_ualg.pt
Servicos de Informatica | URL: w3.ualg.pt/~jrochate
Universidade do Algarve | Phone: +351 (0)89 800 961
8000 Gambelas - FARO | ISDN: +351 (0)89 860 125
P O R T U G A L (pt) | GSM: +351 (0)931 950xxxx
-=[ http://www.ualg.pt ]=- | Fax: +351 (0)89 860 129
-------------------------------------------------------
Eng. de Sistemas e Computacao - UCEH - Univ. do Algarve
Received on Wed Mar 17 1999 - 19:03:42 NZDT