KZPSC Raid disaster

From: Tim W. Janes <janes_at_signal.dra.hmg.gb>
Date: Tue, 23 Apr 1996 22:56:17 +0100 (BST)

Hi,

A week ago I asked why I got the message "WRITE BACK cache operation
NOT SUPPORTED" when I booted with a battery-backed KZPSC PCI 3-channel
RAID controller.

The answer I received was that the system could not tell whether a
battery was installed, and that the message should read "WRITE BACK
cache NOT SUPPORTED unless BATTERY installed".

Well I found out that having WRITE BACK enabled is a very bad idea
even if a battery is installed - I completely lost the contents of 2
write back logical drives but one write through drive survived.
(Luckily there was nothing of value on the drives)

Here is what happened to 3 logical drives in our system

             Drive 0       Drive 1         Drive 3
             Raid 5        Raid 5          Raid 1
             Write Back    Write Through   Write Back
                                           (system disk /)

Carried out some performance testing to compare write through/write back
Unmounted drives 0 & 1
Shut down system
Changed drive 1 to write back
Booted system - drives 0 & 1 were NOT then mounted
Next day - shut all our other systems down to install this machine as fileserver

mistake 1 - left this machine in hung state with message NFS server host
not responding so was unable to shut it down.
mistake 2 - while in this state I powered off the disks some
30 mins before the system and controller.

On powering up - all disks were marked as FAILED - I used the standalone
utility to mark them back as OPTIMAL and ran a parity check on drive
3 - this claimed to have repaired > 100 blocks. The system would not
boot (No valid boot block) so I had to re-install DU on it.

After booting I used the online utility to run a parity check on drives
0 and 1 - drive 1 passed OK - drive 0 failed with the following

The following bad blocks were found and repaired:

  Block      # Bad
  Number     Blocks
---------- ----------
0000000000 0000000016

Drive 0 would not mount (no valid filesystem) and disklabel -r claimed
the disk was not labelled.
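
As a rough illustration of why a parity "repair" need not bring data
back: the sketch below assumes the repair simply recomputes parity from
whatever data blocks are on disk, which is how RAID 5 consistency checks
commonly work. It is not the KZPSC firmware's actual algorithm, and the
function names and the 3-disk stripe are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def stripe_is_consistent(data_blocks, parity_block):
    """True when the stored parity equals the XOR of the data blocks."""
    return xor_blocks(data_blocks) == parity_block

def repair_parity(data_blocks):
    """Assumed repair: recompute parity from the data currently on disk."""
    return xor_blocks(data_blocks)

# Hypothetical stripe across three data disks, with its parity block.
old_data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(old_data)

# Simulate a write that reached one disk while the parity update was lost.
on_disk = [b"\xff\xff", old_data[1], old_data[2]]
consistent_before = stripe_is_consistent(on_disk, parity)

# The repair makes the stripe self-consistent again, but it only blesses
# whatever is on disk; it cannot tell which block held the right data.
new_parity = repair_parity(on_disk)
consistent_after = stripe_is_consistent(on_disk, new_parity)
```

Under that assumption a check that reports blocks as "repaired" only
says parity and data agree again, not that the filesystem contents are
what was last written.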

There are two aspects I find worrying:

1) there is no means of testing the functionality or condition of the
battery.

2) One drive that got corrupted had been unmounted for over 24 hrs -
surely when a drive is unmounted it is made safe and all data is written
to disk. To a lesser extent, as there had been no attempt to write to
drive 3 for an hour, I would have expected that to be in a "safe" condition.

Does anyone actually use WRITE BACK cache?

If you use WRITE BACK cache, how can you ensure the cache is flushed? It
appears that a clean shutdown does not achieve this.
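
For reference, the host side of a flush can be sketched as below. This
is only an illustration in Python, with a made-up helper and file name:
it asks the operating system to push its own buffers to the device, and
it is an assumption (not something verified for the KZPSC) that the
controller then commits its write-back cache to the disks.

```python
import os
import tempfile

def write_and_flush(path, payload):
    """Write payload and ask the OS to push it out to the storage device."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()             # drain the userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to write its buffers to the device
    # Past this point the data may still sit in a controller write-back
    # cache; nothing in this code can see or flush that cache.

with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "flush_test.dat")  # hypothetical file name
    write_and_flush(target, b"important data")
    with open(target, "rb") as f:
        readback = f.read()
```

A successful read-back here only shows the host has the data; it says
nothing about what the controller has written to the physical disks.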

Has anyone else lost complete drives?

All write back caches are now turned off - I just hope that our data is safe.
I don't even want to think about restoring 20+ Gbytes from tape.

I have also noticed that if I crashed the system with the reset button
the RAID controller was not visible until a power cycle.

If it matters, we are running DU 3.2C and all filesystems are UFS.

Tim.

Tim Janes | e-mail : janes_at_signal.dra.hmg.gb
Defence Research Agency | tel : +44 1684 894100
Malvern Worcs | fax : +44 1684 895103
Gt Britain | #include <std/disclaim.h>
Received on Wed Apr 24 1996 - 00:23:09 NZST
