SUMMARY: RAID-0: block errors from Gyula Szokoly on 1997-04-09 (tru64-unix-managers)

From: Gyula Szokoly <szgyula_at_tarkus.pha.jhu.edu>
Date: Tue, 8 Apr 1997 23:01:46 -0400 (EDT)

Original question:

> In a RAID-0 array, I'm getting some block errors. How to fix it?
> In the 'good old days' one could use 'scu' to find and reassign
> the offending block (even on a hot system). What is the procedure
> for RAID? It's a 1000A, DUnix 4.0, 3 channel RAID controller card.
> I could attach all disks to a SCSI card and run SCU, but I can't
> afford the downtime. Any better suggestion?

  The only comment came from DEC, where Dr. Thomas P. Blinn tried to help
me. With RAID, life is interesting. As it seems:
1. A RAID controller pretty much does everything inside -- treat it as a
   black box. It will detect and reassign bad blocks, etc. whenever
   possible. You won't even notice it. This should be fine for redundant
   configurations (1, 5, 0+1).
2. When you are *not* in a redundant configuration (I have level-0), the
   controller cannot handle a read error (obviously -- with no redundancy
   it can't provide the requested data), so it will just report the
   condition to the OS. My message was (relevant part of it):

> Mar 27 01:05:14 taltos vmunix: AdvFS I/O error:
> Mar 27 01:05:14 taltos vmunix: Volume: /dev/re0c
> Mar 27 01:05:15 taltos vmunix: Page: 152
> Mar 27 01:05:15 taltos vmunix: Block: 346912
> Mar 27 01:05:15 taltos vmunix: Block count: 128
> Mar 27 01:05:15 taltos vmunix: Type of operation: Read
> Mar 27 01:05:15 taltos vmunix: Error: 5

3. As I understand it, the controller will handle write errors transparently
   (i.e. reassign bad sectors on the fly).
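
  For anyone who wants to watch for these reports in the system log, here
is a minimal Python sketch that collects the fields of the AdvFS error
quoted above into a dictionary and converts the block range to byte
offsets. The function names are made up for illustration, and the
512-byte block size is an assumption (the message itself does not state
the unit), so check it against your own configuration.

```python
import re

# Assumption: the "Block" and "Block count" fields are in 512-byte units.
ASSUMED_BLOCK_BYTES = 512

# Matches e.g. "Mar 27 01:05:15 taltos vmunix: Block: 346912"
FIELD_RE = re.compile(r"vmunix:\s*(?P<key>[^:]+):\s*(?P<value>.*)$")

SAMPLE = """\
Mar 27 01:05:14 taltos vmunix: AdvFS I/O error:
Mar 27 01:05:14 taltos vmunix: Volume: /dev/re0c
Mar 27 01:05:15 taltos vmunix: Page: 152
Mar 27 01:05:15 taltos vmunix: Block: 346912
Mar 27 01:05:15 taltos vmunix: Block count: 128
Mar 27 01:05:15 taltos vmunix: Type of operation: Read
Mar 27 01:05:15 taltos vmunix: Error: 5
"""


def parse_advfs_error(lines):
    """Collect the "key: value" fields of one AdvFS I/O error report."""
    record = {}
    for line in lines:
        match = FIELD_RE.search(line.rstrip())
        if not match:
            continue
        key = match.group("key").strip()
        value = match.group("value").strip()
        if not value:
            # The "AdvFS I/O error:" header line carries no value.
            continue
        record[key] = int(value) if value.isdigit() else value
    return record


def damaged_byte_range(record, block_bytes=ASSUMED_BLOCK_BYTES):
    """Return (first_byte, length_in_bytes) of the failed transfer."""
    start = record["Block"] * block_bytes
    length = record["Block count"] * block_bytes
    return start, length


if __name__ == "__main__":
    rec = parse_advfs_error(SAMPLE.splitlines())
    print(rec)
    print(damaged_byte_range(rec))
```

  This only reads the log text. It does not locate the damaged file; for
that, follow the instructions in the error message itself.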

  This implies what the procedure is: find the file (the error message
has instructions on how to find it), and *delete* it (it's damaged anyway).
I did this, and everything is back to normal (previously, defragment
died because of the error).
  There is one issue though: what happens if you have a 'write only'
sector (I had one such beast once)? The controller does not verify the
data written, so it will not detect the bad sector. It will detect it when
you try to read the data, but at that point it can't do anything.
What I'm not sure about (and DEC didn't get back to me on this): will
the controller 'remember' that there was a bad sector and reassign it
the next time you try to *write* that sector? If it doesn't, you are
in trouble. Under normal circumstances you either write or read a
sector, not both, so during the transaction the defect is either
not detected (write) *or* cannot be dealt with (read).
  It is also possible that the blocks get reassigned upon a read error, but
will be marked as containing invalid data (this would be a logical and
easy solution).
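
  To make the two possibilities concrete, here is a toy Python model of a
'write only' sector. It is purely illustrative: the class, its names and
both policies are hypothetical and say nothing about what the real
controller firmware does. It only shows why a controller that does not
remember the failed read never gets a chance to fix the defect.

```python
import errno


class ToyController:
    """Toy model of a non-redundant array; NOT real controller firmware."""

    def __init__(self, defective_sectors, remember_failed_reads):
        # Sectors that accept writes silently but fail on read.
        self.defective = set(defective_sectors)
        self.remember_failed_reads = remember_failed_reads
        # Sectors flagged for reassignment on their next write.
        self.pending_reassign = set()
        self.data = {}

    def write(self, sector, payload):
        if sector in self.pending_reassign:
            # Remap to a spare sector: the defect no longer matters.
            self.defective.discard(sector)
            self.pending_reassign.discard(sector)
        # No read-after-write verify, so every write appears to succeed.
        self.data[sector] = payload

    def read(self, sector):
        if sector in self.defective:
            if self.remember_failed_reads:
                self.pending_reassign.add(sector)
            # No redundancy, so the data cannot be reconstructed.
            raise OSError(errno.EIO, "unrecoverable read error")
        return self.data[sector]


def survives_rewrite(remember_failed_reads):
    """Write, fail a read, rewrite, then report whether a read works."""
    ctrl = ToyController({7}, remember_failed_reads)
    ctrl.write(7, b"old")
    try:
        ctrl.read(7)
    except OSError:
        pass  # this is the error that gets reported to the OS
    ctrl.write(7, b"new")
    try:
        return ctrl.read(7) == b"new"
    except OSError:
        return False


if __name__ == "__main__":
    print("remembers failed reads:", survives_rewrite(True))
    print("forgets failed reads:  ", survives_rewrite(False))
```

  In the model, the 'remember' policy recovers as soon as the sector is
rewritten, while the 'forget' policy keeps failing on every later read.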
  Currently I *hope* the controller did the right thing, but I will
keep an eye on it (it is also possible that my defect was
the nice kind, and even writes will fail in the future on those sectors).
If it did not, I will have to take the whole thing off-line,
move the drive to a regular SCSI bus and use 'scu' to get rid of the
defect. There does not seem to be an 'scu' equivalent for the RAID controller.

Gyula
Received on Wed Apr 09 1997 - 05:17:02 NZST
