SUMMARY(2): swcxr rz28s tripping offline

From: Paul M. Aoki <aoki_at_CS.Berkeley.EDU>
Date: Wed, 15 Jan 97 16:40:58 -0800

well, i said i'd report back when i'd had a chance to muck with this
some more...

i brought the 2100/A500MP (aka 4/200) down over winter break and
checked out the firmware. the "old" (read: working) disks were all
rz28's at f/w rev 441C, and the two swxcr controllers were at f/w rev
1.99 (!). the "new" disks (added later) were a combination of rz28's
(at f/w revs 442D and T436) and rz28m's (at f/w rev 568). the "right
answer" was obviously to update all of the f/w, but that option gave
me the willies since i didn't have enough cold spares to recover
smoothly if a large number of disks failed after update. instead, i
moved things around so that the "new" disks were on a regular scsi
controller and the swxcr-controlled racks contained identical disks
(all at 441C). this configuration appears to be stable.

so: the advice about using identical disks at identical f/w levels
appears to be very sound. (at least, if you have old swxcr f/w; we'll
update f/w when we do the next operating system upgrade.)
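
a minimal sketch of how mixed firmware could be spotted from a
hand-kept inventory. the input format (one "slot model fwrev" line per
disk) and the name mixed_fw are assumptions made for illustration;
neither swxcrmgr nor scu is known to emit this format.

```shell
#!/bin/sh
# mixed_fw: read "slot model fwrev" lines on stdin (a hypothetical,
# hand-maintained inventory format) and print every model that shows
# up with more than one firmware rev, as "model: rev1,rev2,...".
mixed_fw() {
    awk '
        NF >= 3 && $1 !~ /^#/ {
            key = $2 SUBSEP $3
            if (!(key in seen)) {          # first time we see model+rev
                seen[key] = 1
                count[$2]++
                revs[$2] = revs[$2] (revs[$2] == "" ? "" : ",") $3
            }
        }
        END {
            for (m in count)
                if (count[m] > 1)
                    print m ": " revs[m]
        }
    ' | sort
}
```

fed the "old" and "new" disks described above (rz28 at 441C, 442D and
T436, rz28m at 568), it prints "rz28: 441C,442D,T436"; a rack holding
only 441C disks prints nothing. it only compares firmware within a
model, so mixed models still have to be checked by eye.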

the following things do not appear to have been factors, or didn't
work:

1. i don't think logical disk initialization was an issue for us,
since we have only been using JBOD; our last sysadmin never
initialized the disks he installed, and uninitialized disks at the
right f/w level have been problem-free (aside from the constant
complaints from swxcrmgr). mind you, *i* have been initializing new
disks. i'm only mentioning this because it might help someone else
who is troubleshooting.

2. the other thing i tried (as a quick hack, to check out the spindown
theory) was a script that touched each swxcr-controlled file system
once every X seconds and then sync'd. X=5 might have been working --
at least, the swxcr didn't fail any disks for the week or so while i
was using that setting. X=60 definitely didn't work.

3. when i moved the "new" drives to a standard scsi bus, i did a media
verification using scu and didn't find any bad blocks. my original
hypothesis (just plain bad disks) doesn't hold up.
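
the keepalive hack from item 2 can be sketched roughly as follows.
this is a reconstruction under assumptions, not the script that was
actually run: the function names, the marker file name and the mount
points are all placeholders.

```shell
#!/bin/sh
# touch_all: touch a marker file on each listed mount point, then sync,
# so that every file system sees a write. ".keepalive" is an arbitrary
# (hypothetical) file name.
touch_all() {
    for fs in "$@"; do
        touch "$fs/.keepalive" || echo "touch failed on $fs" >&2
    done
    sync
}

# keepalive: repeat touch_all every INTERVAL seconds, forever.
keepalive() {
    interval=$1; shift
    while :; do
        touch_all "$@"
        sleep "$interval"
    done
}

# example (mount points are placeholders):
#   keepalive 5 /disk1 /disk2 &
```

the interval is the X discussed in item 2: 5 seconds may have been
short enough, 60 was not.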

thanks again to:

        Todd Acheson <acheson_at_oak.cats.ohiou.edu>
        Javier Aida <jaida_at_gmd.com.pe>
        tra_at_ucolick.org (Ted Asocks)
        "Edward C. Bailey" <ed_at_pigdog.niehs.nih.gov>
        "Dave Golden" <golden_at_falcon.invincible.com>
        "William H. Magill" <magill_at_isc.upenn.edu>
        Kevin Reardon <kreardon_at_cerere.na.astro.it>
        Eric.Rostetter_at_utoledo.edu
        sanghvi_at_proto.wilm.ge.com (arun sanghvi)
        Tom Webster <webster_at_europa.mdc.com>

------- Forwarded Messages

 From: aoki_at_CS.Berkeley.EDU (Paul M. Aoki)
 To: alpha-osf-managers_at_ornl.gov
 Subject: swcxr rz28s tripping offline
 Date: Mon, 02 Dec 96 14:11:15 -0800

 i recently inherited a 2100/500MP and have been tangling with the
 %$#@ swxcr ever since. here's the story:
 
 the last sysadmin popped a bunch of new (in the sense that they are
 replacements provided by dec in sealed plastic bags --
 remanufactured?) rz28's and rz28m's into the machine, marked them as
 optimal, disklabelled them as 'rz28' (or left them unlabelled) and
 then newfs'd them. the disks run for a few days, then trip offline
 suddenly. sometimes there are references to bad block reads (e.g.,
 during dumps). recovery requires use of swxcrmgr to reset them as
 'optimal', but this doesn't last very long, which is annoying.
 
 i've seen reference in the archives to the following:
 
 - you need to low-level format the drives before putting them under
   swxcr control
 - the rz28's need to have certain firmware revs before being put under
   swxcr control
 - you need to disklabel as 're' or 'SWXCR', not as rz28
 
 i suspect these things just have a bunch of bad blocks or are
 otherwise defective, but if anyone can:
 
 - refer me to the documentation that says the things above
 - provide some good war stories that might provide clues
 
 i'd appreciate it greatly. things were more or less fine until this
 last batch of disks went in, which suggests that either we got a bunch
 of defective disks (of varying model) or the last sysadmin forgot a
 step.
 --
   Paul M. Aoki | University of California at Berkeley
   aoki_at_CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776
                        | Berkeley, CA 94720-1776

------- Message 2

     If the drives attached to the swxcr are set up with any RAID level (ie,
 not set up as JBOD), you most definitely need to use swxcrmgr (or srlmgr)
 to initialize the drives. I am not sure if you have to have swxcrmgr
 initialize JBOD drives, but it wouldn't surprise me...
 
------- Message 3

 I can't be explicit with documentation, but with the experience i've
 had with around 30 installations involving SWXCR, i can give you some
 clues:
 
 - An old firmware version on the controller (if PCI, you'll need
 2.36; if EISA, you'll need 2.16) will cause very, very weird
 problems.
 - When you created the units (RAID 0, 1, 5) or JBOD on SWXCRMGR, you
 needed to init them as indicated in the SWXCR manual. A failure in
 doing this may cause you unexpected problems while accessing disks,
 even after working with them for some time.
 - If you have disklabeled the disks as RZ28, you're wrong, you'll need
 to init them as "swxcr". For example:
 
         # disklabel -rw re2 swxcr
 
 Hope this helps
 
------- Message 4

 I also ran into this problem. I managed to get a firmware
 upgrade floppy from my FE but have not been successful in
 upgrading the firmware.
 
------- Message 5

 I have a similar setup, except my OS is on the re device. Our SWXCR
 problems resulted in system hangs and crashes. Finally dug up a
 patch that solved the problem (on OSF/1 3.2). If you don't have
 this patch installed and are at 3.2 I'd recommend it. Hope this
 helps.

 PATCH ID: OSF320-228 Subset(s): OSFBINCOM320,OSFHWBIN320
 Supersede OSF320-137, OSF320-193
 
------- Message 6

 What RAID configuration are you running the disks under: 0, 0+1, 3, 5,
 JBoD?
 
 From the description of the disks as being labeled 'rz28', it sounds
 like you are set up for JBoD (Just a Bunch of Disks), which is a method
 of just hooking a bunch of disks onto the RAID controllers and trying
 to use them as normal disks. I've never tried it, so I can't really
 offer any advice if this is your configuration.
 
> i've seen reference in the archives to the following:
>
> - you need to low-level format the drives before putting them under
> swxcr control
 
 The swxcrmgr (or srlmgr) software should take care of this for you,
 the biggest problem is that you have to halt the system and run it from
 the console. See my comments on the standalone config utility below....
 
> - the rz28's need to have certain firmware revs before being put under
> swxcr control
 
 Check with DEC if you can. For most of the rzXX disks, firmware is now
 considered customer installable and if you are on a service contract
 they will make the images available for DL. Make sure they are aware
 that the disks will be used with a hardware RAID controller. I was never
 clear if this was a DEC firmware problem or a Seagate (used as 3rd party
 disks), but there were rumors of certain firmware versions doing an
 'energy star' spin-down after a period of inactivity. The swxcr would
 see the disk had stopped spinning and fail it. (Again this is just a
 rumor as far as I can verify.)
 
 One important thing is to try to make sure that all of the drives are
 at the same firmware level.
  
> - you need to disklabel as 're' or 'SWXCR', not as rz28
 
 The 're' disks are variable geometry disktabs, used for arrays whose
 size is computed rather than fixed. Just take a quick look in your
 /etc/disktab to get an idea of how it differs from normal disktab
 entries.
 
> i suspect these things just have a bunch of bad blocks or are
> otherwise defective, but if anyone can
 
 I don't really know how well you know the swxcr stuff, so forgive
 me if I'm being a little remedial.
 
 In order to do anything more than parity checking and rebuilding
 drives (after replacing the failed drive), you are going to have
 to take the system down and use one of the stand-alone configuration
 utilities from the console. The latest version of these utilities that
 I am aware of is 3.11 and you need to use that to be compatible with
 the later versions of DU 3.2. The utilities should have come on
 a floppy disk (along with instructions).
 
 If you can't find the floppy, all is not lost: the later versions of
 the Firmware CDs have the utilities.
 
 I should say something about which utility to use: if you have a
 graphics head and a PC style keyboard, use the swxcrmgr.exe utility.
 If you are using a serial terminal instead of a graphics head: connect a
 PC or Mac to the serial line (via a null modem and your choice of
 terminal emulation software), and use the srlmgr.exe utility.
 
 The swxcrmgr software expects a graphic head and runs ~200-1,000%
 slower on a serial head. The srlmgr utility was designed for
 use with serial heads and works well in this mode, but still expects
 a PC style keyboard.
 
 So, now you are thinking to yourself: OK, I have the firmware CD
 (we'll use the v3.5 CD as an example), but how do I run a program
 off of it from the SRM console? Here is how we run srlmgr on our
 8400 (which has no graphics head):
 
 1. Halt the system down to the SRM Console (">>>" prompt).
 
 2. Do a "sho dev" to figure out the device name of your
    CD-ROM drive, i.e. DKd500.
    
 3. Load support for iso9660 file systems (if your SRM console
    can't do this, you are going to have to make a floppy).
    
     >>> load -f iso9660
    
 4. List the directory, to make sure you are in the right place....
 
     >>> ls iso9660:[CDROM.UTILITY]/DKd500

     >>> ls iso9660:[CDROM.UTILITY.SWXCRMGR]/DKd500
    
    . . .
    
     ad nauseam....
     
 5. Run the srlmgr. (The -p1 tells it to look for the controllers
    on the first PCI bus.)
 
     >>> run iso9660:[CDROM.UTILITY.SWXCRMGR]SRLMGR.EXE -d DKd500 -p1
    
 NOTE: The drive name in the form "DKd500" IS case sensitive and it
 will cause commands to fail if you mess up the case on the drive
 name.
 
 If you need to make floppies, you can copy the utilities to a FAT
 formatted disk using mtools or a PC with a CDROM drive.
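
 A minimal sketch of the mtools route, assuming the CD is already
 mounted somewhere and that "a:" is mapped to your floppy device in
 the mtools configuration. The function name copy_utils, the MCOPY
 dry-run switch and the source path in the example are made up for
 illustration; only mcopy itself is a real mtools command.

```shell
#!/bin/sh
# copy_utils: copy each named file to the MS-DOS floppy known to
# mtools as "a:". Set MCOPY=echo to print the commands instead of
# running them (a hypothetical dry-run convenience).
copy_utils() {
    for f in "$@"; do
        ${MCOPY:-mcopy} "$f" a: || return 1
    done
}

# example (the mount point and file name case are placeholders):
#   copy_utils /mnt/utility/swxcrmgr/srlmgr.exe
```

 Running it once with MCOPY=echo shows exactly what would be copied
 before a floppy is touched.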
 
 Hope this helps,

------- Message 7

 Wow, I was just about to write a message to the mailing-list about the same
 problem! I have had the same problem with _one_ disk out of five (rz29b-vw,
 all unlabelled) that we have installed identically in a SWXCR of a 2100
 4/275. The fact that this happens with only one disk, even in different
 slots of the RAID controller, leads me to believe it is a problem with the
 disk itself being faulty.
 
 I found that, like you said, if I only re-optimized the disk with swxcrmgr
 after a "failure", it would be failed again soon thereafter. Instead, if I
 optimize, unmount the disk, run fsck, and remount it, it seems to work
 fine...for a while. I have had the problem three times with this disk in
 about six months. I want to have the disk replaced, especially before its
 warranty runs out, but the first time the Digital technician came, switched
 the disk to another slot in the RAID controller, and said, "the disk is
 fine, no reason to replace it." So if I can be sure it is a problem with
 the disk itself, I can push to get a replacement.
 
 I assume some error develops on the disk, the SWXCR reaches a limit in the
 number of errors allowed from the disk, and fails it (see section 5.2 of
 the "StorageWorks RAID Array 200 Online Management Utility for Digital
 Unix"). I now have swxcrmon running so I'll see the error messages next
 time it fails, but that could be a couple of months!
 
 Anyway, sorry I can't be of much help other than to say you are not alone!
 Please let us know what you find out. Thanks.

------- Message 8

> - you need to low-level format the drives before putting them under
> swxcr control
 
 Couldn't hurt.
 
> - the rz28's need to have certain firmware revs before being put under
> swxcr control
 
 I could probably be more helpful if you could go into the SWXCR utility
 and get the drive model and firmware rev. from the drive information
 screen.
 
> - you need to disklabel as 're' or 'SWXCR', not as rz28
 
 This might mess up your filesystem, but it would not make the drives
 go offline.
 
> i'd appreciate it greatly. things were more or less fine until this
> last batch of disks went in, which suggests that either we got a bunch
> of defective disks (of varying model) or the last sysadmin forgot a
> step.
 
 I'm assuming that they are all set up as individual disks (JBOD).
 FYI -- there is also a utility which runs under digital UNIX which will
 allow you to do some monitoring of the swxcr and mark the drives
 optimal. It comes on one of the floppies in the white StorageWorks
 RAID Array 230 subsystem software box. If you don't have that software
 kit, the part number is qb-2xhah-sb. It's under $200 and contains
 all the manuals, software, etc.
 
 Good luck,
 
------- Message 9

 I have posted my detailed experience with RZ29s in SWXCR that had bad
 firmware (older than 0014); see the archives. They would basically
 time out. The RZ29s have some kind of power saving option that would
 cause timeouts. The solution was to upgrade the firmware. It fixed
 the problem.
 
 I haven't used RZ28s in the SWXCR so don't know if similar problems
 exist.
 
 I have been told to make sure that all drives have the same level of
 firmware - don't mix levels.
 
 I would contact DEC to determine the current latest version of
 firmware for the RZ28s you are using and then check the drives to make
 sure they are all up to date.
 
 You can use scu to read the firmware level, but not in the swxcr
 array. As far as I know you have to connect them to one of the system
 scsi controllers to use scu. I had a storage works pizza box on my
 external scsi connection that I used.
 
------- Message 10

> - you need to low-level format the drives before putting them under
> swxcr control
> - the rz28's need to have certain firmware revs before being put under
> swxcr control
> - you need to disklabel as 're' or 'SWXCR', not as rz28
 
 There is indeed a certain level and/or type of rz28 needed to work with
 the swxcr controller... I can't find my doc on it (I know I have it
 around here somewhere) but DEC ought to be able to tell you what the
 minimum version of rz28 is that works with the swxcr controller.
 
> - refer me to the documentation that says the things above
> - provide some good war stories that might provide clues
 
 I have (somewhere) a doc on what level of rz28 you need, but other than
 that, all I can say is I have a bunch of rz28 and rz29 drives (about 34
 in total, over half rz28 drives) on two 2100's (a 4/200 and a 4/275)
 both with swxcr controllers, and have never had a problem except for
 two disk failures (both within a week of installation, one on each machine --
 since we installed them 16 at a time, I don't see 2 bad out of 32 drives,
 both failing within a week after installation, and no problems since then,
 as a big problem).
 
------- Message 11

 We just went through hell with this problem.
 It IS a Firmware problem with the drives.
 
 The RZ28 and RZ29 drives (Seagates I think) "spin down," when not accessed.
 
 I don't recall the gory details but I think they are 7200 RPM drives that
 run hot. In order to use them in SW enclosures they were allowed to spin
 down to ?5200? RPM when not accessed in some period of time (very short,
 like seconds). When the SWXCR goes to access them, it thinks they are
 down, because they are "not ready".
 
 Naturally I don't recall the rev levels, but the drives MUST be at the
 highest level - and the ones DEC shipped us in September must have come
 from the "back of the shelf" because they were out of rev. The drives are
 only a problem in the SWXCR controller.
 
------- Message 12

> still haven't checked the firmware levels (i don't know of any way to
> do that with the drives still in the internal shelves) but that will
> pop to the top of my list when i take the machine down.

 That's one of the reasons I've come to hate the swxcr.
 It is really a damn PC part and doesn't integrate at all with DU.
 The only thing I can say about it is - when it works, it works.
 It's hell when it doesn't because Field Engineering doesn't know
 squat about it either and now that Digital sold off Storage Works
 it is a real pain in the butt.
 
 By the way, one other aspect of this - I believe that this problem is
 specific to VW drives; ie fast wide.
 
------- Message 13

 I wasn't clear on what style RAID you were using.
 
 We use RAID level 5 with "hot spares".
 
 If you use the same you could pull drives from the live machine to
 check firmware levels - though this would degrade performance.
 
 DEC is pretty flaky about the firmware issues.
 
 Also, as others have mentioned be sure the SWXCR firmware is up to
 date.

------- End of Forwarded Messages

--
  Paul M. Aoki         | University of California at Berkeley
  aoki_at_CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776
                       | Berkeley, CA 94720-1776
Received on Thu Jan 16 1997 - 01:58:20 NZDT
