SUMMARY: swcxr rz28s tripping offline

From: Paul M. Aoki <aoki_at_CS.Berkeley.EDU>
Date: Tue, 03 Dec 96 13:26:20 -0800

many thanks to the following folks (in alphabetic order, not message
order; call me old-fashioned, i always feel kind of uncomfortable
forwarding other people's mail) for their speedy replies!

        Todd Acheson <acheson_at_oak.cats.ohiou.edu>
        Javier Aida <jaida_at_gmd.com.pe>
        tra_at_ucolick.org (Ted Asocks)
        "Edward C. Bailey" <ed_at_pigdog.niehs.nih.gov>
        "Dave Golden" <golden_at_falcon.invincible.com>
        Kevin Reardon <kreardon_at_cerere.na.astro.it>
        Eric.Rostetter_at_utoledo.edu
        sanghvi_at_proto.wilm.ge.com (arun sanghvi)
        Tom Webster <webster_at_europa.mdc.com>

this has given me a bunch of stuff to check out. i'll report back
when (if :-) i figure out exactly what's up.

thanks again to all!

------- Forwarded Messages

 From: aoki_at_CS.Berkeley.EDU (Paul M. Aoki)
 To: alpha-osf-managers_at_ornl.gov
 Subject: swcxr rz28s tripping offline
 Date: Mon, 02 Dec 96 14:11:15 -0800

 i recently inherited a 2100/500MP and have been tangling with the
 %$#@ swxcr ever since. here's the story:
 
 the last sysadmin popped a bunch of new (in the sense that they are
 replacements provided by dec in sealed plastic bags --
 remanufactured?) rz28's and rz28m's into the machine, marked them as
 optimal, disklabelled them as 'rz28' (or left them unlabelled) and
 then newfs'd them. the disks run for a few days, then trip offline
 suddenly. sometimes there are references to bad block reads (e.g.,
 during dumps). recovery requires use of swxcrmgr to reset them as
 'optimal', but this doesn't last very long, which is annoying.
 
 i've seen reference in the archives to the following:
 
 - you need to low-level format the drives before putting them under
   swxcr control
 - the rz28's need to have certain firmware revs before being put under
   swxcr control
 - you need to disklabel as 're' or 'SWXCR', not as rz28
 
 i suspect these things just have a bunch of bad blocks or are
 otherwise defective, but if anyone can:
 
 - refer me to the documentation that says the things above
 - provide some good war stories that might provide clues
 
 i'd appreciate it greatly. things were more or less fine until this
 last batch of disks went in, which suggests that either we got a bunch
 of defective disks (of varying model) or the last sysadmin forgot a
 step.
 --
   Paul M. Aoki | University of California at Berkeley
   aoki_at_CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776
                        | Berkeley, CA 94720-1776

------- Message 2

     If the drives attached to the swxcr are set up with any RAID level (i.e.,
 not set up as JBOD), you most definitely need to use swxcrmgr (or srlmgr)
 to initialize the drives. I am not sure if you have to have swxcrmgr
 initialize JBOD drives, but it wouldn't surprise me...
 
------- Message 3

 I can't point to specific documentation, but from the experience I've
 had with around 30 installations involving SWXCR, I can give you some
 clues:
 
 - An old firmware version on the controller (if PCI, you'll need
   2.36; if EISA, you'll need 2.16) will cause very, very weird
   problems.
 - When you create the units (RAID 0, 1, 5) or JBOD in SWXCRMGR, you
   need to init them as indicated in the SWXCR manual. Failing to do
   this may cause unexpected problems while accessing the disks,
   even after working with them for some time.
 - If you have disklabeled the disks as RZ28, that's wrong; you'll need
   to label them as "swxcr". For example:
 
         # disklabel -rw re2 swxcr
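
 The controller firmware levels from the first point can be checked with
 a small shell function. This is only a sketch: it reads 2.36 (PCI) and
 2.16 (EISA) as minimum levels and compares versions as numeric
 major.minor pairs, both of which are assumptions, and the function name
 meets_minimum is made up for illustration. The version itself still has
 to be read off the controller by hand.

```shell
#!/bin/sh
# Sketch: does a controller firmware version meet the level quoted above?
# Assumes versions look like "major.minor" and that 2.36 / 2.16 are minimums.
meets_minimum() {   # usage: meets_minimum <PCI|EISA> <version>
    case $1 in
        PCI)  min=2.36 ;;
        EISA) min=2.16 ;;
        *)    echo "unknown bus: $1" >&2; return 2 ;;
    esac
    # Compare major and minor as integers, not as decimal fractions.
    awk -v have="$2" -v min="$min" 'BEGIN {
        split(have, h, "."); split(min, m, ".")
        ok = (h[1] + 0 > m[1] + 0) || (h[1] + 0 == m[1] + 0 && h[2] + 0 >= m[2] + 0)
        exit (ok ? 0 : 1)
    }'
}
```

 For example, "meets_minimum PCI 2.16" returns a non-zero status, while
 "meets_minimum EISA 2.16" returns zero.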
 
 Hope this helps
 
------- Message 4

 I also ran into this problem. I managed to get a firmware
 upgrade floppy from my FE but have not been successful in
 upgrading the firmware.
 
------- Message 5

 I have a similar setup, except my OS is on the re device. Our SWXCR
 problems resulted in system hangs and crashes. Finally dug up a
 patch that solved the problem (on OSF/1 3.2). If you don't have
 this patch installed and are at 3.2 I'd recommend it. Hope this
 helps.
 
 Here is the README:
 
 ===================================================================
 
 To apply the updated o-files, .c and .h files, do the following:
 
   - Please place a copy of this README.patch_id into a directory called
     /etc/patches for future reference on what patches are installed on this
     machine as these patches will have to be removed before upgrading to
     a newer version.
   - For Digital UNIX V4.0 binary files, please refer to the patch README and
     Appendix B for special installation instructions.
   - For OSF V3.2g and below, please follow the below instructions and any
     other special instructions in the patch README.
   - make a backup copy of the existing /sys/BINARY/o-files.
     if .o files are included.
   - make a backup copy of the existing .h file(s) as indicated in
     list of files if .h file(s) are included.
   - make a backup copy of the existing .c file(s) as indicated in
     list of files if .c file(s) are included.
   - copy all the o-file(s) to the host machine's /sys/BINARY/
     directory.
   - copy all the .h file(s) to the appropriate directory.
   - copy all the .c file(s) to the appropriate directory.
   - rebuild the host machine's kernel using doconfig (-c $HOSTNAME).
   - copy the new kernel from /sys/$HOSTNAME/vmunix to /vmunix
   - reboot the host machine
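
 For the pre-V4.0 case, the steps above can be sketched as a dry-run
 shell function that only prints the commands it would run. The function
 name, the patch directory argument and the backup directory are
 hypothetical; only the object files are handled, since the README
 leaves the .h and .c destinations to the list of files.

```shell
#!/bin/sh
# Dry-run sketch of the README steps: prints commands, runs nothing.
plan_patch_install() {   # usage: plan_patch_install <config-name> <patch-dir>
    host=$1                           # name passed to doconfig -c
    patch_dir=$2                      # unpacked patch files (assumed layout)
    backup_dir=/sys/BINARY.prepatch   # hypothetical backup location

    echo "mkdir -p /etc/patches $backup_dir"
    echo "cp $patch_dir/README.patch_id /etc/patches/"
    for obj in "$patch_dir"/*.o; do
        [ -e "$obj" ] || continue     # no .o files in this patch
        name=${obj##*/}
        echo "cp -p /sys/BINARY/$name $backup_dir/"   # back up the old o-file
        echo "cp $obj /sys/BINARY/"                   # install the new one
    done
    echo "doconfig -c $host"
    echo "cp /sys/$host/vmunix /vmunix"
    echo "reboot"
}
```

 The printed list is meant to be reviewed and then run by hand, one
 command at a time.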
 
 Checksums (produced by the "sum" command) are listed below.
 (Notice - The sum for the README file may not match.)
 
 README for Specific patches.
 
 ===============================================================================
 
 /usr/sys/include/io/dec/eisa/xcr_port.h subset OSFBINCOM320
 CHECKSUM: 09687 14 RCS: 1.1.20.2 (xcr_port.h)
 /usr/sys/include/io/dec/eisa/re.h subset OSFBINCOM320
 CHECKSUM: 28791 13 RCS: 1.1.16.2 (re.h)
 /usr/sys/BINARY/re_driver.o subset OSFHWBIN320
 CHECKSUM: 25716 52 RCS: 1.1.17.3 (re_driver.c)
 /usr/sys/BINARY/xcr_port.o subset OSFHWBIN320
 CHECKSUM: 52715 67 RCS: 1.1.26.4 (xcr_port.c)
 /usr/sys/BINARY/xcr_logger.o subset OSFHWBIN320
 CHECKSUM: 03933 20 RCS: 1.1.4.4 (xcr_logger.c)
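
 The checksums listed above can be compared against the installed files
 with a small wrapper around "sum". This is a sketch under the
 assumption that the local sum(1) prints a zero-padded checksum followed
 by a block count, as in the list; other sum variants may use a
 different algorithm or block size. The function name check_sum is
 made up for illustration.

```shell
#!/bin/sh
# Compare a file's "sum" output with the checksum and block count from the README.
check_sum() {   # usage: check_sum <file> <checksum> <blocks>
    file=$1; want_sum=$2; want_blocks=$3
    # Read from stdin so sum does not append the file name to its output.
    set -- $(sum < "$file")
    if [ "$1" = "$want_sum" ] && [ "$2" = "$want_blocks" ]; then
        echo "OK $file"
    else
        echo "MISMATCH $file: got $1 $2, expected $want_sum $want_blocks"
        return 1
    fi
}
```

 For example: check_sum /usr/sys/BINARY/re_driver.o 25716 52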
 ---------------------
 
 PATCH ID: OSF320-228 Subset(s): OSFBINCOM320,OSFHWBIN320
 Supersede OSF320-137, OSF320-193
 
 Installation Instructions: A kernel rebuild is required.
 
 
 PROBLEM: (QAR 29902) (Patch ID: OSF320-137)
 ********
 Installation to a disk connected by a PCI Raid disk controller
 can fail with the following errors:
 
 The following is the output of the session on the AlphaServer 2000
 after selecting an advanced installation to re0
 (device 0 on PCI, controller 0):
                                 .
                                 .
          Use SWXCR, re0, for your system disk? [] y
                                 .
                                 .
          Select the file system type for the root file system
          (advfs/ufs):[ufs]advfs
 
          Initializing the System Disk SWXCR, re0.....
          XCR_logger: XCR_ERROR packet
          XCR_Logger: cntrl 0 unit 0
          re_complete
          I/O failed
          Hard Error Detected
          ACTIVE XCR_COM at time of error
          ACTIVE CONTROLLER working set at time of error
          [repeated several times and then hangs indefinitely]
 
 The following is the output of the session on the AlphaServer 1000
 after selecting an advanced installation to re0
 (device 0 on EISA, controller 0):
                                 .
                                 .
          Use SWXCR, re0, for your system disk? [] y
                                 .
                                 .
          Select the file system type for the root file system
          (advfs/ufs):[ufs]advfs
 
          Initializing the System Disk SWXCR, re0.....
          ERROR:
          The installation procedure is unable to initialize the
          system disk label
 
          disk label diagnostics:
          disklabel: ioctl DIOCWDINFO: no disk label on disk.
          use "disklabel -wr" to install initial label
 
 
 PROBLEM: (QAR 30337) (Patch ID: OSF320-137)
 ********
 Astro errors have been seen under V3.2 and V3.2B. The errors may be followed
 by a panic. The database which is being built is always corrupted by the error.
 
 The Astro errors look like this
 ...........................................................................
 
 EVENT CLASS ERROR EVENT
 OS EVENT TYPE 198. ASTRO CONTROLLER
 ROUTINE NAME xcrintr
 ----- CAM STRING -----
                                         No SLOT_CMD_ACTIVE bit set
 ...........................................................................
 
 EVENT CLASS ERROR EVENT
 OS EVENT TYPE 198. ASTRO CONTROLLER
 ROUTINE NAME xcr_cmd_timeout
 ----- CAM STRING -----
                                         Command has timed out
 
 
 PROBLEM: (DMO100161, QAR39130) (Patch ID: OSF320-193)
 ********
 System panics with "xcr_que_insert list corruption"
 
 Stack trace:
 
> 0 stop_secondary_cpu() ["../../../../src/kernel/arch/alpha/cpu.c":352, 0xfffffc00004d9598]
[...]
   14 _Xsyscall(0x8, 0x3ff800d5e08, 0x1400099e0, 0x3, 0xc0207605) ["../../../../src/kernel/arch/alpha/locore.s":1086, 0xfffffc00004dca44]
 
 
 PROBLEM: (QAR 41490) (Patch ID: OSF320-228)
 ********
 In some adverse situations the SWXCR controller may hang.
 
 There is a SWXCR Driver timing issue where a rebuild in process can
 cause the controller to delay I/O completion and timeout commands.
 The driver will start to reset the controller and eventually the
 controller will hang. This is evident by the fact that I/Os issued
 to that SWXCR controller do not complete.
 
------- Message 6

 What RAID configuration are you running the disks under: 0, 0+1, 3, 5,
 JBoD?
 
 From the description of the disks as being labeled 'rz28', it sounds
 like you are set up for JBoD (Just a Bunch of Disks), which is a method
 of just hooking a bunch of disks onto the RAID controllers and trying
 to use them as normal disks. I've never tried it, so I can't really
 offer any advice if this is your configuration.
 
> i've seen reference in the archives to the following:
>
> - you need to low-level format the drives before putting them under
> swxcr control
 
 The swxcrmgr (or srlmgr) software should take care of this for you;
 the biggest problem is that you have to halt the system and run it from
 the console. See my comments on the standalone config utility below....
 
> - the rz28's need to have certain firmware revs before being put under
> swxcr control
 
 Check with DEC if you can. For most of the rzXX disks, firmware is now
 considered customer installable and if you are on a service contract
 they will make the images available for DL. Make sure they are aware
 that the disks will be used with a hardware RAID controller. I was never
 clear if this was a DEC firmware problem or a Seagate problem (used as
 3rd party disks), but there were rumors of certain firmware versions
 doing an 'energy star' spin-down after a period of inactivity. The swxcr
 would see the disk had stopped spinning and fail it. (Again this is just
 a rumor as far as I can verify.)
 
 One important thing is to try to make sure that all of the drives are
 at the same firmware level.
  
> - you need to disklabel as 're' or 'SWXCR', not as rz28
 
 The 're' disks are variable geometry disktabs, used for arrays whose
 size is computed rather than fixed. Just take a quick look in your
 /etc/disktab to get an idea of how it differs from normal disktab
 entries.
 
> i suspect these things just have a bunch of bad blocks or are
> otherwise defective, but if anyone can
 
 I don't really know how well you know the swxcr stuff, so forgive
 me if I'm being a little remedial.
 
 In order to do anything more than parity checking and rebuilding
 drives (after replacing the failed drive), you are going to have
 to take the system down and use one of the stand-alone configuration
 utilities from the console. The latest version of these utilities that
 I am aware of is 3.11 and you need to use that to be compatible with
 the later versions of DU 3.2. The utilities should have come on
 a floppy disk (along with instructions).
 
 If you can't find the floppy, all is not lost: the later versions
 of the Firmware CDs have the utilities.
 
 I should say something about which utility to use: if you have a
 graphics head and a PC style keyboard, use the swxcrmgr.exe utility.
 If you are using a serial terminal instead of a graphics head, connect
 a PC or Mac to the serial line (via a null modem and your choice of
 terminal emulation software), and use the srlmgr.exe utility.
 
 The swxcrmgr software expects a graphic head and runs ~200-1,000%
 slower on a serial head. The srlmgr utility was designed for
 use with serial heads and works well in this mode, but still expects
 a PC style keyboard.
 
 So, now you are thinking to yourself: OK, I have the firmware CD
 (we'll use the v3.5 CD as an example), but how do I run a program
 off of it from the SRM console? Here is how we run srlmgr on our
 8400 (which has no graphics head):
 
 1. Halt the system down to the SRM Console (">>>" prompt).
 
 2. Do a "sho dev" to figure out the device name of your
    CD-ROM drive, i.e. DKd500.
    
 3. Load support for iso9660 file systems (if your SRM console
    can't do this, you are going to have to make a floppy).
    
    >>> load -f iso9660
    
 4. List the directory, to make sure you are in the right place....
 
    >>> ls iso9660:[CDROM.UTILITY]/DKd500
    
    >>> ls iso9660:[CDROM.UTILITY.SWXCRMGR]/DKd500
    
    . . .
    
    ad nauseam....
    
 5. Run the srlmgr. (The -p1, tells it to look for the controllers
    on the first PCI bus.)
 
    >>> run iso9660:[CDROM.UTILITY.SWXCRMGR]SRLMGR.EXE -d DKd500 -p1
    
 NOTE: The drive name in the form "DKd500" IS case sensitive and it
 will cause commands to fail if you mess up the case on the drive
 name.
 
 If you need to make floppies, you can copy the utilities to a FAT
 formatted disk using mtools or a PC with a CDROM drive.
 
 Hope this helps,

------- Message 7

 Wow, I was just about to write a message to the mailing-list about the same
 problem! I have had the same problem with _one_ disk out of five (rz29b-vw,
 all unlabelled) that we have installed identically in a SWXCR of a 2100
 4/275. The fact that this happens with only one disk, even in different
 slots of the RAID controller, leads me to believe it is a problem with the
 disk itself being faulty.
 
 I found that, like you said, if I only re-optimized the disk with swxcrmgr
 after a "failure", it would be failed again soon thereafter. Instead, if I
 optimize, unmount the disk, run fsck, and remount it, it seems to work
 fine...for a while. I have had the problem three times with this disk in
 about six months. I want to have the disk replaced, especially before its
 warranty runs out, but the first time the Digital technician came, switched
 the disk to another slot in the RAID controller, and said, "the disk is
 fine, no reason to replace it." So if I can be sure it is a problem with
 the disk itself, I can push to get a replacement.
 
 I assume some error develops on the disk, the SWXCR reaches a limit in the
 number of errors allowed from the disk, and fails it (see section 5.2 of
 the "StorageWorks RAID Array 200 Online Management Utility for Digital
 Unix"). I now have swxcrmon running so I'll see the error messages next
 time it fails, but that could be a couple of months!
 
 Anyway, sorry I can't be of much help other than to say you are not alone!
 Please let us know what you find out. Thanks.

------- Message 8

> - you need to low-level format the drives before putting them under
> swxcr control
 
 Couldn't hurt.
 
> - the rz28's need to have certain firmware revs before being put under
> swxcr control
 
 I could probably be more helpful if you could go into the SWXCR utility
 and get the drive model and firmware rev. from the drive information
 screen.
 
> - you need to disklabel as 're' or 'SWXCR', not as rz28
 
 This might mess up your filesystem, but it would not make the drives
 go offline.
 
> i'd appreciate it greatly. things were more or less fine until this
> last batch of disks went in, which suggests that either we got a bunch
> of defective disks (of varying model) or the last sysadmin forgot a
> step.
 
 I'm assuming that they are all set up as individual disks (JBOD).
 FYI -- there is also a utility which runs under Digital UNIX which will
 allow you to do some monitoring of the swxcr and mark the drives
 optimal. It comes on one of the floppies in the white StorageWorks
 RAID Array 230 subsystem software box. If you don't have that software
 kit, the part number is qb-2xhah-sb. It's under $200 and contains
 all the manuals, software, etc.
 
 Good luck,
 
------- Message 9

 I have posted my detailed experience with RZ29s in SWXCR that had bad
 firmware (older than 0014); see the archives. Basically, they would
 time out. The RZ29s have some kind of power saving option that would
 cause timeouts. The solution was to upgrade the firmware. It fixed
 the problem.
 
 I haven't used RZ28s in the SWXCR so don't know if similar problems
 exist.
 
 I have been told to make sure that all drives have the same level of
 firmware - don't mix levels.
 
 I would contact DEC to determine the current latest version of
 firmware for the RZ28s you are using and then check the drives to make
 sure that they are all up to date.
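
 Once the revisions have been noted down, the "don't mix levels" check
 can be scripted. This is a sketch only: the input format (one
 "<drive> <revision>" pair per line, typed in by hand) and the function
 name same_firmware are assumptions, and it does not query the drives
 itself.

```shell
#!/bin/sh
# Report whether every drive listed on stdin carries the same firmware revision.
same_firmware() {
    awk '
        NF >= 2 { count[$2]++ }       # tally drives per revision
        END {
            n = 0
            for (rev in count) { n++; printf "%s: %d drive(s)\n", rev, count[rev] }
            exit (n == 1 ? 0 : 1)     # succeed only with exactly one revision
        }'
}
```

 Feed it a hand-made list, e.g. "same_firmware < drives.txt"; a non-zero
 exit status means the list is empty or the levels are mixed.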
 
 You can use scu to read the firmware level, but not in the swxcr
 array. As far as I know you have to connect them to one of the system
 scsi controllers to use scu. I had a storage works pizza box on my
 external scsi connection that I used.
 
------- Message 10

> - you need to low-level format the drives before putting them under
> swxcr control
> - the rz28's need to have certain firmware revs before being put under
> swxcr control
> - you need to disklabel as 're' or 'SWXCR', not as rz28
 
 There is indeed a certain level and/or type of rz28 needed to work with
 the swxcr controller... I can't find my doc on it (I know I have it
 around here somewhere) but DEC ought to be able to tell you what the
 minimum version of rz28 is that works with the swxcr controller.
 
> - refer me to the documentation that says the things above
> - provide some good war stories that might provide clues
 
 I have (somewhere) a doc on what level of rz28 you need, but other than
 that, all I can say is I have a bunch of rz28 and rz29 drives (about 34
 in total, over half rz28 drives) on two 2100's (a 4/200 and a 4/275)
 both with swxcr controllers, and have never had a problem except for
 two disk failures (both within a week of installation, one on each machine --
 since we installed them 16 at a time, I don't see 2 bad out of 32 drives,
 both failing within a week after installation, and no problems since then,
 as a big problem).

------- End of Forwarded Messages
Received on Tue Dec 03 1996 - 23:03:11 NZDT
