SUMMARY: swcxr rz28s tripping offline from Paul M. Aoki on 1996-12-04 (tru64-unix-managers)

From: Paul M. Aoki <aoki_at_CS.Berkeley.EDU>
Date: Tue, 03 Dec 96 13:26:20 -0800

many thanks to the following folks (in alphabetic order, not message
order; call me old-fashioned, i always feel kind of uncomfortable
forwarding other people's mail) for their speedy replies!

        Todd Acheson <acheson_at_oak.cats.ohiou.edu>
        Javier Aida <jaida_at_gmd.com.pe>
        tra_at_ucolick.org (Ted Asocks)
        "Edward C. Bailey" <ed_at_pigdog.niehs.nih.gov>
        "Dave Golden" <golden_at_falcon.invincible.com>
        Kevin Reardon <kreardon_at_cerere.na.astro.it>
        Eric.Rostetter_at_utoledo.edu
        sanghvi_at_proto.wilm.ge.com (arun sanghvi)
        Tom Webster <webster_at_europa.mdc.com>

this has given me a bunch of stuff to check out. i'll report back
when (if :-) i figure out exactly what's up.

thanks again to all!

------- Forwarded Messages

From: aoki_at_CS.Berkeley.EDU (Paul M. Aoki)
To: alpha-osf-managers_at_ornl.gov
Subject: swcxr rz28s tripping offline
Date: Mon, 02 Dec 96 14:11:15 -0800

i recently inherited a 2100/500MP and have been tangling with the
%$#@ swxcr ever since. here's the story:

the last sysadmin popped a bunch of new (in the sense that they are
replacements provided by dec in sealed plastic bags --
remanufactured?) rz28's and rz28m's into the machine, marked them as
optimal, disklabelled them as 'rz28' (or left them unlabelled) and
then newfs'd them. the disks run for a few days, then trip offline
suddenly. sometimes there are references to bad block reads (e.g.,
during dumps). recovery requires use of swxcrmgr to reset them as
'optimal', but this doesn't last very long, which is annoying.

i've seen reference in the archives to the following:

- you need to low-level format the drives before putting them under
   swxcr control
- the rz28's need to have certain firmware revs before being put under
   swxcr control
- you need to disklabel as 're' or 'SWXCR', not as rz28

i suspect these things just have a bunch of bad blocks or are
otherwise defective, but if anyone can:

- refer me to the documentation that says the things above
- provide some good war stories that might provide clues

i'd appreciate it greatly. things were more or less fine until this
last batch of disks went in, which suggests that either we got a bunch
of defective disks (of varying model) or the last sysadmin forgot a
step.
--
   Paul M. Aoki | University of California at Berkeley
   aoki_at_CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776
                        | Berkeley, CA 94720-1776

------- Message 2

     If the drives attached to the swxcr are set up with any RAID level (ie,
not set up as JBOD), you most definitely need to use swxcrmgr (or srlmgr)
to initialize the drives. I am not sure if you have to have swxcrmgr
initialize JBOD drives, but it wouldn't surprise me...

------- Message 3

I can't be explicit with documentation, but with the experience i've
had with around 30 installations involving SWXCR, i can give you some
clues:

- An old firmware version on the controller (if PCI, you'll need
  2.36; if EISA, you'll need 2.16) will cause very, very weird
  problems.
- When you created the units (RAID 0, 1, 5) or JBOD in SWXCRMGR, you
  needed to init them as indicated in the SWXCR manual. Failing to
  do this may cause unexpected problems while accessing disks,
  even after working with them for some time.
- If you have disklabeled the disks as RZ28, that's wrong; you'll need
  to label them as "swxcr". For example:

         # disklabel -rw re2 swxcr

Hope this helps

------- Message 4

I also ran into this problem. I managed to get a firmware
upgrade floppy from my FE but have not been successful in
upgrading the firmware.

------- Message 5

I have a similar setup, except my OS is on the re device. Our SWXCR
problems resulted in system hangs and crashes. Finally dug up a
patch that solved the problem (on OSF/1 3.2). If you don't have
this patch installed and are at 3.2 I'd recommend it. Hope this
helps.

Here is the README:

===================================================================

To apply the updated o-files, .c and .h files, do the following:

   - Please place a copy of this README.patch_id into a directory called
     /etc/patches for future reference on what patches are installed on this
     machine as these patches will have to be removed before upgrading to
     a newer version.
   - For Digital UNIX V4.0 binary files, please refer to the patch README and
     Appendix B for special installation instructions.
   - For OSF V3.2g and below, please follow the below instructions and any
     other special instructions in the patch README.
   - make a backup copy of the existing /sys/BINARY/o-files.
     if .o files are included.
   - make a backup copy of the existing .h file(s) as indicated in
     list of files if .h file(s) are included.
   - make a backup copy of the existing .c file(s) as indicated in
     list of files if .c file(s) are included.
   - copy all the o-file(s) to the host machine's /sys/BINARY/
     directory.
   - copy all the .h file(s) to the appropriate directory.
   - copy all the .c file(s) to the appropriate directory.
   - rebuild the host machine's kernel using doconfig (-c $HOSTNAME).
   - copy the new kernel from /sys/$HOSTNAME/vmunix to /vmunix
   - reboot the host machine
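
The .o-file path through the steps above can be sketched as an ordered
command list. This is a dry-run sketch only: it builds the commands as
strings and runs nothing. The backup directory and patch directory are
hypothetical names, not taken from the README; the doconfig and vmunix
steps are copied from the README text.

```python
def patch_install_plan(hostname, o_files,
                       patch_dir="/tmp/patch",            # hypothetical location of the unpacked patch
                       backup_dir="/etc/patches/backup"):  # hypothetical backup directory
    """Return the README's .o-file steps as shell command strings, in order."""
    steps = [f"mkdir -p {backup_dir}"]
    # Back up every existing object file before anything is overwritten.
    for name in o_files:
        steps.append(f"cp -p /sys/BINARY/{name} {backup_dir}/{name}")
    # Only then copy the patched object files into place.
    for name in o_files:
        steps.append(f"cp {patch_dir}/{name} /sys/BINARY/{name}")
    # Rebuild the kernel, install it, and reboot, as the README lists.
    steps.append(f"doconfig -c {hostname}")
    steps.append(f"cp /sys/{hostname}/vmunix /vmunix")
    steps.append("reboot")
    return steps


if __name__ == "__main__":
    for cmd in patch_install_plan("MYHOST", ["re_driver.o", "xcr_port.o", "xcr_logger.o"]):
        print(cmd)
```

Printing the plan first makes it easy to check that every backup comes
before the matching overwrite. The .h and .c files would follow the same
backup-then-copy pattern in their own directories.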

Checksums (produced by the "sum" command) are listed below.
(Notice - The sum for the README file may not match.)

README for Specific patches.

===============================================================================

/usr/sys/include/io/dec/eisa/xcr_port.h subset OSFBINCOM320
CHECKSUM: 09687 14 RCS: 1.1.20.2 (xcr_port.h)
/usr/sys/include/io/dec/eisa/re.h subset OSFBINCOM320
CHECKSUM: 28791 13 RCS: 1.1.16.2 (re.h)
/usr/sys/BINARY/re_driver.o subset OSFHWBIN320
CHECKSUM: 25716 52 RCS: 1.1.17.3 (re_driver.c)
/usr/sys/BINARY/xcr_port.o subset OSFHWBIN320
CHECKSUM: 52715 67 RCS: 1.1.26.4 (xcr_port.c)
/usr/sys/BINARY/xcr_logger.o subset OSFHWBIN320
CHECKSUM: 03933 20 RCS: 1.1.4.4 (xcr_logger.c)
---------------------
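
A rough sketch of checking installed files against the checksum list
above. It parses the two-line manifest entries and compares them with one
line of "sum" output. The assumption (not stated in the README) is that
"sum" prints the checksum first and the block count second; the function
names are made up for illustration.

```python
import re

# Two entries copied from the README list above.
MANIFEST = """\
/usr/sys/include/io/dec/eisa/xcr_port.h subset OSFBINCOM320
CHECKSUM: 09687 14 RCS: 1.1.20.2 (xcr_port.h)
/usr/sys/BINARY/re_driver.o subset OSFHWBIN320
CHECKSUM: 25716 52 RCS: 1.1.17.3 (re_driver.c)
"""


def parse_manifest(text):
    """Map each file path to its (checksum, blocks) pair from the README."""
    entries, path = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("/"):
            path = line.split()[0]          # first field is the file path
            continue
        m = re.match(r"CHECKSUM:\s+(\d+)\s+(\d+)", line)
        if m and path:
            entries[path] = (m.group(1), int(m.group(2)))
            path = None
    return entries


def matches(entries, path, sum_output):
    """Compare one line of `sum` output (assumed 'checksum blocks ...')."""
    fields = sum_output.split()
    want_sum, want_blocks = entries[path]
    # Compare numerically so a dropped leading zero does not cause a mismatch.
    return int(fields[0]) == int(want_sum) and int(fields[1]) == want_blocks
```

For example, matches(entries, "/usr/sys/BINARY/re_driver.o", line) would
be fed the line printed by running "sum /usr/sys/BINARY/re_driver.o" on
the patched system.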

PATCH ID: OSF320-228 Subset(s): OSFBINCOM320,OSFHWBIN320
Supersede OSF320-137, OSF320-193

Installation Instructions: A kernel rebuild is required.

PROBLEM: (QAR 29902) (Patch ID: OSF320-137)
********
Installation to a disk connected by a PCI Raid disk controller
can fail with the following errors:

The following is the output of the session on the AlphaServer 2000
after selecting an advanced installation to re0
(device 0 on PCI, controller 0):
                                 .
                                 .
          Use SWXCR, re0, for your system disk? [] y
                                 .
                                 .
          Select the file system type for the root file system
          (advfs/ufs):[ufs]advfs

          Initializing the System Disk SWXCR, re0.....
          XCR_logger: XCR_ERROR packet
          XCR_Logger: cntrl 0 unit 0
          re_complete
          I/O failed
          Hard Error Detected
          ACTIVE XCR_COM at time of error
          ACTIVE CONTROLLER working set at time of error
          [repeated several times and then hangs indefinitely]

The following is the output of the session on the AlphaServer 1000
after selecting an advanced installation to re0
(device 0 on EISA, controller 0):
                                 .
                                 .
          Use SWXCR, re0, for your system disk? [] y
                                 .
                                 .
          Select the file system type for the root file system
          (advfs/ufs):[ufs]advfs

          Initializing the System Disk SWXCR, re0.....
          ERROR:
          The installation procedure is unable to initialize the
          system disk label

          disk label diagnostics:
          disklabel: ioctl DIOCWDINFO: no disk label on disk.
          use "disklabel -wr" to install initial label

PROBLEM: (QAR 30337) (Patch ID: OSF320-137)
********
Astro errors have been seen under V3.2 and V3.2B. The errors may be followed
by a panic. The database which is being built is always corrupted by the error.

The Astro errors look like this
...........................................................................

EVENT CLASS ERROR EVENT
OS EVENT TYPE 198. ASTRO CONTROLLER
ROUTINE NAME xcrintr
----- CAM STRING -----
                                         No SLOT_CMD_ACTIVE bit set
...........................................................................

EVENT CLASS ERROR EVENT
OS EVENT TYPE 198. ASTRO CONTROLLER
ROUTINE NAME xcr_cmd_timeout
----- CAM STRING -----
                                         Command has timed out

PROBLEM: (DMO100161, QAR39130) (Patch ID: OSF320-193)
********
System panics with "xcr_que_insert list corruption"

Stack trace:

> 0 stop_secondary_cpu() ["../../../../src/kernel/arch/alpha/cpu.c":352, 0xfffffc00004d9598]
[...]
   14 _Xsyscall(0x8, 0x3ff800d5e08, 0x1400099e0, 0x3, 0xc0207605) ["../../../../src/kernel/arch/alpha/locore.s":1086, 0xfffffc00004dca44]

PROBLEM: (QAR 41490) (Patch ID: OSF320-228)
********
In some adverse situations the SWXCR controller may hang.

There is a SWXCR Driver timing issue where a rebuild in process can
cause the controller to delay I/O completion and timeout commands.
The driver will start to reset the controller and eventually the
controller will hang. This is evident by the fact that I/Os issued
to that SWXCR controller do not complete.

------- Message 6

What RAID configuration are you running the disks under: 0, 0+1, 3, 5,
JBoD?

From the description of the disks as being labeled 'rz28', it sounds
like you are set up for JBoD (Just a Bunch of Disks), which is a method
of just hooking a bunch of disks onto the RAID controllers and trying
to use them as normal disks. I've never tried it, so I can't really
offer any advice if this is your configuration.

> i've seen reference in the archives to the following:
>
> - you need to low-level format the drives before putting them under
> swxcr control

The swxcrmgr (or srlmgr) software should take care of this for you;
the biggest problem is that you have to halt the system and run it from
the console. See my comments on the standalone config utility below....

> - the rz28's need to have certain firmware revs before being put under
> swxcr control

Check with DEC if you can. For most of the rzXX disks, firmware is now
considered customer installable and if you are on a service contract
they will make the images available for DL. Make sure they are aware
that the disks will be used with a hardware RAID controller. I was never
clear if this was a DEC firmware problem or a Seagate (used as 3rd party
disks), but there were rumors of certain firmware versions doing an
'energy star' spin-down after a period of inactivity. The swxcr would
see the disk had stopped spinning and fail it. (Again this is just a
rumor as far as I can verify.)

One important thing is to try to make sure that all of the drives are
at the same firmware level.

> - you need to disklabel as 're' or 'SWXCR', not as rz28

The 're' disks are variable geometry disktabs, used for arrays whose
size is computed rather than fixed. Just take a quick look in your
/etc/disktab to get an idea of how it differs from normal disktab
entries.
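
As a rough aid for that comparison, here is a small sketch that parses
termcap-style disktab text into a dictionary so two entries can be
compared field by field. The sample entry is made up for illustration and
is not copied from a real /etc/disktab; which entry names exist on a
given system is not assumed here.

```python
import re

# Made-up entry for illustration only; not from a real /etc/disktab.
SAMPLE = r"""
# comment lines and blank lines are skipped
fixeddisk|Example fixed-geometry disk:\
    :ty=winchester:ns#99:nt#16:nc#1991:
"""


def parse_disktab(text):
    """Return {entry name or alias: {capability: value}} for disktab text."""
    logical, buf = [], ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            buf += line[:-1]                # continuation: keep accumulating
        else:
            logical.append(buf + line)
            buf = ""
    entries = {}
    for entry in logical:
        fields = [f.strip() for f in entry.split(":") if f.strip()]
        caps = {}
        for field in fields[1:]:
            m = re.match(r"([^#=]+)([#=])(.*)", field)
            if m:
                caps[m.group(1)] = m.group(3)   # numeric (#) or string (=), kept as text
            else:
                caps[field] = True              # bare flag
        for name in fields[0].split("|"):
            entries[name] = caps
    return entries
```

On a real system the input would come from open("/etc/disktab").read(),
and the interesting part is which fields differ between the fixed-size
entry and the one used for the array.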

> i suspect these things just have a bunch of bad blocks or are
> otherwise defective, but if anyone can

I don't really know how well you know the swxcr stuff, so forgive
me if I'm being a little remedial.

In order to do anything more than parity checking and rebuilding
drives (after replacing the failed drive), you are going to have
to take the system down and use one of the stand-alone configuration
utilities from the console. The latest version of these utilities that
I am aware of is 3.11 and you need to use that to be compatible with
the later versions of DU 3.2. The utilities should have come on
a floppy disk (along with instructions).

If you can't find the floppy, all is not lost: the later versions of
the Firmware CDs have the utilities.

I should say something about which utility to use: if you have a
graphics head and a PC style keyboard, use the swxcrmgr.exe utility.
If you are using a serial terminal instead of a graphics head, connect a
PC or Mac to the serial line (via a null modem and your choice of
terminal emulation software), and use the srlmgr.exe utility.

The swxcrmgr software expects a graphic head and runs ~200-1,000%
slower on a serial head. The srlmgr utility was designed for
use with serial heads and works well in this mode, but still expects
a PC style keyboard.

So, now you are thinking to yourself: OK, I have the firmware CD
(we'll use the v3.5 CD as an example), but how do I run a program
off of it from the SRM console? Here is how we run srlmgr on our
8400 (which has no graphics head):

1. Halt the system down to the SRM Console (">>>" prompt).

2. Do a "sho dev" to figure out the device name of your
    CD-ROM drive, i.e. DKd500.

3. Load support for iso9660 file systems (if your SRM console
    can't do this, you are going to have to make a floppy).

>>> load -f iso9660

4. List the directory, to make sure you are in the right place....

>>> ls iso9660:[CDROM.UTILITY]/DKd500

>>> ls iso9660:[CDROM.UTILITY.SWXCRMGR]/DKd500

    . . .

    ad nauseam....

5. Run the srlmgr. (The -p1 tells it to look for the controllers
    on the first PCI bus.)

>>> run iso9660:[CDROM.UTILITY.SWXCRMGR]SRLMGR.EXE -d DKd500 -p1

NOTE: The drive name in the form "DKd500" IS case sensitive and it
will cause commands to fail if you mess up the case on the drive
name.

If you need to make floppies, you can copy the utilities to a FAT
formatted disk using mtools or a PC with a CDROM drive.

Hope this helps,

------- Message 7

Wow, I was just about to write a message to the mailing-list about the same
problem! I have had the same problem with _one_ disk out of five (rz29b-vw,
all unlabelled) that we have installed identically in a SWXCR of a 2100
4/275. The fact that this happens with only one disk, even in different
slots of the RAID controller, leads me to believe it is a problem with the
disk itself being faulty.

I found that, like you said, if I only re-optimized the disk with swxcrmgr
after a "failure", it would be failed again soon thereafter. Instead, if I
optimize, unmount the disk, run fsck, and remount it, it seems to work
fine...for a while. I have had the problem three times with this disk in
about six months. I want to have the disk replaced, especially before its
warranty runs out, but the first time the Digital technician came, switched
the disk to another slot in the RAID controller, and said, "the disk is
fine, no reason to replace it." So if I can be sure it is a problem with
the disk itself, I can push to get a replacement.

I assume some error develops on the disk, the SWXCR reaches a limit in the
number of errors allowed from the disk, and fails it (see section 5.2 of
the "StorageWorks RAID Array 200 Online Management Utility for Digital
Unix"). I now have swxcrmon running so I'll see the error messages next
time it fails, but that could be a couple of months!

Anyway, sorry I can't be of much help other than to say you are not alone!
Please let us know what you find out. Thanks.

------- Message 8

> - you need to low-level format the drives before putting them under
> swxcr control

Couldn't hurt.

> - the rz28's need to have certain firmware revs before being put under
> swxcr control

I could probably be more helpful if you could go into the SWXCR utility
and get the drive model and firmware rev. from the drive information
screen.

> - you need to disklabel as 're' or 'SWXCR', not as rz28

This might mess up your filesystem, but it would not make the drives
go offline.

> i'd appreciate it greatly. things were more or less fine until this
> last batch of disks went in, which suggests that either we got a bunch
> of defective disks (of varying model) or the last sysadmin forgot a
> step.

I'm assuming that they are all set up as individual disks (JBOD).
FYI -- there is also a utility which runs under Digital UNIX which will
allow you to do some monitoring of the swxcr and mark the drives
optimal. It comes on one of the floppies in the white StorageWorks
RAID Array 230 subsystem software box. If you don't have that software
kit, the part number is qb-2xhah-sb. It's under $200 and contains
all the manuals, software, etc.

Good luck,

------- Message 9

I have posted my detailed experience with RZ29s in SWXCR that had bad
firmware (older than 0014); see the archives. Basically, they would
time out. The RZ29s have some kind of power saving option that would
cause timeouts. The solution was to upgrade the firmware. It fixed
the problem.

I haven't used RZ28s in the SWXCR so don't know if similar problems
exist.

I have been told to make sure that all drives have the same level of
firmware - don't mix levels.

I would contact DEC to determine the current latest version of
firmware for the RZ28s you are using and then check the drives to make
sure they are all up to date.

You can use scu to read the firmware level, but not in the swxcr
array. As far as I know you have to connect them to one of the system
scsi controllers to use scu. I had a storage works pizza box on my
external scsi connection that I used.
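
Once the revisions have been read off each drive, the "don't mix levels"
check can be sketched as below. The drive labels and revision strings are
placeholders invented for illustration; nothing here talks to scu or to
the controller, and the revisions are assumed to have been noted by hand.

```python
from collections import defaultdict


def group_by_firmware(revisions):
    """Group drive labels by firmware revision, ignoring case and padding."""
    groups = defaultdict(list)
    for drive, rev in revisions.items():
        groups[rev.strip().upper()].append(drive)
    return dict(groups)


def is_mixed(revisions):
    """True when the drives do not all report the same firmware revision."""
    return len(group_by_firmware(revisions)) > 1


if __name__ == "__main__":
    # Placeholder data: labels and revisions are hypothetical.
    noted = {"ch0-id0": "REV_A", "ch0-id1": "REV_A", "ch1-id0": "REV_B"}
    for rev, drives in group_by_firmware(noted).items():
        print(rev, ", ".join(drives))
    print("mixed firmware levels" if is_mixed(noted) else "all drives match")
```

Revisions are only grouped, not ordered, since it is not obvious that
these revision strings sort in release order.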

------- Message 10

> - you need to low-level format the drives before putting them under
> swxcr control
> - the rz28's need to have certain firmware revs before being put under
> swxcr control
> - you need to disklabel as 're' or 'SWXCR', not as rz28

There is indeed a certain level and/or type of rz28 needed to work with
the swxcr controller... I can't find my doc on it (I know I have it
around here somewhere) but DEC ought to be able to tell you what the
minimum version of rz28 is that works with the swxcr controller.

> - refer me to the documentation that says the things above
> - provide some good war stories that might provide clues

I have (somewhere) a doc on what level of rz28 you need, but other than
that, all I can say is I have a bunch of rz28 and rz29 drives (about 34
in total, over half rz28 drives) on two 2100's (a 4/200 and a 4/275)
both with swxcr controllers, and have never had a problem except for
two disk failures (both within a week of installation, one on each machine --
since we installed them 16 at a time, I don't see 2 bad out of 32 drives,
both failing within a week after installation, and no problems since then,
as a big problem).

------- End of Forwarded Messages
Received on Tue Dec 03 1996 - 23:03:11 NZDT
