well, i said i'd report back when i'd had a chance to muck with this
some more...
i brought the 2100/A500MP (aka 4/200) down over winter break and
checked out the firmware. the "old" (read: working) disks were all
rz28's at f/w rev 441C, and the two swxcr controllers were at f/w rev
1.99 (!). the "new" disks (added later) were a combination of rz28's
(at f/w revs 442D and T436) and rz28m's (at f/w rev 568). the "right
answer" was obviously to update all of the f/w, but that option gave
me the willies since i didn't have enough cold spares to recover
smoothly if a large number of disks failed after update. instead, i
moved things around so that the "new" disks were on a regular scsi
controller and the swxcr-controlled racks contained identical disks
(all at 441C). this configuration appears to be stable.
so: the advice about using identical disks at identical f/w levels
appears to be very sound. (at least, if you have old swxcr f/w; we'll
update f/w when we do the next operating system upgrade.)
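a minimal sketch of how one might check a hand-collected inventory for mixed models or firmware before loading a swxcr shelf. the input format (one "slot model rev" line per disk) and the function name check_uniform are made up for illustration; the revs themselves still have to be gathered by hand, e.g. with scu on a regular scsi bus.

```shell
#!/bin/sh
# check_uniform: hypothetical helper, not part of any DEC tool.
# reads lines of the form "slot model fwrev" on stdin, prints a count
# for each model/firmware combination, and returns non-zero if more
# than one combination is present.
check_uniform() {
    awk '
        # skip comment lines and anything without three fields
        NF >= 3 && $1 !~ /^#/ { seen[$2 " " $3]++ }
        END {
            n = 0
            for (k in seen) { n++; printf "%d x %s\n", seen[k], k }
            if (n == 0) { print "EMPTY"; exit 2 }
            if (n > 1)  { print "MIXED"; exit 1 }
            print "UNIFORM"
        }'
}
```

usage would be something like `check_uniform < inventory.txt`, where inventory.txt is whatever file the revs were typed into.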
the following things do not appear to have been factors, or didn't
work:
1. i don't think logical disk initialization was an issue for us,
since we have only been using JBOD; our last sysadmin never
initialized the disks he installed, and uninitialized disks at the
right f/w level have been problem-free (aside from the constant
complaints from swxcrmgr). mind you, *i* have been initializing new
disks. i'm only mentioning this because it might help someone else
who is troubleshooting.
2. the other thing i tried (as a quick hack, to check out the spindown
theory) was a script that touched each swxcr-controlled file system
once every X seconds and then sync'd. X=5 might have been working --
at least, the swxcr didn't fail any disks for the week or so while i
was using that setting. X=60 definitely didn't work.
3. when i moved the "new" drives to a standard scsi bus, i did a media
verification using scu and didn't find any bad blocks. my original
hypothesis (just plain bad disks) doesn't hold up.
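for anyone who wants to repeat the experiment in item 2, the quick hack can be sketched roughly as below. this is a reconstruction of the idea, not the original script: the function names and the ".keepalive" marker file are assumptions, and the mount points are whatever file systems sit on the swxcr.

```shell
#!/bin/sh
# keepalive_once: touch a marker file on each listed mount point, then
# sync so the updates are pushed out to the disks.
# ".keepalive" is an arbitrary name chosen for this sketch.
keepalive_once() {
    for fs in "$@"; do
        touch "$fs/.keepalive" || echo "touch failed on $fs" >&2
    done
    sync
}

# keepalive_loop: repeat forever, sleeping $1 seconds between passes.
# the remaining arguments are the mount points to touch.
keepalive_loop() {
    interval=$1
    shift
    while :; do
        keepalive_once "$@"
        sleep "$interval"
    done
}
```

something like `keepalive_loop 5 /mnt/a /mnt/b &` (hypothetical mount points) would correspond to the X=5 setting. with JBOD and one file system per disk, each pass should reach every disk; with striped or RAID sets that is not guaranteed.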
thanks again to:
Todd Acheson <acheson_at_oak.cats.ohiou.edu>
Javier Aida <jaida_at_gmd.com.pe>
tra_at_ucolick.org (Ted Asocks)
"Edward C. Bailey" <ed_at_pigdog.niehs.nih.gov>
"Dave Golden" <golden_at_falcon.invincible.com>
"William H. Magill" <magill_at_isc.upenn.edu>
Kevin Reardon <kreardon_at_cerere.na.astro.it>
Eric.Rostetter_at_utoledo.edu
sanghvi_at_proto.wilm.ge.com (arun sanghvi)
Tom Webster <webster_at_europa.mdc.com>
------- Forwarded Messages
From: aoki_at_CS.Berkeley.EDU (Paul M. Aoki)
To: alpha-osf-managers_at_ornl.gov
Subject: swcxr rz28s tripping offline
Date: Mon, 02 Dec 96 14:11:15 -0800
i recently inherited a 2100/500MP and have been tangling with the
%$#@ swxcr ever since. here's the story:
the last sysadmin popped a bunch of new (in the sense that they are
replacements provided by dec in sealed plastic bags --
remanufactured?) rz28's and rz28m's into the machine, marked them as
optimal, disklabelled them as 'rz28' (or left them unlabelled) and
then newfs'd them. the disks run for a few days, then trip offline
suddenly. sometimes there are references to bad block reads (e.g.,
during dumps). recovery requires use of swxcrmgr to reset them as
'optimal', but this doesn't last very long, which is annoying.
i've seen reference in the archives to the following:
- you need to low-level format the drives before putting them under
swxcr control
- the rz28's need to have certain firmware revs before being put under
swxcr control
- you need to disklabel as 're' or 'SWXCR', not as rz28
i suspect these things just have a bunch of bad blocks or are
otherwise defective, but if anyone can:
- refer me to the documentation that says the things above
- provide some good war stories that might provide clues
i'd appreciate it greatly. things were more or less fine until this
last batch of disks went in, which suggests that either we got a bunch
of defective disks (of varying model) or the last sysadmin forgot a
step.
--
Paul M. Aoki | University of California at Berkeley
aoki_at_CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776
| Berkeley, CA 94720-1776
------- Message 2
If the drives attached to the swxcr are set up with any RAID level (ie,
not set up as JBOD), you most definitely need to use swxcrmgr (or srlmgr)
to initialize the drives. I am not sure if you have to have swxcrmgr
initialize JBOD drives, but it wouldn't surprise me...
------- Message 3
I can't be explicit with documentation, but with the experience i've
had with around 30 installations involving SWXCR, i can give you some
clues:
- An old firmware version on the controller (if PCI, you'll need
2.36, if EISA, you'll need 2.16) will cause very, very weird
problems.
- When you created the units (RAID 0, 1, 5) or JBOD on SWXCRMGR, you
needed to init them as indicated in the SWXCR manual. A failure in
doing this may cause you unexpected problems while accessing disks,
even after working with them for some time.
- If you have disklabeled the disks as RZ28, that is wrong; you'll need
to label them as "swxcr". For example:
# disklabel -rw re2 swxcr
Hope this helps
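
A small sketch of checking a controller firmware rev against the minimums quoted in this message (2.36 for PCI, 2.16 for EISA). The function name swxcr_fw_ok and the plain major.minor comparison are assumptions for illustration; the rev itself has to be read off the controller by hand, and is expected to look like "N.NN".

```shell
#!/bin/sh
# swxcr_fw_ok BUS REV: hypothetical helper.
# BUS is "pci" or "eisa"; REV is a firmware rev such as 1.99.
# Returns 0 if REV is at or above the minimum quoted above, 1 if it is
# older, 2 if the bus type is not recognized.
swxcr_fw_ok() {
    bus=$1
    rev=$2
    case $bus in
        pci)  min=2.36 ;;
        eisa) min=2.16 ;;
        *)    echo "unknown bus type: $bus" >&2; return 2 ;;
    esac
    # split "major.minor" and compare the two parts as integers
    maj=${rev%%.*};  mnr=${rev#*.}
    mmaj=${min%%.*}; mmnr=${min#*.}
    if [ "$maj" -gt "$mmaj" ] ||
       { [ "$maj" -eq "$mmaj" ] && [ "$mnr" -ge "$mmnr" ]; }; then
        echo "$bus rev $rev: ok (minimum $min)"
    else
        echo "$bus rev $rev: older than $min"
        return 1
    fi
}
```

For instance, `swxcr_fw_ok pci 1.99` would flag a PCI controller at rev 1.99 as older than the suggested minimum.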
------- Message 4
I also ran into this problem. I managed to get a firmware
upgrade floppy from my FE but have not been successful in
upgrading the firmware.
------- Message 5
I have a similar setup, except my OS is on the re device. Our SWXCR
problems resulted in system hangs and crashes. Finally dug up a
patch that solved the problem (on OSF/1 3.2). If you don't have
this patch installed and are at 3.2 I'd recommend it. Hope this
helps.
PATCH ID: OSF320-228 Subset(s): OSFBINCOM320,OSFHWBIN320
Supersede OSF320-137, OSF320-193
------- Message 6
What RAID configuration are you running the disks under: 0, 0+1, 3, 5,
JBoD?
From the description of the disks as being labeled 'rz28', it sounds
like you are set up for JBoD (Just a Bunch of Disks), which is a method
of just hooking a bunch of disks onto the RAID controllers and trying
to use them as normal disks. I've never tried it, so I can't really
offer any advice if this is your configuration.
> i've seen reference in the archives to the following:
>
> - you need to low-level format the drives before putting them under
> swxcr control
The swxcrmgr (or srlmgr) software should take care of this for you,
the biggest problem is that you have to halt the system and run it from
the console. See my comments on the standalone config utility below....
> - the rz28's need to have certain firmware revs before being put under
> swxcr control
Check with DEC if you can. For most of the rzXX disks, firmware is now
considered customer installable and if you are on a service contract
they will make the images available for DL. Make sure they are aware
that the disks will be used with a hardware RAID controller. I was never
clear if this was a DEC firmware problem or a Seagate (used as 3rd party
disks), but there were rumors of certain firmware versions doing an
'energy star' spin-down after a period of inactivity. The swxcr would
see the disk had stopped spinning and fail it. (Again this is just a
rumor as far as I can verify.)
One important thing is to try to make sure that all of the drives are
at the same firmware level.
> - you need to disklabel as 're' or 'SWXCR', not as rz28
The 're' disks are variable geometry disktabs, used for arrays whose
size is computed rather than fixed. Just take a quick look in your
/etc/disktab to get an idea of how it differs from normal disktab
entries.
> i suspect these things just have a bunch of bad blocks or are
> otherwise defective, but if anyone can
I don't really know how well you know the swxcr stuff, so forgive
me if I'm being a little remedial.
In order to do anything more than parity checking and rebuilding
drives (after replacing the failed drive), you are going to have
to take the system down and use one of the stand-alone configuration
utilities from the console. The latest version of these utilites that
I am aware of is 3.11 and you need to use that to be compatable with
the later versions of DU 3.2. The utilites should have come on
a floppy disk (along with instructions).
If you can't find the floppy, all is not lost, the later versions of
of the Firmware CD's have the utilites.
I should say something about which utility to use: if you have a
graphics head and a PC style keyboard, use the swxcrmgr.exe utility.
If you are using a serial terminal instead of a graphics head: connect a
PC or Mac to the serial line (via a null modem and your choice of
terminal emulation software), and use the srlmgr.exe utility.
The swxcrmgr software expects a graphic head and runs ~200-1,000%
slower on a serial head. The srlmgr utility was designed for
use with serial heads and works well in this mode, but still expects
a PC style keyboard.
So, now you are thinking to yourself: OK, I have the firmware CD
(we'll use the v3.5 CD as an example), but how do I run a program
off of it from the SRM console? Here is how we run srlmgr on our
8400 (which has no graphics head):
1. Halt the system down to the SRM Console (">>>" prompt).
2. Do a "sho dev" to figure out the device name of your
CD-ROM drive, i.e. DKd500.
3. Load support for iso9660 file systems (if your SRM console
can't do this, you are going to have to make a floppy).
>>> load -f iso9660
4. List the directory, to make sure you are in the right place....
>>> ls iso9660:[CDROM.UTILITY]/DKd500
>>> ls iso9660:[CDROM.UTILITY.SWXCRMGR]/DKd500
. . .
ad nauseam....
5. Run the srlmgr. (The -p1 tells it to look for the controllers
on the first PCI bus.)
>>> run iso9660:[CDROM.UTILITY.SWXCRMGR]SRLMGR.EXE -d DKd500 -p1
NOTE: The drive name in the form "DKd500" IS case sensitive and it
will cause commands to fail if you mess up the case on the drive
name.
If you need to make floppies, you can copy the utilities to a FAT
formatted disk using mtools or a PC with a CDROM drive.
Hope this helps,
------- Message 7
Wow, I was just about to write a message to the mailing-list about the same
problem! I have had the same problem with _one_ disk out of five (rz29b-vw,
all unlabelled) that we have installed identically in a SWXCR of a 2100
4/275. The fact that this happens with only one disk, even in different
slots of the RAID controller, leads me to believe it is a problem with the
disk itself being faulty.
I found that, like you said, if I only re-optimized the disk with swxcrmgr
after a "failure", it would be failed again soon thereafter. Instead, if I
optimize, unmount the disk, run fsck, and remount it, it seems to work
fine...for a while. I have had the problem three times with this disk in
about six months. I want to have the disk replaced, especially before its
warranty runs out, but the first time the Digital technician came, switched
the disk to another slot in the RAID controller, and said, "the disk is
fine, no reason to replace it." So if I can be sure it is a problem with
the disk itself, I can push to get a replacement.
I assume some error develops on the disk, the SWXCR reaches a limit in the
number of errors allowed from the disk, and fails it (see section 5.2 of
the "StorageWorks RAID Array 200 Online Management Utility for Digital
Unix"). I now have swxcrmon running so I'll see the error messages next
time it fails, but that could be a couple of months!
Anyway, sorry I can't be of much help other than to say you are not alone!
Please let us know what you find out. Thanks.
------- Message 8
> - you need to low-level format the drives before putting them under
> swxcr control
Couldn't hurt.
> - the rz28's need to have certain firmware revs before being put under
> swxcr control
I could probably be more helpful if you could go into the SWXCR utility
and get the drive model and firmware rev. from the drive information
screen.
> - you need to disklabel as 're' or 'SWXCR', not as rz28
This might mess up your filesystem, but it would not make the drives
go offline.
> i'd appreciate it greatly. things were more or less fine until this
> last batch of disks went in, which suggests that either we got a bunch
> of defective disks (of varying model) or the last sysadmin forgot a
> step.
I'm assuming that they are all set up as individual disks (JBOD).
FYI -- there is also a utility which runs under digital UNIX which will
allow you to do some monitoring of the swxcr and mark the drives
optimal. It comes on one of the floppies in the white StorageWorks
RAID Array 230 subsystem software box. If you don't have that software
kit, the part number is qb-2xhah-sb. It's under $200 and contains
all the manuals, software, etc.
Good luck,
------- Message 9
I have posted my detailed experience with RZ29s in SWXCR that had bad
firmware (older than 0014); see the archives. Basically, they would
time out. The RZ29s have some kind of power saving option that would
cause timeouts. The solution was to upgrade the firmware. It fixed
the problem.
I haven't used RZ28s in the SWXCR so don't know if similar problems
exist.
I have been told to make sure that all drives have the same level of
firmware - don't mix levels.
I would contact DEC to determine the current latest version of
firmware for the RZ28s you are using and then check the drives to make
sure they are all up to date.
You can use scu to read the firmware level, but not in the swxcr
array. As far as I know you have to connect them to one of the system
scsi controllers to use scu. I had a storage works pizza box on my
external scsi connection that I used.
------- Message 10
> - you need to low-level format the drives before putting them under
> swxcr control
> - the rz28's need to have certain firmware revs before being put under
> swxcr control
> - you need to disklabel as 're' or 'SWXCR', not as rz28
There is indeed a certain level and/or type of rz28 needed to work with
the swxcr controller... I can't find my doc on it (I know I have it
around here somewhere) but DEC ought to be able to tell you what the
minimum version of rz28 is that works with the swxcr controller.
> - refer me to the documentation that says the things above
> - provide some good war stories that might provide clues
I have (somewhere) a doc on what level of rz28 you need, but other than
that, all I can say is I have a bunch of rz28 and rz29 drives (about 34
in total, over half rz28 drives) on two 2100's (a 4/200 and a 4/275)
both with swxcr controllers, and have never had a problem except for
two disk failures (both within a week of installation, one on each machine --
since we installed them 16 at a time, I don't see 2 bad out of 32 drives,
both failing within a week after installation, and no problems since then,
as a big problem).
------- Message 11
We just went through hell with this problem.
It IS a Firmware problem with the drives.
The RZ28 and RZ29 drives (Seagates I think) "spin down," when not accessed.
I don't recall the gory details but I think they are 7200 RPM drives that
run hot. In order to use them in SW enclosures they were allowed to spin
down to ?5200? RPM when not accessed in some period of time (very short,
like seconds). When the SWXCR goes to access them, it thinks they are
down, because they are "not ready".
Naturally I don't recall the rev levels, but the drives MUST be at the
highest level - and the ones DEC shipped us in September must have come
from the "back of the shelf" because they were out of rev. The drives are
only a problem in the SWXCR controller.
------- Message 12
> still haven't checked the firmware levels (i don't know of any way to
> do that with the drives still in the internal shelves) but that will
> pop to the top of my list when i take the machine down.
That's one of the reasons I've come to hate the swxcr.
It is really a damn PC part and doesn't integrate at all with DU.
The only thing I can say about it is - when it works, it works.
It's hell when it doesn't because Field Engineering doesn't know
squat about it either and now that Digital sold off Storage Works
it is a real pain in the butt.
By the way, one other aspect of this - I believe that this problem is
specific to VW drives; ie fast wide.
------- Message 13
I wasn't clear on what style RAID you were using.
We use RAID level 5 with "hot spares".
If you use the same you could pull drives from the live machine to
check firmware levels - though this would degrade performance.
DEC is pretty flaky about the firmware issues.
Also, as others have mentioned be sure the SWXCR firmware is up to
date.
------- End of Forwarded Messages
--
Paul M. Aoki | University of California at Berkeley
aoki_at_CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776
| Berkeley, CA 94720-1776
Received on Thu Jan 16 1997 - 01:58:20 NZDT