Update but no resolution yet: 4100 SRM upgrade gives SCSI errors from KZPDA [LONG]

From: John Speakman <speakman_at_biost.mskcc.org>
Date: Tue, 18 May 1999 23:20:34 -0400

WARNING: long convoluted description of problem and plea
for help..

hi everyone,

i posted about this problem some time ago and it is still with us
despite some helpful suggestions from people on this list, a lot
of late nights on my part and pretty much sod all help from digital
despite the fact that this call has been unresolved for over two
months now.

to recap, this is a Production Server cluster of two 4100s. the
disks are all 4.3 GB disks, either RZ29B (gray) or RZ1CB-CS
(blue). most of the disks are on the shared SCSI bus, connected
via KZPSAs, using HSZ50s. as required by Production Server,
each 4100 also has local system disks inside a BA356-RA,
connected via a KZPDA. the local system disks use AdvFS
and are mirrored using LSM. when we started, each mirrorset
consisted of one RZ29B disk and one (newer, allegedly
identical except for color) RZ1CB-CS. currently we are using
unix 4.0B patch kit 9 (although 4.0B patch kit 3 exhibited the
same symptoms).

we are trying to take the cluster from unix 4.0B to 4.0E. as
a preliminary measure it was naturally recommended that
we upgrade the system firmware on the 4100s (currently
version 4.8) to the newest (5.3). but as soon as we tried
upgrading the firmware on the first machine to 5.0 or higher
(from the firmware CD - we tried 5.0, 5.1, 5.2 and 5.3), we
got swamped with SCSI CAM hard errors at bootup. they
came from the local SCSI bus and always from the second
disk in the mirrorset, an RZ1CB-CS, although the first disk
in the mirrorset is an RZ29B. if we pulled the second disk out
and rebooted, it booted OK (as the disk is mirrored) but
then we got SCSI CAM errors from the second disk in the
second mirrorset, also an RZ1CB-CS that is mirroring an
RZ29B, i.e. it's always the *second* disk in the *first* mirrorset
it sees.

the error messages can be seen as soon as the "starting second
CPU" message (not quite sure those are the exact words) appears
on the console, i.e. when you do a single user mode boot
it's the last thing before the single user prompt. you can see
when the disk is about to exhibit the SCSI CAM errors as
the activity light comes on and stays on for a second or so,
then blinks off, then comes on for a second or so, etc. it
is very distinctive and unlike the fast blinking that indicates
regular disk activity. booting up to multi-user mode is
possible when the errors are happening, but the system
is pretty much useless (bootup takes an hour instead of maybe
five minutes).

as soon as we downgrade back to 4.8 firmware, everything
is fine again.

digital, plausibly, decided that the SRM firmware upgrade is
upgrading the KZPDA firmware on the quiet and this seems
to be producing some kind of conflict. our engineer tried
replacing the KZPDA to no avail and was hunting around
for some old RZ29B-VW disks to replace the RZ1CB-CSes
(as we'd surmised it was the combination of RZ29B and
RZ1CB-CS disks in the SAME mirrorset COUPLED with the
upgrade of the firmware beyond 4.8 that was causing the
problems), but we went ahead without him and did it the other way
around, by pulling out the RZ29Bs and using up some
sparesets from the shared SW800 cabinet to make the system
disks all RZ1CB-CSes. this worked fine: we upgraded the
firmware to 5.3 last week and the system has been running
perfectly since. so i presumed my surmising was right, and
i let digital know that we had
found a workaround, no thanks to them, and that we still
expected them to figure out the problem.

tonight, however, more trouble. we did the same on the
other 4100 system; i.e. hauled it down, pulled out the
RZ29Bs and replaced them with RZ1CB-CSes, rebuilt the
system mirrors then applied the firmware upgrade, only to
find that the SCSI CAM errors are still with us.

so here we have an AlphaServer 4100 running 4.0B
PK9, using LSM mirroring on identical RZ1CB-CS
system disks, and we can't upgrade the firmware
above 4.8, even though making the system disks all
RZ1CB-CS worked on the other system. again it's always
the second, fourth or sixth disk (this system has three
mirrorsets instead of two on the KZPDA). FYI it doesn't
matter which disk you boot off (using set bootdef_dev
from the console prompt), the errors always come from the
second disk in the mirror.

so we too are stumped with this one. we thought we had
a workaround but were wrong. help! any ideas?

there were a couple of interesting suggestions from the list,
most notably from rob shurtleff who said we should try and
get digital to upgrade the backplane of each 4100; i floated
this with digital engineering and they kinda brushed it off;
i'm re-floating it a bit more forcefully now. brian parkhurst
suggested upgrading the firmware of the rz1cb-cses but
digital said any upgrades to this firmware were "cosmetic".

the initial post of mine caused me to post a couple of
sub-questions:

(1) how to replace LSM mirrored system disks - a lot of
people replied to this with very complete instructions
which i will summarize very soon i promise; we are getting
quite expert at this now =)

(2) whether we could/should do hardware mirroring of the
system disks instead of using LSM; we are going to
take this route anyway after being bitten several times by
the LSM system disk mirroring issue, even though we
still need LSM on our non-system disks for other things.
thanks to everyone who replied to this especially bill
david and viktor holmberg who pointed me to KZPAC
controllers which are exactly what we need and are buying.

thanks for reading this HUGE message =)
john
Received on Wed May 19 1999 - 03:22:58 NZST
