Hi! this is a summary from a long time ago - we had the
problem for three and a half months and I have to say,
although I'm somewhat ashamed I didn't chase up the
obvious possibilities myself, that Digital's response and
attitude were pitiful. Briefly (long detailed original messages
below) after we upgraded the 4100s' firmware from 4.8
to 5.3 (or indeed any rev level above 4.8) a ton of SCSI
CAM errors poured from the system disks at bootup
so as to make the systems unusable. After many peculiar
red herrings (one system could be made to work by making
all the disks of the same type, etc) we found that it was
SCSI termination! DOH! specifically, the "personality
cards" that slot into the disk shelves and talk to the KZPDA
disk controllers have two sockets: one on the front that
does not enable shelf termination, and one on the top (sort of)
that does enable it. moving the cable from the former to the latter
fixed it. so there you go. i guess the moral is to start simple,
not to trust the engineers if they can't find a problem with
the configuration, and not to assume that just because a
configuration has been working for years it is correct...
thanks to everyone too numerous to mention for their
suggestions.
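A minimal Python sketch of the termination rule behind this fix: a SCSI bus must be terminated at both physical ends and nowhere in between. The names `BusPosition` and `termination_problems` are hypothetical, the two-position bus is a simplification, and the assumption that the KZPDA end was itself terminated is not stated in the messages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BusPosition:
    """One physical position along a SCSI cable run (hypothetical model)."""
    name: str
    terminated: bool


def termination_problems(bus: List[BusPosition]) -> List[str]:
    """Return a list of problems; an empty list means the bus looks sane."""
    if len(bus) < 2:
        return ["a bus needs at least two positions (two ends)"]
    problems = []
    # Both physical ends of the bus must be terminated.
    for end in (bus[0], bus[-1]):
        if not end.terminated:
            problems.append(f"end of bus not terminated: {end.name}")
    # Nothing in the middle of the bus may be terminated.
    for pos in bus[1:-1]:
        if pos.terminated:
            problems.append(f"terminated in mid-bus: {pos.name}")
    return problems


# Assumed layout: controller at one end, disk shelf at the other.
# The front socket leaves shelf termination off; the top socket enables it.
front_socket = [
    BusPosition("KZPDA", terminated=True),  # assumption, not from the post
    BusPosition("shelf via front socket", terminated=False),
]
top_socket = [
    BusPosition("KZPDA", terminated=True),  # assumption, not from the post
    BusPosition("shelf via top socket", terminated=True),
]

print(termination_problems(front_socket))
print(termination_problems(top_socket))
```

The check only models where termination is enabled; it says nothing about why the older firmware tolerated the bad cabling.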
john speakman, mskcc, nyc

(original messages follow)

hi everybody,
i wonder if any of you have any ideas about this one? compaq have
been struggling with it for nearly three weeks and we are getting a
bit desperate.
this is a Production Server cluster of two 4100s; most of the
disks are on the shared SCSI bus, connected via KZPSAs and
HSZ50s. As required by Production Server, each 4100 also
has local system disks connected via KZPDAs. The local
system disks use AdvFS and are mirrored using LSM.
Each mirrorset (there are two mirrorsets, one for / and one
for /usr) consists of one RZ29B disk and one (newer,
allegedly identical except for it being blue instead of gray)
RZ1CB-CS (this is significant, bear with me).
we are trying to take the cluster from 4.0B to 4.0E. as a preliminary
they say we should upgrade the system firmware on the 4100s (it's
currently version 4.8, the newest is 5.3). but as soon as the firmware
is upgraded from the CD, we get swamped with SCSI CAM hard
errors at bootup. They come from the RZ1CB-CS disk that's half
of the / mirrorset. it takes an hour (instead of maybe five minutes)
to boot up to multi-user because there are so many SCSI errors.
if we put firmware 4.8 back in, everything's fine again.
if we yank out the RZ1CB-CS that gives the errors, the same errors
start coming from the other RZ1CB-CS (the one that's part of /usr)
instead. what you see on the disk is the light flashing on and off real
slowly, totally unlike the fast flashing that indicates disk access.
compaq said the SRM firmware upgrade also upgrades the KZPDA
firmware on the quiet and it is this that is producing some kind of
conflict.
firmware 5.0, 5.1 and 5.2 have the same effect as 5.3. compaq
have tried replacing the KZPDA which has had no effect at all.
now the engineer is running round trying to find some old RZ29B-VW
disks to slot in instead of the RZ1CB-CSes. compaq engineering
meanwhile are starting to clutch at straws, e.g. sending the engineer
to our site just to check that everything is terminated properly (it is;
furthermore it's been running non-stop just fine for two years).
however we don't want to play too much with this system as it's
kinda critical; also replacing disks that are part of an LSM mirrored
system disk is a bit of a pain. we haven't touched the second alpha
as we can't afford any downtime on it.
we have patched 4.0B up to date (patch kit 9). i think that's all the
pertinent info. anyone seen anything at all like this? any suggestions
would be very welcome. thanks!
john speakman
memorial sloan-kettering cancer center, nyc

WARNING: long convoluted description of problem and plea
for help..

hi everyone,
i posted about this problem some time ago and it is still with us
despite some helpful suggestions from people on this list, a lot
of late nights on my part and pretty much sod all help from digital
despite the fact that this call has been unresolved for over two
months now.
to recap, this is a Production Server cluster of two 4100s. the
disks are all 4.3 GB disks, either RZ29B (gray) or RZ1CB-CS
(blue). most of the disks are on the shared SCSI bus, connected
via KZPSAs, using HSZ50s. as required by Production Server,
each 4100 also has local system disks inside a BA356-RA,
connected via a KZPDA. the local system disks use AdvFS
and are mirrored using LSM. when we started, each mirrorset
consisted of one RZ29B disk and one (newer, allegedly
identical except for color) RZ1CB-CS. currently we are using
unix 4.0B patch kit 9 (although 4.0B patch kit 3 exhibited the
same symptoms).
we are trying to take the cluster from unix 4.0B to 4.0E. as
a preliminary measure it was naturally recommended that
we upgrade the system firmware on the 4100s (currently
version 4.8) to the newest (5.3). but as soon as we tried
upgrading the firmware on the first machine to 5.0 or higher
(from the firmware CD - we tried 5.0, 5.1, 5.2 and 5.3), we
got swamped with SCSI CAM hard errors at bootup. they
came from the local SCSI bus and always from the second
disk in the mirrorset, an RZ1CB-CS, although the first disk
in the mirrorset is an RZ29B. if we pulled the second disk out
and rebooted, it booted OK (as the disk is mirrored) but
then we got SCSI CAM errors from the second disk in the
second mirrorset, also an RZ1CB-CS that is mirroring an
RZ29B, i.e. it's always the *second* disk in the *first* mirrorset
it sees.
the error messages can be seen as soon as the "starting second
CPU" message (not quite sure those are the exact words) appears
on the console, i.e. when you do a single user mode boot
it's the last thing before the single user prompt. you can see
when the disk is about to exhibit the SCSI CAM errors as
the activity light comes on and stays on for a second or so,
then blinks off, then comes on for a second or so, etc. it
is very distinctive and unlike the fast blinking that indicates
regular disk activity. booting up to multi-user mode is
possible when the errors are happening, but the system
is pretty much useless (bootup takes an hour instead of maybe
five minutes).
as soon as we downgrade back to 4.8 firmware, everything
is fine again.
digital, plausibly, decided that the SRM firmware upgrade is
upgrading the KZPDA firmware on the quiet and this seems
to be producing some kind of conflict. our engineer tried
replacing the KZPDA to no avail and was hunting around
for some old RZ29B-VW disks to replace the RZ1CB-CSes
(as we'd surmised it was the combination of RZ29B and
RZ1CB-CS disks in the SAME mirrorset COUPLED with the
upgrade of the firmware beyond 4.8 that was causing the
problems), but we went ahead without him and did it the other way
around, by pulling out the RZ29Bs and using up some
sparesets from the shared SW800 cabinet to make the system
disks all RZ1CB-CSes. this worked fine and we upgraded the
firmware to 5.3 last week and this system has been running
perfectly for a week or so. so i presumed i'd figured it out and
my surmising was right and i let digital know that we had
found a workaround, no thanks to them, and that we still
expected them to figure out the problem.
tonight, however, more trouble. we did the same on the
other 4100 system; i.e. hauled it down, pulled out the
RZ29Bs and replaced them with RZ1CB-CSes, rebuilt the
system mirrors then applied the firmware upgrade, only to
find that the SCSI CAM errors are still with us.
so here we have an AlphaServer 4100 running 4.0B
PK9, using LSM mirroring on identical RZ1CB-CS
system disks, and we can't upgrade the firmware
above 4.8, even though making the system disks all
RZ1CB-CS worked on the other system. again it's always
the second, fourth or sixth disk (this system has three
mirrorsets instead of two on the KZPDA). FYI it doesn't
matter which disk you boot off (using set bootdef_dev
from the console prompt), the errors always come from the
second disk in the mirror.
so we too are stumped with this one. we thought we had
a workaround but were wrong. help! any ideas?
there were a couple of interesting suggestions from the list,
most notably from rob shurtleff who said we should try and
get digital to upgrade the backplane of each 4100; i floated
this with digital engineering and they kinda brushed it off;
i'm re-floating it a bit more forcefully now. brian parkhurst
suggested upgrading the firmware of the rz1cb-cses but
digital said any upgrades to this firmware were "cosmetic".
the initial post of mine caused me to post a couple of
sub-questions:
(1) how to replace LSM mirrored system disks - a lot of
people replied to this with very complete instructions
which i will summarize very soon i promise; we are getting
quite expert at this now =)
(2) whether we could/should do hardware mirroring of the
system disks instead of using LSM; we are going to
take this route anyway after being bitten several times by
the LSM system disk mirroring issue, even though we
still need LSM on our non-system disks for other things.
thanks to everyone who replied to this especially bill
david and viktor holmberg who pointed me to KZPAC
controllers which are exactly what we need and are buying.
thanks for reading this HUGE message =)
john
Received on Tue Jul 27 1999 - 17:56:45 NZST