4100 SRM upgrade gives SCSI errors from KZPDA

From: John Speakman <speakman_at_biost.mskcc.org>
Date: Thu, 29 Apr 1999 14:56:50 -0400

hi everybody,

i wonder if any of you have any ideas about this one? compaq have
been struggling with it for nearly three weeks and we are getting a
bit desperate.

this is a Production Server cluster of two 4100s; most of the
disks are on the shared SCSI bus, connected via KZPSAs and
HSZ50s. As required by Production Server, each 4100 also
has local system disks connected via KZPDAs. The local
system disks use AdvFS and are mirrored using LSM.
Each mirrorset (there are two mirrorsets, one for / and one
for /usr) consists of one RZ29B disk and one (newer,
allegedly identical except for it being blue instead of gray)
RZ1B-CS (this is significant, bear with me).

we are trying to take the cluster from 4.0B to 4.0E. as a preliminary
they say we should upgrade the system firmware on the 4100s (it's
currently version 4.8, the newest is 5.3). but as soon as the firmware
is upgraded from the CD, we get swamped with SCSI CAM hard
errors at bootup. They come from the RZ1B-CS disk that's half
of the / mirrorset. it takes an hour (instead of maybe five minutes)
to boot up to multi-user because there are so many SCSI errors.
if we put firmware 4.8 back in, everything's fine again.

if we yank out the RZ1B-CS that gives the errors, the same errors
start coming from the other RZ1B-CS (the one that's part of /usr)
instead. what you see on the disk is the light flashing on and off real
slowly, totally unlike the fast flashing that indicates disk access.

compaq said the SRM firmware upgrade also upgrades the KZPDA
firmware on the quiet and it is this that is producing some kind of
conflict.

firmware 5.0, 5.1 and 5.2 have the same effect as 5.3. compaq
have tried replacing the KZPDA which has had no effect at all.
now the engineer is running round trying to find some old RZ29B-VW
disks to slot in instead of the RZ1B-CSes. compaq engineering
meanwhile are starting to clutch at straws, e.g. sending the engineer
to our site just to check that everything is terminated properly (it is;
furthermore it's been running non-stop just fine for two years).

however we don't want to play too much with this system as it's
kinda critical; also replacing disks that are part of an LSM mirrored
system disk is a bit of a pain. we haven't touched the second alpha
as we can't afford any downtime on it.

we have patched 4.0B up to date (patch kit 9). i think that's all the
pertinent info. anyone seen anything at all like this? any suggestions
would be very welcome. thanks!

john speakman
memorial sloan-kettering cancer center, nyc
Received on Thu Apr 29 1999 - 19:01:48 NZST
