SUMMARY: DRD register failing against FC disks. from Jim Fitzmaurice on 2001-05-09 (tru64-unix-managers)

From: Jim Fitzmaurice <jpfitz_at_fnal.gov>
Date: Tue, 08 May 2001 09:44:32 -0500

Hello,

    Only one response; thank you, Dr. Blinn. He suspected what I had
suspected, though I was hoping for another answer. Anyway, I disabled tagged
files on member1 and member2 and rebooted, then undid the setup phase on
member3, and the messages stopped.

    That leaves me with a new problem... How do I do a rolling upgrade to
install patch_kit #3?

    I'm thinking that running on the tagged files may have had something to
do with my problem. If this is the case, I may be able to start with member1
as the lead member instead of member3; that way member1 (the node that was
experiencing the problem) would never run on tagged files. I'll give it a
few days before I try out that theory.

    I've included my Original Question and Dr. Blinn's Response below.

----- Original Question -----

Hello managers,

    I've been looking around the Compaq web site for insight into these error
messages but haven't been having much luck. Maybe I'm looking in the wrong
place. First of all, our hardware is a three-member cluster consisting of a
GS80 and two 4100s, all connected via a Memory Channel hub. Also on a shared
SCSI bus is an HSZ50 for system and application disks. Our databases reside
on an HSG80, which is shared via a Fiber Channel hub using loop topology.
Finally, we have several crates of third-party Fiber Channel disks, also
shared via a Fiber Channel hub using loop topology. Software: we are
running Tru64 V5.1 patch_kit #2 on all but member03, which is the lead member
in a rolling upgrade to patch_kit #3, on TruCluster V5.1.

    Now the problem. Our /var/adm/messages file is filling up with these
messages:

May 4 12:16:20 mem01 vmunix: DRD register failed against 157 returned 61.
May 4 12:16:20 mem01 vmunix: DRD register failed against 159 returned 61.

   and occasionally you'll see a:

May 4 12:55:55 mem01 vmunix: DRD register failed against 157 returned 5.

    We've gotten them before, on all three cluster members, but only during
system boot, and never this many. They are occurring on member01, which was
booted first (the GS80). The other two members of the cluster (the 4100s)
got some of these during boot, but only a few, and then they stopped; on
member01 they continue. The 157 and 159 are the HWIDs of two of the
third-party Fiber Channel disks. The rest of the disks show up in the
messages as well, but since we have 40 I didn't see a need to list all the
error messages, or my mail would have been a mile long. The disks are
flashing as these messages occur.
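
    To summarize which disks are affected without reading every line, the
messages can be tallied by HWID and return value. The following is only a
minimal sketch, assuming the log lines keep exactly the format quoted above;
the function name tally_drd_failures and the SAMPLE list are hypothetical
and are not part of any Tru64 tool.

```python
import re
from collections import Counter

# Matches the kernel message format quoted in this post (assumed stable).
DRD_RE = re.compile(
    r"vmunix: DRD register failed against (?P<hwid>\d+) returned (?P<code>\d+)\."
)


def tally_drd_failures(lines):
    """Count DRD register failures per (HWID, return value) pair."""
    counts = Counter()
    for line in lines:
        match = DRD_RE.search(line)
        if match:
            key = (int(match.group("hwid")), int(match.group("code")))
            counts[key] += 1
    return counts


# The three sample lines quoted in this post.
SAMPLE = [
    "May 4 12:16:20 mem01 vmunix: DRD register failed against 157 returned 61.",
    "May 4 12:16:20 mem01 vmunix: DRD register failed against 159 returned 61.",
    "May 4 12:55:55 mem01 vmunix: DRD register failed against 157 returned 5.",
]

if __name__ == "__main__":
    for (hwid, code), n in sorted(tally_drd_failures(SAMPLE).items()):
        print(f"HWID {hwid} returned {code}: {n}")
```

    To run it against the real log, pass an open file such as
/var/adm/messages to tally_drd_failures in place of SAMPLE. Lines that do
not match the pattern are skipped.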

    Furthermore, I can read data from these disks on all members without
problems, and writes occur at normal speeds on some of the disks, but even
the smallest write to some of the disks can take up to a minute to complete.

    Does anybody know what the DRD subsystem might be having trouble with?
Are these error messages documented anywhere? The errors usually stop; why
did they stop on the other members this time, but continue to occur on this
member? Can I stop them, and if so, how?

    Any help would be appreciated.

Jim Fitzmaurice
jpfitz_at_fnal.gov

UNIX is very user friendly, It's just very particular about who it makes
friends with.

--- Dr. Blinn's Response ---

What happens if you undo the roll on the lead member, or finish it onto the
other members? DRD is MESSY stuff, and it's possible that something in
the new patch kit is incompatible with something in the old patch kit; there
is neither enough time nor enough people nor enough equipment to do all of
the testing needed to assure that something subtle like this doesn't wind up
in a weird state during the roll (if they test the rolling upgrade at all,
it's not going to be with your third-party FC disks, which sound like they
are a key part of the scenario, and they aren't going to run the cluster
very long in a partly rolled state, which is where you are, and that's if
they really do any TCS rolls with FC disks at all).

If it were MY cluster, I'd back out of the rolling upgrade and see whether
the symptoms go away; I wouldn't trust completing the roll to leave me with
a working cluster, and undoing the roll can be more tedious the further in
the process you go.

Tom

  Dr. Thomas P. Blinn + UNIX Software Group + Compaq Computer Corporation
   110 Spit Brook Road, MS ZKO3-2/W17 Nashua, New Hampshire 03062-2698
    Technology Partnership Engineering Phone: (603) 884-0646
     Internet: tpb_at_zk3.dec.com - or - thomas.blinn_at_compaq.com
      ACM Member: tpblinn_at_acm.org PC_at_Home: tom_at_felines.mv.net

   Worry kills more people than work because more people worry than work.

       Keep your stick on the ice. -- Steve Smith ("Red Green")

      My favorite palindrome is: Satan, oscillate my metallic sonatas.
                               -- Phil Agre, pagre_at_alpha.oac.ucla.edu

      Yesterday it worked / Today it is not working / UNIX is like that
-- apologies to Margaret Segall

   Opinions expressed herein are my own, and do not necessarily represent
   those of my employer or anyone else, living or dead, real or imagined.
Received on Tue May 08 2001 - 14:43:44 NZST
