Cluster member problem during a clu_upgrade.

From: Jim Fitzmaurice <jpfitz_at_fnal.gov>
Date: Fri, 29 Jun 2001 14:04:17 -0500

Hello,

    System Info: Three-member TruCluster 5.1:
Member 1 = GS80, 4 x 733MHz CPUs, running Tru64 5.1 pk 3
Member 2 = 4100, 2 x 600MHz CPUs, running Tru64 5.1 pk 3
Member 3 = 4100, 2 x 433MHz CPUs, running Tru64 5.1 pk 3

    A while back I did a clu_upgrade. Everything went fine, except that
before I was able to run clu_upgrade clean, Member 3 developed a mysterious
problem and became unstable. It was down for a couple of weeks, but the
fault was eventually diagnosed as an intermittent memory problem and
repaired.

    Now bringing it back up into the cluster causes all sorts of problems.
First, during Member 3's boot, it reported that it was trying to HUP inetd.
The system seemed to go away for about 10 minutes. Getting a little
frustrated, I hit <Ctrl>-<C> and booting continued. At another point it was
trying to mount NFS file systems, which was unnecessary since the other
members had already mounted them, so after letting it sit for about 5
minutes I hit <Ctrl>-<C> again and it continued to boot. There were various
messages during the boot complaining that Member 3's NIS can't bind to the
cluster, and once it was up none of the network cards worked.

    On the other two members, commands like ps -ef take 60-90 seconds to
complete, and ls -l on any of the drives also takes 60-90 seconds, even if
there are only a few files in the directory. Member 3 has no problems
running these commands. I can ping in and out of Member 3 over mc0, but the
three network cards tu0, alt0, and alt1 do not work. ifconfig says they are
up and the routes look fine, but there is no communication with anything on
the network, and I can only talk to the other members through mc0.

    clu_get_info said all the members were up, so I decided to try
clu_upgrade clean from Member 3. It failed, saying that Member 3 was in
single-user mode and that I needed to boot it to multi-user mode to run the
command. clu_upgrade -v gave me the same error. Running either command from
any member of the cluster gave the same error. The first thing I checked
was whether cfgmgr was uncommented in inetd.conf. It wasn't, but I suppose
this didn't matter as the network cards weren't working anyway.

    My plan is to remove Member 3 from the cluster, then run clu_upgrade
clean, install a couple of manual patches to fix some unrelated software
issues, and then re-add Member 3 to the cluster.

    Will this work? Will clu_upgrade clean work if a member is removed? Or
am I destined to be stuck forever with "tagged" files on my cluster, never
able to run another upgrade?

Thanks,

Jim Fitzmaurice
jpfitz_at_fnal.gov

UNIX is very user friendly, it's just very particular about who it makes
friends with.
Received on Fri Jun 29 2001 - 19:04:40 NZST
