Hello,
I have a two-member TruCluster consisting of a GS80 and a 4100. They share
an HSZ50, and we have just attached an HSG80. The problem is with the newly
attached HSG80. The HSZ50 holds all the cluster system filesystems and
directories such as /home. It was used to build the cluster and works
great. Now we are adding the HSG80, which we plan to use for our databases.
This isn't going as well as I'd like.
A possible cause of the problem is what happened when I added the second
member, the 4100, to the cluster. The HSG80 was being used by a different
4100 (one that is not part of the cluster at this time), but it was also
hooked up to the 4100 that I was adding to the cluster, even though that
system wasn't using it. I realize now that I should have disconnected that
4100 from the HSG80, but since I wasn't using the array on that system I'd
forgotten it was attached. When I booted the 4100 into the cluster, it went
out and grabbed all the devices attached to it for the cluster. This caused
some problems on the other 4100, which lost most of its AdvFS when the newly
clustered system brought its connection to the HSG80 online and placed the
other system's connection offline. After a while we were able to correct
that situation by disconnecting the clustered system and bringing the other
system's connection back online.
Now we have migrated all the data off the HSG80 and have attached it to
the cluster systems. Unfortunately it showed devices dsk35c to dsk47c, which
is two more devices than we have on the array. Running disklabel on all the
devices showed dsk35c to dsk39c and dsk42c to dsk47c to be the correct
devices. The devices dsk40c and dsk41c are not valid. So I wanted to remove
all the devices using hwmgr -delete and re-add them using hwmgr -scan, to
get rid of the two devices in the middle and avoid confusion in the future.
I was able to successfully remove all the HSG80 devices EXCEPT for those
two! I get the following error:
# hwmgr -delete component -id 244
hwmgr: Error (95) Cannot start operation.
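For anyone who wants to repeat the disklabel check over the whole device range, here is a minimal sketch of how it could be scripted. The function name find_unlabeled and the LABEL_CMD variable are hypothetical, and the sketch assumes a POSIX shell (for example ksh) and that "disklabel -r" exits non-zero when it cannot read a label from a device. I have not verified either assumption on this cluster.

```shell
# Hypothetical helper, not part of Tru64: print every device name in the
# range dsk<first>c .. dsk<last>c for which the label command fails.
# LABEL_CMD is an assumption; override it if your check differs.
LABEL_CMD=${LABEL_CMD:-"disklabel -r"}

find_unlabeled() {
    first=$1
    last=$2
    n=$first
    while [ "$n" -le "$last" ]; do
        dev="dsk${n}c"
        # A non-zero exit status is treated as "no valid label here".
        if ! $LABEL_CMD "$dev" >/dev/null 2>&1; then
            echo "$dev"
        fi
        n=$((n + 1))
    done
}
```

Calling "find_unlabeled 35 47" would then list only the suspect devices, which in my case should be dsk40c and dsk41c.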
I discovered the devices were inconsistent by running the command:
# hwmgr -show component -inconsisten
 HWID:  HOSTNAME  FLAGS  SERVICE  COMPONENT NAME
-----------------------------------------------
  244:  d0olc     rcd-i  iomap    SCSI-WWID:01000010:6000-1fe1-0008-4b30-0009-0300-5108-0044
  245:  d0olc     rcd-i  iomap    SCSI-WWID:01000010:6000-1fe1-0008-4b30-0009-0300-5108-0045
  252:  d0olc     rcdsi  none     SCSI-WWID:02000008:5000-1fe1-0008-4b30
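To compare this listing before and after a refresh or reboot, the entry lines can be pulled out of saved output. This is only a sketch: list_hwids is a hypothetical helper, and it assumes each entry line begins with a numeric HWID followed by a colon, as in the output above. It does not interpret the flag characters.

```shell
# Hypothetical helper: read a saved hwmgr component listing on stdin and
# print "HWID hostname flags" for each entry line. Header, separator and
# wrapped SCSI-WWID lines are skipped because their first field is not
# a number followed by a colon.
list_hwids() {
    awk '$1 ~ /^[0-9]+:$/ { sub(":", "", $1); print $1, $2, $3 }'
}
```

Saving the listing to a file and running "list_hwids < file" gives one short line per entry, which is easy to diff between runs.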
I assume HWID 252, the controller:
252: /dev/cport/scp3 HSG80CCL bus-4-targ-0-lun-0
is showing up as inconsistent because the two inconsistent devices are
attached to it. (I can't delete that device (252) either; I get the same
error.) The only documentation I found was in the hwmgr man page, which
states:
Note that this command does not fix database inconsistencies; it
only detects inconsistencies. One possible fix may be to reboot
the cluster.
<sarcasm-mode-on>
My, doesn't that sound reassuring... What a definitive answer... Surely
there is no need to direct the reader to another document or resource where
he might find more information....
<sarcasm-mode-off>
(Sorry about that.) Anyway, I tried the hwmgr -refresh command, but that
didn't work either.
A search for the error on the Compaq site returned 22,229 page matches. I
tried a couple of the documents, but searching them turned up no
information on that error. (Nope, sorry, I am NOT going to search all 22,229
matches.)
System Info:
Two-member cluster consisting of a GS80 and a 4100, with an HSZ50 and an
HSG80.
Running Tru64 V5.1 and TruCluster V5.1, patch kit 2.
Has anybody seen this before, or can anyone direct me to some documentation
before I blindly reboot my cluster?
Jim Fitzmaurice
jpfitz_at_fnal.gov
UNIX is very user friendly, It's just very particular about who it makes
friends with.
Received on Mon Mar 19 2001 - 18:16:49 NZST