HP OpenVMS Systems - Ask the Wizard
The Question is:

I have a multi-site FDDI VAXcluster: Site A has 2 VAX 7000s and an SW800 in a CI configuration, and Site B has 7 VAX 7000s and an SW800, also in a CI configuration. All servers use FDDI as the cluster interconnect and Ethernet for network access. Each server contributes 1 vote and quorum is 3. Site A has 2 votes and Site B has 3. I recently had a power outage in which both the FDDI and network switches went down. All servers and storage stayed up except for 1 of the 3 servers at Site B. The entire cluster hung for a while, and later all SW800 volumes were software-disabled. When power was restored and the FDDI and network switches were powered back up, the down server was also powered up but failed to boot completely because it could not mount the disabled volumes. I had to perform a forced shutdown of all the servers and reboot them to bring the cluster back up.

Questions: 1) Will losing both FDDI and network connectivity cause the entire cluster to hang? 2) If Site A had a higher total of votes (say 4) than Site B (3 currently), would the Site A servers still hang in this scenario? 3) In what situation would cluster partitioning occur? Your answers to these, and any related input on this power outage problem, would be greatly appreciated. Thank you.

The Answer is:

Your stated facts are unclear or contradictory. You first claim 2 nodes at Site A and 7 at Site B, each with 1 vote -- but then say there are only 3 votes at Site B. Apparently, only 3 of the systems at Site B have 1 vote each and the other 4 have 0 votes. The total of 5 votes matches the quorum of 3. Perhaps the 7 was a typo?

Did the power outage hit both sites concurrently? The OpenVMS Wizard suspects that only one site had a power problem, but that it isolated the sites from each other. The Site A systems should enter a quorum hang (block), as that lobe has only 2 votes between its members. The Site B systems would have stayed online if all 3 voting servers had been up. Why was the one system at Site B down? With that one system down at Site B, the two remaining systems also trigger a quorum hang.

If all of the cluster interconnects go down, then the isolated nodes all encounter quorum hangs (blocks), as none of the nodes has the required 3 votes by itself. This is expected behavior.

The system parameter MVTIMEOUT specifies how long a volume can remain offline (in mount verification) before it is made unavailable. After the mount verification timeout occurs, only a dismount and remount can make the volume available again. If the volume comes back online before MVTIMEOUT expires, the stalled I/Os are simply reissued and the applications pick up where they left off. You could consider increasing MVTIMEOUT to tolerate longer temporary outages.

Although you might have been able to recover without rebooting all of the systems in the cluster, it was probably easier to do so. Usually, when a volume is made unavailable, the stalled I/Os fail back to the applications, which report the error and exit. Typically, you can then dismount the volumes and remount them. The problem comes when an application keeps a channel open to the volume despite the state change. Finding and stopping all such applications cluster-wide can be tedious.

Answers:

1: Not necessarily. Having the one system down at Site B was the problem.

2: No. With 4+3=7 votes in total, quorum would have to be set to 4. Site A would stay online (if all systems there were up).

3: Cluster partitioning -- where both sides remain online despite being disconnected -- happens most often when EXPECTED_VOTES is misconfigured.
It can also happen when a system manager forces the blocked systems to recalculate quorum dynamically. (For more details on VOTES and EXPECTED_VOTES, please see the OpenVMS FAQ.)
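To make the vote arithmetic above concrete, here is a minimal sketch (plain Python, not OpenVMS DCL) that applies the standard quorum formula, quorum = (EXPECTED_VOTES + 2) / 2 with integer division, to the two configurations discussed in the answer. The site layouts are taken from the question; the function names are purely illustrative.

```python
# Sketch of the VMScluster vote arithmetic discussed above.
# quorum = (expected_votes + 2) // 2 is the standard OpenVMS calculation;
# the site layouts come from the question, the function names are invented.

def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES (integer division)."""
    return (expected_votes + 2) // 2

def site_survives(site_votes: int, expected_votes: int) -> bool:
    """Can an isolated site keep running on its own votes?"""
    return site_votes >= quorum(expected_votes)

# Current configuration: Site A = 2 votes, Site B = 3 votes, total = 5.
print(quorum(5))            # 3
print(site_survives(2, 5))  # False -> Site A blocks (quorum hang)
print(site_survives(3, 5))  # True  -> Site B survives, but only if all 3 voters are up

# Hypothetical configuration from question 2: Site A = 4, Site B = 3, total = 7.
print(quorum(7))            # 4
print(site_survives(4, 7))  # True  -> Site A would stay online
print(site_survives(3, 7))  # False -> Site B would block instead
```

Note that the second configuration simply shifts which lobe survives an isolation: giving Site A 4 of the 7 votes lets it ride out a loss of the inter-site links, while Site B would then block.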