HP OpenVMS Systems Documentation

OpenVMS Cluster Systems

Chapter 2
OpenVMS Cluster Concepts

To help you understand the design and implementation of an OpenVMS Cluster system, this chapter describes its basic architecture.

2.1 OpenVMS Cluster System Architecture

Figure 2-1 illustrates the protocol layers within the OpenVMS Cluster system architecture, ranging from the communications mechanisms at the base of the figure to the users of the system at the top of the figure. These protocol layers include:

Ports
System Communications Services (SCS)
System Applications (SYSAPs)
Other layered components

Figure 2-1 OpenVMS Cluster System Architecture

2.1.1 Port Layer

This lowest level of the architecture provides connections, in the form of communication ports and physical paths, between devices. The port layer can contain any of the following interconnects:

LANs
- ATM
- Ethernet (10/100 and Gigabit Ethernet)
- FDDI
CI
DSSI
MEMORY CHANNEL
SCSI
Fibre Channel

Each interconnect is accessed by a port (also referred to as an adapter) that connects to the processor node. For example, the Fibre Channel interconnect is accessed by way of a Fibre Channel port. The interconnects are discussed in Chapter 3.

2.1.2 SCS Layer

The SCS layer provides basic connection management and communications services in the form of datagrams, messages, and block transfers over each logical path. Table 2-1 describes these services.

**Table 2-1 Communications Services**
Service	Delivery Guarantees	Usage
Datagrams: information units of less than one packet	Delivery of datagrams is not guaranteed. Datagrams can be lost, duplicated, or delivered out of order.	Status and information messages whose loss is not critical; applications that have their own reliability protocols (such as DECnet)
Messages: information units of less than one packet	Messages are guaranteed to be delivered and to arrive in order. Virtual circuit sequence numbers are used on the individual packets.	Disk read and write requests
Block data transfers: any contiguous data in a process virtual address space. There is no size limit except that imposed by the physical memory constraints of the host system.	Delivery of block data is guaranteed. The sending and receiving ports and the port emulators cooperate in breaking the transfer into data packets and ensuring that all packets are correctly transmitted, received, and placed in the appropriate destination buffer. Block data transfers differ from messages in the size of the transfer.	Disk subsystems and disk servers to move data associated with disk read and write requests

The SCS layer is implemented as a combination of hardware and software, or software only, depending upon the type of port. SCS manages connections in an OpenVMS Cluster and multiplexes messages between system applications over a common transport called a virtual circuit. A virtual circuit exists between each pair of SCS ports, and a set of SCS connections is multiplexed on that virtual circuit.
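
As a rough illustration of sequenced delivery over a virtual circuit, the following Python sketch models a receiver that uses sequence numbers to discard duplicates and to release packets, in order, to the connection they belong to. The class name, the connection labels, and the buffering strategy are assumptions made for this example; this is a conceptual model only and does not represent the actual SCS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualCircuitReceiver:
    """Toy receiving end of one virtual circuit (hypothetical model)."""

    next_seq: int = 0
    pending: dict = field(default_factory=dict)    # seq -> (connection, payload)
    delivered: dict = field(default_factory=dict)  # connection -> [payloads]

    def receive(self, seq, connection, payload):
        # Drop duplicates of packets already delivered or already buffered.
        if seq < self.next_seq or seq in self.pending:
            return
        self.pending[seq] = (connection, payload)
        # Release every packet that is now in sequence to its connection.
        while self.next_seq in self.pending:
            conn, data = self.pending.pop(self.next_seq)
            self.delivered.setdefault(conn, []).append(data)
            self.next_seq += 1
```

In this model, several connections (for example, one per system application) share a single sequence space, which is one simple way to picture multiplexing on a common transport.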

2.1.3 System Applications (SYSAPs) Layer

The next higher layer in the OpenVMS Cluster architecture is the SYSAPs layer. This layer consists of multiple system applications that provide, for example, access to disks and tapes and cluster membership control. SYSAPs can include:

Connection manager
MSCP server
TMSCP server
Disk and tape class drivers

These components are described in detail later in this chapter.

2.1.4 Other Layered Components

A wide range of OpenVMS components layer on top of the OpenVMS Cluster system architecture, including:

Volume Shadowing for OpenVMS
Distributed lock manager
Process control services
Distributed file system
Record Management Services (RMS)
Distributed job controller

These components, except for volume shadowing, are described in detail later in this chapter. Volume Shadowing for OpenVMS is described in Section 6.6.

2.2 OpenVMS Cluster Software Functions

The OpenVMS Cluster software components that implement OpenVMS Cluster communication and resource-sharing functions always run on every computer in the OpenVMS Cluster. If one computer fails, the OpenVMS Cluster system continues operating, because the components still run on the remaining computers.

2.2.1 Functions

The following table summarizes the OpenVMS Cluster communication and resource-sharing functions and the components that perform them.

Function	Performed By
Ensure that OpenVMS Cluster computers communicate with one another to enforce the rules of cluster membership	Connection manager
Synchronize functions performed by other OpenVMS Cluster components, OpenVMS products, and other software components	Distributed lock manager
Share disks and files	Distributed file system
Make disks available to nodes that do not have direct access	MSCP server
Make tapes available to nodes that do not have direct access	TMSCP server
Make queues available	Distributed job controller

2.3 Ensuring the Integrity of Cluster Membership

The connection manager ensures that computers in an OpenVMS Cluster system communicate with one another to enforce the rules of cluster membership.

Computers in an OpenVMS Cluster system share various data and system resources, such as access to disks and files. To achieve the coordination that is necessary to maintain resource integrity, the computers must maintain a clear record of cluster membership.

2.3.1 Connection Manager

The connection manager creates an OpenVMS Cluster when the first computer is booted and reconfigures the cluster when computers join or leave it during cluster state transitions. The overall responsibilities of the connection manager are to:

Prevent partitioning (see Section 2.3.2).
Track which nodes in the OpenVMS Cluster system are active and which are not.
Deliver messages to remote nodes.
Remove nodes.
Provide a highly available message service in which other software components, such as the distributed lock manager, can synchronize access to shared resources.

2.3.2 Cluster Partitioning

A primary purpose of the connection manager is to prevent cluster partitioning, a condition in which nodes in an existing OpenVMS Cluster configuration divide into two or more independent clusters.

Cluster partitioning can result in data file corruption because the distributed lock manager cannot coordinate access to shared resources for multiple OpenVMS Cluster systems. The connection manager prevents cluster partitioning using a quorum algorithm.

2.3.3 Quorum Algorithm

The quorum algorithm is a mathematical method for determining if a majority of OpenVMS Cluster members exist so resources can be shared across an OpenVMS Cluster system. Quorum is a dynamic value calculated by the connection manager to prevent cluster partitioning. The connection manager allows processing to occur only if a majority of the OpenVMS Cluster members are functioning.

2.3.4 System Parameters

Two system parameters, VOTES and EXPECTED_VOTES, are key to the computations performed by the quorum algorithm. The following table describes these parameters.

Parameter

Description

VOTES

Specifies a fixed number of votes that a computer contributes toward quorum. The system manager can set the VOTES parameter on each computer or allow the operating system to set it to the following default values:

For satellite nodes, the default value is 0.
For all other computers, the default value is 1.

Each Alpha or VAX computer with a nonzero value for the VOTES system parameter is considered a voting member.

EXPECTED_VOTES

Specifies the sum of all VOTES held by OpenVMS Cluster members. The initial value is used to derive an estimate of the correct quorum value for the cluster. The system manager must set this parameter on each active Alpha or VAX computer, including satellites, in the cluster.

2.3.5 Calculating Cluster Votes

The quorum algorithm operates as follows:

Step

Action

When nodes in the OpenVMS Cluster boot, the connection manager uses the largest value for EXPECTED_VOTES of all systems present to derive an estimated quorum value according to the following formula:

Estimated quorum = (EXPECTED_VOTES + 2)/2, rounded down

During a state transition, the connection manager dynamically computes the cluster quorum value to be the maximum of the following:

The current cluster quorum value
The largest of the values calculated from the following formula, where the EXPECTED_VOTES value is the largest value specified by any node in the cluster:
```
QUORUM = (EXPECTED_VOTES + 2)/2, rounded down
```
The value calculated from the following formula, where VOTES is the total of the votes held by all cluster members:
```
QUORUM = (VOTES + 2)/2, rounded down
```

The cluster state transitions that cause cluster quorum to be recalculated occur when a computer joins the cluster and when the cluster recognizes a quorum disk (quorum disks are discussed in Section 2.3.7).
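
The quorum calculation described above can be sketched in Python as follows. The function and argument names are hypothetical and chosen for illustration; integer division stands in for "rounded down".

```python
def quorum_from(votes: int) -> int:
    """Apply the formula (votes + 2)/2, rounded down."""
    return (votes + 2) // 2


def recompute_quorum(current_quorum, expected_votes_by_node, votes_by_node):
    """Return the quorum value after a state transition.

    The result is the maximum of the current quorum, the quorum derived
    from the largest EXPECTED_VOTES value, and the quorum derived from
    the total votes held by all members.
    """
    largest_expected = max(expected_votes_by_node)
    total_votes = sum(votes_by_node)
    return max(
        current_quorum,
        quorum_from(largest_expected),
        quorum_from(total_votes),
    )
```

For example, under these assumptions a four-member cluster in which every node has one vote and EXPECTED_VOTES set to 3 would have its quorum raised from 2 to 3, because the total votes (4) yield the larger value.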

The connection manager compares the cluster votes value to the cluster quorum value and determines what action to take based on the following conditions:

WHEN...	THEN...
The total number of cluster votes is equal to at least the quorum value	The OpenVMS Cluster system continues running.
The current number of cluster votes drops below the quorum value (because of computers leaving the cluster)	The remaining OpenVMS Cluster members suspend all process activity and all I/O operations to cluster-accessible disks and tapes until sufficient votes are added (that is, enough computers have joined the OpenVMS Cluster) to bring the total number of votes to a value greater than or equal to quorum.

Note: When a node leaves the OpenVMS Cluster system, the connection manager does not decrease the cluster quorum value. In fact, the connection manager never decreases the cluster quorum value; it only increases it. However, system managers can decrease the value according to the instructions in Section 8.6.2.
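
The following Python sketch models the comparison of cluster votes with cluster quorum, including the rule that quorum is never lowered when a member leaves. The class and its method names are hypothetical; it is a simplified model under stated assumptions, not a description of connection manager internals.

```python
class ClusterModel:
    """Toy model of cluster votes versus quorum (hypothetical)."""

    def __init__(self, expected_votes: int):
        # Estimated quorum derived from EXPECTED_VOTES at boot.
        self.quorum = (expected_votes + 2) // 2
        self.votes = {}

    @property
    def total_votes(self) -> int:
        return sum(self.votes.values())

    @property
    def running(self) -> bool:
        # Activity continues only while votes are at least equal to quorum.
        return self.total_votes >= self.quorum

    def join(self, name: str, votes: int) -> None:
        self.votes[name] = votes
        # Quorum can only increase during a transition.
        self.quorum = max(self.quorum, (self.total_votes + 2) // 2)

    def leave(self, name: str) -> None:
        # Votes are removed, but the quorum value is left unchanged.
        self.votes.pop(name, None)
```

With three one-vote members and EXPECTED_VOTES set to 3, this model keeps running after one member leaves and stops running after a second member leaves, until a member rejoins.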

2.3.6 Example

Consider a cluster consisting of three computers, each computer having its VOTES parameter set to 1 and its EXPECTED_VOTES parameter set to 3. The connection manager dynamically computes the cluster quorum value to be 2 (that is, (3 + 2)/2 = 2.5, rounded down to 2). In this example, any two of the three computers constitute a quorum and can run in the absence of the third computer. No single computer can constitute a quorum by itself. Therefore, there is no way the three OpenVMS Cluster computers can be partitioned and run as two independent clusters.
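
The claim that this cluster cannot be partitioned can be checked by brute force. The Python sketch below enumerates every group of members that would hold quorum and tests whether any two such groups are disjoint. The function names and the node names A, B, and C are assumptions made for illustration.

```python
from itertools import combinations


def quorum(expected_votes: int) -> int:
    return (expected_votes + 2) // 2


def groups_with_quorum(votes: dict, expected_votes: int) -> list:
    """Return every subset of members whose combined votes reach quorum."""
    q = quorum(expected_votes)
    names = sorted(votes)
    groups = []
    for size in range(1, len(names) + 1):
        for group in combinations(names, size):
            if sum(votes[n] for n in group) >= q:
                groups.append(frozenset(group))
    return groups


def can_partition(votes: dict, expected_votes: int) -> bool:
    """True if two disjoint groups could each hold quorum at the same time."""
    groups = groups_with_quorum(votes, expected_votes)
    return any(a.isdisjoint(b) for a, b in combinations(groups, 2))


three_nodes = {"A": 1, "B": 1, "C": 1}
print(can_partition(three_nodes, expected_votes=3))  # False
```

By contrast, if EXPECTED_VOTES were incorrectly set to 1 in this model, each computer alone would reach quorum and `can_partition` would return `True`.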

2.3.7 Quorum Disk

A cluster system manager can designate a disk as a quorum disk. The quorum disk acts as a virtual cluster member whose purpose is to add one vote to the total cluster votes. By establishing a quorum disk, you can increase the availability of a two-node cluster; such configurations can maintain quorum in the event of failure of either the quorum disk or one node, and continue operating.

Note: Setting up a quorum disk is recommended for OpenVMS Cluster configurations with only two nodes. A quorum disk is not necessary for configurations with more than two nodes.

For example, assume an OpenVMS Cluster configuration with many satellites (that have no votes) and two nonsatellite systems (each having one vote) that downline load the satellites. Quorum is calculated as follows:

 (EXPECTED_VOTES + 2)/2 = (2 + 2)/2 = 2

Because there is no quorum disk, if either of the nonsatellite systems departs from the cluster, only one vote remains and cluster quorum is lost. Activity will be blocked throughout the cluster until quorum is restored.

However, if the configuration includes a quorum disk (adding one vote to the total cluster votes), and the EXPECTED_VOTES parameter is set to 3 on each node, then quorum will still be 2 even if one of the nodes leaves the cluster. Quorum is calculated as follows:

 (EXPECTED_VOTES + 2)/2 = (3 + 2)/2 = 2 (rounded down)
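
The two cases above can be compared with a short Python sketch that reports, for each voting member, whether the cluster would keep quorum if that member alone were lost. The member names are hypothetical, and the quorum disk is modeled simply as one more voting member.

```python
def quorum(expected_votes: int) -> int:
    return (expected_votes + 2) // 2


def survives_loss_of(member_votes: dict, expected_votes: int) -> dict:
    """Map each member to True if quorum is retained after losing it."""
    q = quorum(expected_votes)
    total = sum(member_votes.values())
    return {name: (total - v) >= q for name, v in member_votes.items()}


# Two nonsatellite systems, no quorum disk, EXPECTED_VOTES = 2.
without_disk = survives_loss_of({"NODE_A": 1, "NODE_B": 1}, expected_votes=2)

# Same systems plus a one-vote quorum disk, EXPECTED_VOTES = 3.
with_disk = survives_loss_of(
    {"NODE_A": 1, "NODE_B": 1, "QUORUM_DISK": 1}, expected_votes=3
)

print(without_disk)  # {'NODE_A': False, 'NODE_B': False}
print(with_disk)     # {'NODE_A': True, 'NODE_B': True, 'QUORUM_DISK': True}
```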

Rules: Each OpenVMS Cluster system can include only one quorum disk. At least one computer must have a direct (not served) connection to the quorum disk:

Any computers that have a direct, active connection to the quorum disk or that have the potential for a direct connection should be enabled as quorum disk watchers.
Computers that cannot access the disk directly must rely on the quorum disk watchers for information about the status of votes contributed by the quorum disk.

Reference: For more information about enabling a quorum disk, see Section 8.2.4. Section 8.3.2 describes removing a quorum disk.

2.3.8 Quorum Disk Watcher

To enable a computer as a quorum disk watcher, use one of the following methods:

Method

Perform These Steps

Run the CLUSTER_CONFIG.COM procedure (described in Chapter 8)

Invoke the procedure and:

Select the CHANGE option.
From the CHANGE menu, select the item labeled "Enable a quorum disk on the local computer".
At the prompt, supply the quorum disk device name.

The procedure uses the information you provide to update the values of the DISK_QUORUM and QDSKVOTES system parameters.

Respond YES when the OpenVMS installation procedure asks whether the cluster will contain a quorum disk (described in Chapter 4)

During the installation procedure:

Answer Y when the procedure asks whether the cluster will contain a quorum disk.
At the prompt, supply the quorum disk device name.

The procedure uses the information you provide to update the values of the DISK_QUORUM and QDSKVOTES system parameters.

Edit the MODPARAMS or AGEN$ files (described in Chapter 8)

Edit the following parameters:

DISK_QUORUM: Specify the quorum disk name, in ASCII, as a value for the DISK_QUORUM system parameter.
QDSKVOTES: Set an appropriate value for the QDSKVOTES parameter. This parameter specifies the number of votes contributed to the cluster votes total by a quorum disk. The number of votes contributed by the quorum disk is equal to the smallest value of the QDSKVOTES parameter on any quorum disk watcher.

Hint: If only one quorum disk watcher has direct access to the quorum disk, then remove the disk and give its votes to the node.
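
Because the quorum disk contributes the smallest QDSKVOTES value found on any quorum disk watcher, a single low setting limits the disk's contribution for the whole cluster. A minimal Python sketch of that rule follows; the function name and watcher names are hypothetical, and returning 0 when there are no watchers is an assumption of this model.

```python
def quorum_disk_votes(qdskvotes_by_watcher: dict) -> int:
    """Votes contributed by the quorum disk.

    Takes a mapping of watcher name to that watcher's QDSKVOTES value
    and returns the smallest value. With no watchers, this model
    assumes the disk contributes no votes.
    """
    if not qdskvotes_by_watcher:
        return 0
    return min(qdskvotes_by_watcher.values())
```

For instance, `quorum_disk_votes({"NODE_A": 2, "NODE_B": 1})` evaluates to 1, which suggests keeping QDSKVOTES identical on all watchers.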

2.3.9 Rules for Specifying Quorum

For the quorum disk's votes to be counted in the total cluster votes, the following conditions must be met:

On all computers capable of becoming watchers, you must specify the same physical device name as a value for the DISK_QUORUM system parameter. The remaining computers (which must have a blank value for DISK_QUORUM) recognize the name specified by the first quorum disk watcher with which they communicate.
At least one quorum disk watcher must have a direct, active connection to the quorum disk.
The disk must contain a valid format file named QUORUM.DAT in the master file directory. The QUORUM.DAT file is created automatically after a system specifying a quorum disk has booted into the cluster for the first time. This file is used on subsequent reboots.
Note: The file is not created if the system parameter STARTUP_P1 is set to MIN.
To permit recovery from failure conditions, the quorum disk must be mounted by all disk watchers.
The OpenVMS Cluster can include only one quorum disk.
The quorum disk cannot be a member of a shadow set.

Hint: By increasing the quorum disk's votes to one less than the total votes from both systems (and by increasing the value of the EXPECTED_VOTES system parameter by the same amount), you can boot and run the cluster with only one node.
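
The arithmetic behind this hint can be worked through in Python. The sketch assumes a two-node cluster in which EXPECTED_VOTES equals the sum of the node votes and the quorum disk votes; the function name and the returned keys are hypothetical.

```python
def single_node_boot_settings(node_a_votes: int, node_b_votes: int) -> dict:
    """Derive quorum disk settings that let one node run with the disk."""
    node_total = node_a_votes + node_b_votes
    disk_votes = node_total - 1                # one less than both nodes' votes
    expected_votes = node_total + disk_votes   # raised by the same amount
    quorum = (expected_votes + 2) // 2
    return {
        "QDSKVOTES": disk_votes,
        "EXPECTED_VOTES": expected_votes,
        "quorum": quorum,
        "node_a_plus_disk_ok": node_a_votes + disk_votes >= quorum,
        "node_b_plus_disk_ok": node_b_votes + disk_votes >= quorum,
        "disk_alone_ok": disk_votes >= quorum,
    }


print(single_node_boot_settings(2, 2))
```

With two votes per node, the model gives QDSKVOTES = 3, EXPECTED_VOTES = 7, and a quorum of 4. Either node together with the disk holds 5 votes and meets quorum, while the disk alone (3 votes) does not.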

2.4 State Transitions

OpenVMS Cluster state transitions occur when a computer joins or leaves an OpenVMS Cluster system and when the cluster recognizes a quorum disk state change. The connection manager controls these events to ensure the preservation of data integrity throughout the cluster.

A state transition's duration and effect on users (applications) are determined by the reason for the transition, the configuration, and the applications in use.

2.4.1 Adding a Member

Every transition goes through one or more phases, depending on whether its cause is the addition of a new OpenVMS Cluster member or the failure of a current member.

Table 2-2 describes the phases of a transition caused by the addition of a new member.

**Table 2-2 Transitions Caused by Adding a Cluster Member**
Phase	Description
New member detection	Early in its boot sequence, a computer seeking membership in an OpenVMS Cluster system sends messages to current members asking to join the cluster. The first cluster member that receives the membership request acts as the new computer's advocate and proposes reconfiguring the cluster to include the computer in the cluster. While the new computer is booting, no applications are affected. Note: The connection manager will not allow a computer to join the OpenVMS Cluster system if the node's value for EXPECTED_VOTES would raise quorum above the calculated votes and thereby cause the OpenVMS Cluster to suspend activity.
Reconfiguration	During a configuration change due to a computer being added to an OpenVMS Cluster, all current OpenVMS Cluster members must establish communications with the new computer. Once communications are established, the new computer is admitted to the cluster. In some cases, the lock database is rebuilt.
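
The note in the new member detection phase can be pictured as an admission check. The Python sketch below is one possible reading of that rule, based on the quorum formula used elsewhere in this chapter; the function name and arguments are hypothetical, and the real connection manager logic may differ.

```python
def may_join(member_votes, current_quorum, new_votes, new_expected_votes):
    """Return True if admitting the new computer would not suspend activity.

    member_votes is a list of the votes held by current members.
    """
    votes_after = sum(member_votes) + new_votes
    quorum_after = max(
        current_quorum,
        (new_expected_votes + 2) // 2,  # quorum implied by the new node
        (votes_after + 2) // 2,         # quorum implied by total votes
    )
    # Refuse the join if quorum would rise above the votes available.
    return votes_after >= quorum_after
```

Under this reading, a one-vote node with EXPECTED_VOTES set to 4 could join a three-vote cluster, whereas the same node with EXPECTED_VOTES set to 10 would be refused because quorum would rise to 6 with only 4 votes present.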

2.4.2 Losing a Member

Table 2-3 describes the phases of a transition caused by the failure of a current OpenVMS Cluster member.

**Table 2-3 Transitions Caused by Loss of a Cluster Member**

Cause

Description

Failure detection

The duration of this phase depends on the cause of the failure and on how the failure is detected.

During normal cluster operation, messages sent from one computer to another are acknowledged when received.

IF...	THEN...
A message is not acknowledged within a period determined by OpenVMS Cluster communications software	The repair attempt phase begins.
A cluster member is shut down or fails	The operating system causes datagrams to be sent from the computer shutting down to the other members. These datagrams state the computer's intention to sever communications and to stop sharing resources. The failure detection and repair attempt phases are bypassed, and the reconfiguration phase begins immediately.

Repair attempt

If the virtual circuit to an OpenVMS Cluster member is broken, attempts are made to repair the path. Repair attempts continue for an interval specified by the PAPOLLINTERVAL system parameter. (System managers can adjust the value of this parameter to suit local conditions.) Thereafter, the path is considered irrevocably broken, and steps must be taken to reconfigure the OpenVMS Cluster system so that all computers can once again communicate with each other and so that computers that cannot communicate are removed from the OpenVMS Cluster.

Reconfiguration

If a cluster member is shut down or fails, the cluster must be reconfigured. One of the remaining computers acts as coordinator and exchanges messages with all other cluster members to determine an optimal cluster configuration with the most members and the most votes. This phase, during which all user (application) activity is blocked, usually lasts less than 3 seconds, although the actual time depends on the configuration.

OpenVMS Cluster system recovery

Recovery includes the following stages, some of which can take place in parallel:

Stage

Action

I/O completion

When a computer is removed from the cluster, OpenVMS Cluster software ensures that all I/O operations that are started prior to the transition complete before I/O operations that are generated after the transition. This stage usually has little or no effect on applications.

Lock database rebuild

Because the lock database is distributed among all members, some portion of the database might need rebuilding. A rebuild is performed as follows:

WHEN...	THEN...
A computer leaves the OpenVMS Cluster	A rebuild is always performed.
A computer is added to the OpenVMS Cluster	A rebuild is performed when the LOCKDIRWT system parameter is greater than 1.

Caution: Setting the LOCKDIRWT system parameter to different values on the same model or type of computer can cause the distributed lock manager to use the computer with the higher value. This could cause undue resource usage on that computer.

Disk mount verification

This stage occurs only when the failure of a voting member causes quorum to be lost. To protect data integrity, all I/O activity is blocked until quorum is regained. Mount verification is the mechanism used to block I/O during this phase.

Quorum disk votes validation

If, when a computer is removed, the remaining members can determine that it has shut down or failed, the votes contributed by the quorum disk are included without delay in quorum calculations that are performed by the remaining members. However, if the quorum disk watcher cannot determine that the computer has shut down or failed (for example, if a console halt, power failure, or communications failure has occurred), the votes are not included for a period (in seconds) equal to four times the value of the QDSKINTERVAL system parameter. This period is sufficient to determine that the failed computer is no longer using the quorum disk.

Disk rebuild

If the transition is the result of a computer rebooting after a failure, the disks are marked as improperly dismounted.

Reference: See Sections 6.5.5 and 6.5.6 for information about rebuilding disks.

Application recovery

When you assess the effect of a state transition on application users, consider that the application recovery phase includes activities such as replaying a journal file, cleaning up recovery units, and users logging in again.
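
For the quorum disk votes validation stage, the waiting period can be expressed as a small calculation. The Python sketch below uses a hypothetical function name, and the QDSKINTERVAL value shown in the usage note is an arbitrary example rather than a documented default.

```python
def quorum_disk_vote_delay(qdskinterval_seconds: int, failure_confirmed: bool) -> int:
    """Seconds before quorum disk votes are counted after a member is removed."""
    if failure_confirmed:
        # Remaining members know the computer shut down or failed.
        return 0
    # Otherwise wait four times QDSKINTERVAL before counting the votes.
    return 4 * qdskinterval_seconds
```

For example, with QDSKINTERVAL set to 3, an unconfirmed failure would delay the quorum disk votes by 12 seconds in this model.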
