HP OpenVMS Systems Documentation
OpenVMS Cluster Systems
C.4 Computer Fails to Join the Cluster
If a computer fails to join the cluster, follow the procedures in this
section to determine the cause.
C.4.1 Verifying OpenVMS Cluster Software Load
To verify that OpenVMS Cluster software has been loaded, follow these instructions:
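One quick check is to display the VAXCLUSTER system parameter and the current cluster membership; a minimal sketch:

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW VAXCLUSTER    ! 0 = never load, 1 = load if cluster hardware is present, 2 = always load
SYSGEN> EXIT
$ SHOW CLUSTER             ! displays cluster membership if the software is running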
C.4.2 Verifying Boot Disk and Root
To verify that the computer has booted from the correct disk and system root, follow these instructions:
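For example, you can compare the booted system device and root against what you expect by translating two standard logical names:

$ SHOW LOGICAL SYS$SYSDEVICE    ! the system disk the node booted from
$ SHOW LOGICAL SYS$TOPSYS       ! the system root, for example SYS0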
C.4.3 Verifying SCSNODE and SCSSYSTEMID Parameters
To be eligible to join a cluster, a computer must have unique SCSNODE and SCSSYSTEMID parameter values.
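You can display both parameters with SYSGEN and confirm that no other cluster member uses the same values; for example:

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SHOW SCSNODE
SYSGEN> SHOW SCSSYSTEMID
SYSGEN> EXIT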
C.4.4 Verifying Cluster Security Information
To verify the cluster group code and password, follow these instructions:
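One way to display the cluster group number (the password itself is never displayed) is the SYSMAN utility; a sketch, assuming suitable privileges:

$ RUN SYS$SYSTEM:SYSMAN
SYSMAN> CONFIGURATION SHOW CLUSTER_AUTHORIZATION
SYSMAN> EXIT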
C.5 Startup Procedures Fail to Complete
If a computer boots and joins the cluster but appears to hang before startup procedures complete---that is, before you are able to log in to the system---be sure that you have allowed sufficient time for the startup procedures to execute.
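If you suspect a hang rather than slow startup, rebooting with a verbose startup display can show where the procedures stop. One conventional approach, set during a conversational boot, is sketched below (accepted values vary by OpenVMS version; check your system management documentation):

SYSBOOT> SET STARTUP_P2 "YES"    ! request verbose startup messages
SYSBOOT> CONTINUE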
C.6 Diagnosing LAN Component Failures
Section D.5 provides troubleshooting techniques for LAN component failures (for example, broken LAN bridges). Appendix D also describes techniques for using the Local Area OpenVMS Cluster Network Failure Analysis Program.
Intermittent LAN component failures (for example, packet loss) can
cause problems in the NISCA transport protocol that delivers System
Communications Services (SCS) messages to other nodes in the OpenVMS
Cluster. Appendix F describes troubleshooting techniques and
requirements for LAN analyzer tools.
C.7 Diagnosing Cluster Hangs
Conditions like the following can cause an OpenVMS Cluster computer to suspend process or system activity (that is, to hang):
C.7.1 Cluster Quorum is Lost
The OpenVMS Cluster quorum algorithm coordinates activity among OpenVMS Cluster computers and ensures the integrity of shared cluster resources. (The quorum algorithm is described fully in Chapter 2.) Quorum is checked after any change to the cluster configuration---for example, when a voting computer leaves or joins the cluster. If quorum is lost, process and I/O activity on all computers in the cluster is blocked.

Information about the loss of quorum and about the clusterwide events that cause it is sent to the OPCOM process, which broadcasts messages to designated operator terminals. The information is also broadcast to each computer's operator console (OPA0), unless broadcast activity is explicitly disabled on that terminal. However, because quorum may be lost before OPCOM has been able to inform the operator terminals, the messages sent to OPA0 are the most reliable source of information about events that cause loss of quorum.

If quorum is lost, you can restore it by adding or rebooting a node with sufficient additional votes.
Reference: See also the information about cluster
quorum in Section 10.12.
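For example, after a voting member has been permanently removed, you can make the remaining nodes recompute quorum with the SET CLUSTER command (this requires a system that is already running and quorate; the value shown is illustrative only):

$ ! Tell the cluster that only 3 votes are now expected
$ SET CLUSTER/EXPECTED_VOTES=3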
C.7.2 A Shared Cluster Resource Is Inaccessible
Access to shared cluster resources is coordinated by the distributed lock manager. If a particular process is granted a lock on a resource (for example, a shared data file), other processes in the cluster that request incompatible locks on that resource must wait until the original lock is released. If the original process retains its lock for an extended period, other processes waiting for the lock to be released may appear to hang.

Occasionally, a system activity must acquire a restrictive lock on a resource for an extended period. For example, to perform a volume rebuild, system software takes out an exclusive lock on the volume being rebuilt. While this lock is held, no processes can allocate space on the disk volume. If they attempt to do so, they may appear to hang.

Access to files that contain data necessary for the operation of the system itself is also coordinated by the distributed lock manager. For this reason, a process that acquires a lock on one of these resources and is then unable to proceed may cause the cluster to appear to hang. For example, this condition may occur if a process locks a portion of the system authorization file (SYS$SYSTEM:SYSUAF.DAT) for write access. Any activity that requires access to that portion of the file, such as logging in to an account with the same or similar user name or sending mail to that user name, is blocked until the original lock is released. Normally, this lock is released quickly, and users do not notice the locking operation.
However, if the process holding the lock is unable to proceed, other
processes could enter a wait state. Because the authorization file is
used during login and for most process creation operations (for
example, batch and network jobs), blocked processes could rapidly
accumulate in the cluster. Because the distributed lock manager is
functioning normally under these conditions, users are not notified by
broadcast messages or other means that a problem has occurred.
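To investigate a suspected lock wait on a running system, you can examine the lock and resource databases with the System Dump Analyzer; a minimal sketch (the process name is an example, and output formats vary by version):

$ ANALYZE/SYSTEM
SDA> SET PROCESS HUNG_PROC    ! select the stalled process (example name)
SDA> SHOW PROCESS/LOCKS       ! locks this process holds or is waiting for
SDA> SHOW RESOURCE            ! walk the resource database for granted and waiting locks
SDA> EXIT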
C.8 Diagnosing CLUEXIT Bugchecks
The operating system performs bugcheck operations only
when it detects conditions that could compromise normal system activity
or endanger data integrity. A CLUEXIT bugcheck is a
type of bugcheck initiated by the connection manager, the OpenVMS
Cluster software component that manages the interaction of cooperating
OpenVMS Cluster computers. Most such bugchecks are triggered by
conditions resulting from hardware failures (particularly failures in
communications paths), configuration errors, or system management
errors.
The most common conditions that result in CLUEXIT bugchecks are as follows:
C.9 Port Communications
These sections provide detailed information about port communications
to assist in diagnosing port communication problems.
C.9.1 Port Polling
Shortly after a CI computer boots, the CI port driver (PADRIVER) begins configuration polling to discover other active ports on the CI. Normally, the poller runs every 5 seconds (the default value of the PAPOLLINTERVAL system parameter). In the first polling pass, all addresses are probed over cable path A; on the second pass, all addresses are probed over path B; on the third pass, path A is probed again; and so on. The poller probes by sending Request ID (REQID) packets to all possible port numbers, including itself. Active ports that receive a REQID return an ID Received (IDREC) packet to the port that issued the REQID. A port might respond to a REQID even if the computer attached to the port is not running.
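You can check the polling interval on a running system with the SYSGEN utility; for example:

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW PAPOLLINTERVAL
SYSGEN> EXIT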
For OpenVMS Cluster systems communicating over the CI, DSSI, or a
combination of these interconnects, the port drivers perform a start
handshake when a pair of ports and port drivers has successfully
exchanged ID packets. The port drivers exchange datagrams containing
information about the computers, such as the type of computer and the
operating system version. If this exchange is successful, each computer
declares a virtual circuit open. An open virtual circuit is a prerequisite for all other activity.
For clusters that include Ethernet or FDDI interconnects, a multicast scheme is used to locate computers on the LAN. Approximately every 3 seconds, the port emulator driver (PEDRIVER) sends a HELLO datagram message through each LAN adapter to a cluster-specific multicast address that is derived from the cluster group number. The driver also enables the reception of these messages from other computers. When the driver receives a HELLO datagram message from a computer with which it does not currently share an open virtual circuit, it attempts to create a circuit. HELLO datagram messages received from a computer with a currently open virtual circuit indicate that the remote computer is operational.
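On more recent OpenVMS versions, you can observe the LAN channels and virtual circuits that PEDRIVER maintains with the SCACP utility (availability and output vary by version); a brief sketch:

$ RUN SYS$SYSTEM:SCACP
SCACP> SHOW CHANNEL    ! LAN paths to each remote node, per local adapter
SCACP> SHOW VC         ! virtual circuit states and counters
SCACP> EXIT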
A standard, three-message exchange handshake is used to create a
virtual circuit. The handshake messages contain information about the
transmitting computer and its record of the cluster password. These
parameters are verified at the receiving computer, which continues the
handshake only if its verification is successful. Thus, each computer
authenticates the other. After the final message, the virtual circuit
is opened for use by both computers.
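To see which virtual circuits a node currently has open, you can add the circuits class to a SHOW CLUSTER display; for example:

$ SHOW CLUSTER/CONTINUOUS
Command > ADD CIRCUITS
Command > EXIT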
System services such as the disk class driver, connection manager, and the MSCP and TMSCP servers communicate between computers with a protocol called System Communications Services (SCS). SCS is responsible primarily for forming and breaking intersystem process connections and for controlling the flow of message traffic over those connections. SCS is implemented in the port driver (for example, PADRIVER, PBDRIVER, PEDRIVER, PIDRIVER) and in a loadable piece of the operating system called SCSLOA.EXE (loaded automatically during system initialization).
When a virtual circuit has been opened, a computer periodically probes
a remote computer for system services that the remote computer may be
offering. The SCS directory service, which makes known the services that a computer offers, is always present on both computers and HSC subsystems. As system services discover their counterparts on other
computers and HSC subsystems, they establish SCS connections to each
other. These connections are full duplex and are associated with a
particular virtual circuit. Multiple connections are typically
associated with a virtual circuit.
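You can watch SCS connections form and identify their remote partners by adding the connection class to a SHOW CLUSTER display; for example:

$ SHOW CLUSTER/CONTINUOUS
Command > ADD CONNECTIONS
Command > ADD REM_PROC_NAME
Command > EXIT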
C.10 Diagnosing Port Failures
This section describes the hierarchy of communication paths and where failures can occur.
C.10.1 Hierarchy of Communication Paths
Taken together, SCS, the port drivers, and the port itself support a hierarchy of communication paths. Starting with the most fundamental level, these are as follows:
C.10.2 Where Failures Occur
Failures can occur at each communication level and in each component. Failures at one level translate into failures elsewhere, as described in Table C-3.
C.10.3 Verifying CI Port Functions
Before you boot into a cluster a CI-connected computer that is new, just repaired, or suspected of having a problem, you should have Compaq services verify that the computer runs correctly on its own.