Section D.5 provides troubleshooting techniques for LAN component failures (for example, broken LAN bridges). That appendix also describes techniques for using the Local Area OpenVMS Cluster Network Failure Analysis Program.
Intermittent LAN component failures (for example, packet loss) can
cause problems in the NISCA transport protocol that delivers System
Communications Services (SCS) messages to other nodes in the OpenVMS
Cluster. Appendix F describes troubleshooting techniques and
requirements for LAN analyzer tools.
C.6 Diagnosing Cluster Hangs
Conditions like the following can cause an OpenVMS Cluster computer to suspend process or system activity (that is, to hang):
Condition | Reference |
---|---|
Cluster quorum is lost. | Section C.6.1 |
A shared cluster resource is inaccessible. | Section C.6.2 |
C.6.1 Cluster Quorum Is Lost
The OpenVMS Cluster quorum algorithm coordinates activity among OpenVMS Cluster computers and ensures the integrity of shared cluster resources. (The quorum algorithm is described fully in Chapter 2.) Quorum is checked after any change to the cluster configuration, for example, when a voting computer leaves or joins the cluster. If quorum is lost, process and I/O activity on all computers in the cluster are blocked.
Information about the loss of quorum and about clusterwide events that cause loss of quorum is sent to the OPCOM process, which broadcasts messages to designated operator terminals. The information is also broadcast to each computer's operator console (OPA0), unless broadcast activity is explicitly disabled on that terminal. However, because quorum may be lost before OPCOM has been able to inform the operator terminals, the messages sent to OPA0 are the most reliable source of information about events that cause loss of quorum.
If quorum is lost, you might add or reboot a node with additional votes.
Reference: See also the information about cluster
quorum in Section 10.11.
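The following sketch illustrates why adding or rebooting a node with votes can release a quorum hang. It assumes the commonly documented rule that quorum is (EXPECTED_VOTES + 2) / 2, rounded down. The function names are hypothetical, and the sketch ignores quorum disks. Chapter 2 remains the authoritative description.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES (assumed rule, integer division)."""
    return (expected_votes + 2) // 2


def has_quorum(current_votes: int, expected_votes: int) -> bool:
    """Cluster activity proceeds only while the votes present reach quorum."""
    return current_votes >= quorum(expected_votes)


# Three voting computers with one vote each.
print(quorum(3))          # 2
print(has_quorum(1, 3))   # False: two voters are gone, so activity is blocked
print(has_quorum(2, 3))   # True: one voter has rejoined
```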
C.6.2 Inaccessible Cluster Resource
Access to shared cluster resources is coordinated by the distributed lock manager. If a particular process is granted a lock on a resource (for example, a shared data file), other processes in the cluster that request incompatible locks on that resource must wait until the original lock is released. If the original process retains its lock for an extended period, other processes waiting for the lock to be released may appear to hang.
Occasionally, a system activity must acquire a restrictive lock on a resource for an extended period. For example, to perform a volume rebuild, system software takes out an exclusive lock on the volume being rebuilt. While this lock is held, no processes can allocate space on the disk volume. If they attempt to do so, they may appear to hang.
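The waiting behavior described above can be modeled with a small sketch. It is a simplified illustration, not the lock manager's implementation: it assumes the six standard lock modes and their usual compatibility rules, uses a single first-in, first-out wait queue, and omits lock conversions. The class and method names are hypothetical.

```python
# Modes: NL (null), CR (concurrent read), CW (concurrent write),
# PR (protected read), PW (protected write), EX (exclusive).
# COMPATIBLE[held] is the set of modes that can be granted alongside a held mode.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


class Resource:
    """Toy model of one resource with a granted list and a FIFO wait queue."""

    def __init__(self, name):
        self.name = name
        self.granted = []   # (process, mode) pairs currently holding locks
        self.waiting = []   # (process, mode) pairs, oldest first

    def _fits(self, mode):
        return all(mode in COMPATIBLE[held] for _, held in self.granted)

    def request(self, process, mode):
        """Grant the lock if nothing is queued and it is compatible; else wait."""
        if not self.waiting and self._fits(mode):
            self.granted.append((process, mode))
            return "granted"
        self.waiting.append((process, mode))
        return "waiting"

    def release(self, process):
        """Drop the process's locks, then grant waiters in arrival order."""
        self.granted = [(p, m) for p, m in self.granted if p != process]
        while self.waiting and self._fits(self.waiting[0][1]):
            self.granted.append(self.waiting.pop(0))


volume = Resource("disk volume")
print(volume.request("rebuild", "EX"))    # granted
print(volume.request("allocator", "PW"))  # waiting: appears to hang
volume.release("rebuild")
print(volume.granted)                     # [('allocator', 'PW')]
```

In this model, the second process makes no progress until the exclusive holder releases its lock, which is the behavior a user perceives as a hang.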
Access to files that contain data necessary for the operation of the system itself is coordinated by the distributed lock manager. For this reason, a process that acquires a lock on one of these resources and is then unable to proceed may cause the cluster to appear to hang.
For example, this condition may occur if a process locks a portion of the system authorization file (SYS$SYSTEM:SYSUAF.DAT) for write access. Any activity that requires access to that portion of the file, such as logging in to an account with the same or similar user name or sending mail to that user name, is blocked until the original lock is released. Normally, this lock is released quickly, and users do not notice the locking operation.
However, if the process holding the lock is unable to proceed, other
processes could enter a wait state. Because the authorization file is
used during login and for most process creation operations (for
example, batch and network jobs), blocked processes could rapidly
accumulate in the cluster. Because the distributed lock manager is
functioning normally under these conditions, users are not notified by
broadcast messages or other means that a problem has occurred.
C.7 Diagnosing CLUEXIT Bugchecks
The operating system performs bugcheck operations only
when it detects conditions that could compromise normal system activity
or endanger data integrity. A CLUEXIT bugcheck is a
type of bugcheck initiated by the connection manager, the OpenVMS
Cluster software component that manages the interaction of cooperating
OpenVMS Cluster computers. Most such bugchecks are triggered by
conditions resulting from hardware failures (particularly failures in
communications paths), configuration errors, or system management
errors.
C.7.1 Conditions Causing Bugchecks
The most common conditions that result in CLUEXIT bugchecks are as follows:
These sections provide detailed information about port communications
to assist in diagnosing port communication problems.
C.8.1 LAN Communications
For clusters that include Ethernet or FDDI interconnects, a multicast scheme is used to locate computers on the LAN. Approximately every 3 seconds, the port emulator driver (PEDRIVER) sends a HELLO datagram message through each LAN adapter to a cluster-specific multicast address that is derived from the cluster group number. The driver also enables the reception of these messages from other computers. When the driver receives a HELLO datagram message from a computer with which it does not currently share an open virtual circuit, it attempts to create a circuit. HELLO datagram messages received from a computer with a currently open virtual circuit indicate that the remote computer is operational.
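The decision that the driver makes on receiving a HELLO datagram message can be sketched as follows. This is an illustrative model only: the class, method names, and return values are hypothetical and do not represent PEDRIVER data structures.

```python
HELLO_INTERVAL_SECONDS = 3  # approximate interval stated in the text


class HelloListener:
    """Tracks which remote computers have an open virtual circuit."""

    def __init__(self):
        self.open_circuits = set()  # computers with an open virtual circuit
        self.last_heard = {}        # computer -> time of the most recent HELLO

    def on_hello(self, sender, now):
        """Classify an incoming HELLO datagram message."""
        self.last_heard[sender] = now
        if sender in self.open_circuits:
            return "alive"            # remote computer is operational
        return "start-handshake"      # no circuit yet: attempt to create one

    def circuit_opened(self, sender):
        """Record that the handshake with the sender completed."""
        self.open_circuits.add(sender)


listener = HelloListener()
print(listener.on_hello("NODEB", now=0))   # start-handshake
listener.circuit_opened("NODEB")
print(listener.on_hello("NODEB", now=3))   # alive
```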
A standard, three-message exchange handshake is used to create a
virtual circuit. The handshake messages contain information about the
transmitting computer and its record of the cluster password. These
parameters are verified at the receiving computer, which continues the
handshake only if its verification is successful. Thus, each computer
authenticates the other. After the final message, the virtual circuit
is opened for use by both computers.
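The mutual verification in the handshake can be outlined with the following sketch. The message kinds (START, VERIFY, ACK), the class, and the field names are hypothetical labels chosen for illustration, and the password record is treated as an opaque value. The sketch shows only the order of verification, not the actual message formats.

```python
class Computer:
    """Hypothetical stand-in for one computer's handshake state."""

    def __init__(self, name, password_record):
        self.name = name
        self.password_record = password_record  # opaque, compared but not interpreted

    def build(self, kind):
        """Build a handshake message describing the transmitting computer."""
        return {
            "kind": kind,
            "sender": self.name,
            "password_record": self.password_record,
        }

    def verify(self, message):
        """Accept a message only if its password record matches the local one."""
        return message["password_record"] == self.password_record


def open_virtual_circuit(initiator, responder):
    """Run the three-message exchange; return True if the circuit opens."""
    first = initiator.build("START")      # message 1: initiator to responder
    if not responder.verify(first):
        return False                      # responder does not continue
    second = responder.build("VERIFY")    # message 2: responder to initiator
    if not initiator.verify(second):
        return False                      # initiator does not continue
    initiator.build("ACK")                # message 3: final message
    return True                           # circuit is open for both computers


a = Computer("NODEA", "record-1")
b = Computer("NODEB", "record-1")
c = Computer("NODEC", "record-2")
print(open_virtual_circuit(a, b))  # True
print(open_virtual_circuit(a, c))  # False: password records differ
```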
C.8.2 System Communications Services (SCS) Connections
System services such as the disk class driver, connection manager, and the MSCP and TMSCP servers communicate between computers with a protocol called System Communications Services (SCS). SCS is responsible primarily for forming and breaking intersystem process connections and for controlling flow of message traffic over those connections. SCS is implemented in the port driver (for example, PADRIVER, PBDRIVER, PEDRIVER, PIDRIVER), and in a loadable piece of the operating system called SCSLOA.EXE (loaded automatically during system initialization).
When a virtual circuit has been opened, a computer periodically probes
a remote computer for system services that the remote computer may be
offering. The SCS directory service, which makes known services that a
computer is offering, is always present on both computers and HSC
subsystems. As system services discover their counterparts on other
computers and HSC subsystems, they establish SCS connections to each
other. These connections are full duplex and are associated with a
particular virtual circuit. Multiple connections are typically
associated with a virtual circuit.
C.9 Diagnosing Port Failures
This section describes the hierarchy of communication paths and where failures can occur.
C.9.1 Hierarchy of Communication Paths
Taken together, SCS, the port drivers, and the port itself support a hierarchy of communication paths. Starting with the most fundamental level, these are as follows:
Failures can occur at each communication level and in each component. Failures at one level translate into failures elsewhere, as described in Table C-3.
Communication Level | Failures |
---|---|
Wires | If the LAN fails or is disconnected, LAN traffic stops or is interrupted, depending on the nature of the failure. All traffic is directed over the remaining good path. When the wire is repaired, the repair is detected automatically by port polling, and normal operations resume on all ports. |
Virtual circuit | If no path works between a pair of ports, the virtual circuit fails and is closed. For the LAN, a path failure is detected when no multicast HELLO datagram message or incoming traffic is received from another computer. When a virtual circuit fails, every SCS connection on it is closed. The software automatically reestablishes connections when the virtual circuit is reestablished. Normally, reestablishing a virtual circuit takes several seconds after the problem is corrected. |
LAN adapter | If a LAN adapter device fails, attempts are made to restart it. If repeated attempts fail, all channels using that adapter are broken. A channel is a pair of LAN addresses, one local and one remote. If the last open channel for a virtual circuit fails, the virtual circuit is closed and the connections are broken. |
SCS connection | When the software protocols fail or, in some instances, when the software detects a hardware malfunction, a connection is terminated. Other connections are usually unaffected, as is the virtual circuit. Breaking of connections is also used under certain conditions as an error recovery mechanism, most commonly when there is insufficient nonpaged pool available on the computer. |
Computer | If a computer fails because of operator shutdown, bugcheck, or halt, all other computers in the cluster record the shutdown as failures of their virtual circuits to the port on the shut-down computer. |
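
The way a failure at one level translates into failures at other levels can be sketched as follows. This is a simplified model under stated assumptions: a channel is represented as a (local address, remote address) pair, as in the table, and the class name and the address labels are hypothetical.

```python
class VirtualCircuit:
    """Toy model of one virtual circuit, its channels, and its SCS connections."""

    def __init__(self, remote, channels, connections):
        self.remote = remote
        self.channels = set(channels)        # (local address, remote address) pairs
        self.connections = set(connections)  # names of SCS connections on the circuit
        self.state = "OPEN"

    def adapter_failed(self, local_address):
        """Break every channel that uses the failed local adapter."""
        self.channels = {c for c in self.channels if c[0] != local_address}
        if not self.channels:
            self.close()                     # last open channel is gone

    def close(self):
        """Closing the circuit closes every SCS connection on it."""
        self.state = "CLOSED"
        self.connections.clear()


vc = VirtualCircuit(
    "NODEB",
    channels=[("local-1", "remote-1"), ("local-2", "remote-1")],
    connections=["connection-1", "connection-2"],
)
vc.adapter_failed("local-1")
print(vc.state, len(vc.connections))   # OPEN 2
vc.adapter_failed("local-2")
print(vc.state, len(vc.connections))   # CLOSED 0
```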
To diagnose communication problems, you can invoke the Show Cluster utility using the instructions in Table C-4.
Step | Action | What to Look for |
---|---|---|
1 | Tailor the SHOW CLUSTER report by entering the SHOW CLUSTER command ADD CIRCUIT,CABLE_STATUS. This command adds a class of information about all the virtual circuits as seen from the computer on which you are running SHOW CLUSTER. CABLE_STATUS indicates the status of the path for the circuit from the CI interface on the local system to the CI interface on the remote system. | Primarily, you are checking whether there is a virtual circuit in the OPEN state to the failing computer. Common causes of failure to open a virtual circuit and keep it open are the following: |
2 | Run SHOW CLUSTER from each active computer in the cluster to verify whether each computer's view of the failing computer is consistent with every other computer's view. | If no virtual circuit is open to the failing computer, check the bottom of the SHOW CLUSTER display: |
The Local Area OpenVMS Cluster Network Failure Analysis Program described in Section D.4 uses the HELLO datagram messages to continuously verify the network paths (channels) used by PEDRIVER. This verification process, combined with a physical description of the network, can:
Monitoring events recorded in the error log can help you anticipate and
avoid potential problems. From the total error count (displayed by the
DCL command SHOW DEVICES device-name), you can determine
whether errors are increasing. If so, you should examine the error log.
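One way to judge whether errors are increasing is to note the error count at intervals and compare successive readings. The following sketch assumes the counts were recorded by hand from repeated SHOW DEVICES displays. The function names and sample values are hypothetical.

```python
def error_count_deltas(samples):
    """Return the change between consecutive error-count readings (oldest first)."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def errors_increasing(samples):
    """True if any reading is higher than the one before it."""
    return any(delta > 0 for delta in error_count_deltas(samples))


readings = [2, 2, 5]                  # hypothetical counts from three displays
print(error_count_deltas(readings))   # [0, 3]
print(errors_increasing(readings))    # True: examine the error log
```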
C.10.1 Examine the Error Log
The DCL command ANALYZE/ERROR_LOG invokes the Error Log utility to report the contents of an error-log file.
Reference: For more information about the Error Log utility, see the HP OpenVMS System Management Utilities Reference Manual.
Some error-log entries are informational only while others require action.
Error Type | Action Required? | Purpose |
---|---|---|
Informational error-log entries require no action. For example, if you shut down a computer in the cluster, all other active computers that have open virtual circuits between themselves and the computer that has been shut down make entries in their error logs. Such computers record up to three errors for the event: | No | These messages are normal and reflect the change of state in the circuits to the computer that has been shut down. |
Other error-log entries indicate problems that degrade operation or nonfatal hardware problems. The operating system might continue to run satisfactorily under these conditions. | Yes | Detecting these problems early is important to preventing nonfatal problems (such as loss of a single CI path) from becoming serious problems (such as loss of both paths). |
Errors and other events on the LAN cause port drivers to enter information in the system error log in one of two formats:
Section C.10.4 describes those formats.
C.10.3 LAN Device-Attention Entries
Example C-1 shows device-attention entries for the LAN.
Example C-1 LAN Device-Attention Entry

    **** V3.4 ********************* ENTRY 337 ********************************  (1)
    Logging OS                  1. OpenVMS
    System Architecture         2. Alpha
    OS version                  XC56-BL2
    Event sequence number       96.
    Timestamp of occurrence     16-SEP-2009 16:33:03                            (2)
    Time since reboot           0 Day(s) 0:50:08
    Host name                   PERK
    System Model                AlphaServer ES45 Model 2                        (3)
    Entry Type                  98. Asynchronous Device Attention

    ---- Device Profile ----
    Unit                        PERK$PEA0                                       (4)
    Product Name                NI-SCA Port

    ---- NISCA Port Data ----
    Error Type and SubType      x0700 Device Error, Fatal Error Detected by Datalink  (5)
    Status                      x0000120100000001                               (6)
    Datalink Device Name        EIA2:                                           (7)
    Remote Node Name                                                            (8)
    Remote Address              x0000000000000000                               (9)
    Local Address               x000063B4000400AA                               (10)
    Error Count                 1. Error Occurrences This Entry                 (11)

    ----- Software Info -----
    UCB$x_ERRCNT                2. Errors This Unit

The following table describes the LAN device-attention entries in Example C-1.
Entry | Description |
---|---|
(1) | The four lines are the entry heading. These lines contain the number of the entry in this error log file, the architecture, the OS version and the sequence number of this error. Each entry in the log file contains such a heading. |
(2) | This line contains the date and time. |
(3) | The next two lines contain the system model and the entry type. |
(4) | This line shows the name of the subsystem and component that caused the entry. |
(5) | This line shows the reason for the entry. The LAN driver has shut down the data link because of a fatal error. The data link will be restarted automatically, if possible. |
(6) | The first longword shows the I/O completion status returned by the LAN driver. The second longword is the VCI event code delivered to PEDRIVER by the LAN driver. |
(7) | DATALINK NAME is the name of the LAN device on which the error occurred. |
(8) | REMOTE NODE is the name of the remote node to which the packet was being sent. If zeros are displayed, either no remote node was available or no packet was associated with the error. |
(9) | REMOTE ADDR is the LAN address of the remote node to which the packet was being sent. If zeros are displayed, no packet was associated with the error. |
(10) | LOCAL ADDR is the LAN address of the local node. |
(11) | ERROR CNT. Because some errors can occur at extremely high rates, some error log entries represent more than one occurrence of an error. This field indicates how many. The errors counted occurred in the 3 seconds preceding the timestamp on the entry. |
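
The Status and Local Address fields in Example C-1 can be unpacked with a short sketch. It rests on two assumptions that the text does not state explicitly: that the first longword is the low-order half of the displayed quadword, and that the LAN address is stored low-order byte first. The function names are hypothetical.

```python
def split_quadword(value: int):
    """Split a 64-bit value into (first longword, second longword).

    Assumption: the first longword is the low-order 32 bits.
    """
    first = value & 0xFFFFFFFF
    second = (value >> 32) & 0xFFFFFFFF
    return first, second


def lan_address(value: int) -> str:
    """Format the low-order six bytes as a LAN address.

    Assumption: the address is stored low-order byte first.
    """
    raw = value.to_bytes(8, "little")[:6]
    return "-".join(f"{byte:02X}" for byte in raw)


status, event_code = split_quadword(0x0000120100000001)  # Status field (6)
print(f"{status:#010x} {event_code:#010x}")              # 0x00000001 0x00001201
print(lan_address(0x000063B4000400AA))                   # AA-00-04-00-B4-63
```

Under these assumptions, the I/O completion status is x00000001 and the VCI event code is x00001201.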