HP OpenVMS Cluster Systems

C.5 Diagnosing LAN Component Failures

Section D.5 provides troubleshooting techniques for LAN component failures (for example, broken LAN bridges). Appendix D also describes techniques for using the Local Area OpenVMS Cluster Network Failure Analysis Program.

Intermittent LAN component failures (for example, packet loss) can cause problems in the NISCA transport protocol that delivers System Communications Services (SCS) messages to other nodes in the OpenVMS Cluster. Appendix F describes troubleshooting techniques and requirements for LAN analyzer tools.

C.6 Diagnosing Cluster Hangs

Conditions like the following can cause an OpenVMS Cluster computer to suspend process or system activity (that is, to hang):

Condition	Reference
Cluster quorum is lost.	Section C.6.1
A shared cluster resource is inaccessible.	Section C.6.2

C.6.1 Cluster Quorum is Lost

The OpenVMS Cluster quorum algorithm coordinates activity among OpenVMS Cluster computers and ensures the integrity of shared cluster resources. (The quorum algorithm is described fully in Chapter 2.) Quorum is checked after any change to the cluster configuration---for example, when a voting computer leaves or joins the cluster. If quorum is lost, process and I/O activity on all computers in the cluster are blocked.

Information about the loss of quorum and about clusterwide events that cause loss of quorum is sent to the OPCOM process, which broadcasts messages to designated operator terminals. The information is also broadcast to each computer's operator console (OPA0), unless broadcast activity is explicitly disabled on that terminal. However, because quorum may be lost before OPCOM has been able to inform the operator terminals, the messages sent to OPA0 are the most reliable source of information about events that cause loss of quorum.

If quorum is lost, you might add or reboot a node with additional votes.

Reference: See also the information about cluster quorum in Section 10.11.
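
As a rough illustration of the quorum check, the following Python sketch applies the calculation quorum = (EXPECTED_VOTES + 2) / 2, with the fractional part discarded. The node names and vote counts are hypothetical, and the sketch ignores quorum disks and the dynamic adjustments made by the connection manager. Treat it as a model for reasoning about votes, not as the actual implementation.

```python
def quorum_from_expected_votes(expected_votes: int) -> int:
    """Return the quorum value for a given EXPECTED_VOTES setting."""
    if expected_votes < 1:
        raise ValueError("EXPECTED_VOTES must be at least 1")
    # Integer division: any fractional part is discarded.
    return (expected_votes + 2) // 2


def has_quorum(member_votes: dict, expected_votes: int) -> bool:
    """Return True if the votes of the current members meet quorum."""
    return sum(member_votes.values()) >= quorum_from_expected_votes(expected_votes)


if __name__ == "__main__":
    # Hypothetical three-node cluster, one vote each, EXPECTED_VOTES = 3.
    members = {"NODEA": 1, "NODEB": 1, "NODEC": 1}
    print(has_quorum(members, 3))

    # Two voting members leave: process and I/O activity would be blocked.
    survivors = {"NODEA": 1}
    print(has_quorum(survivors, 3))

    # Rebooting one voting node brings the vote total back up to quorum.
    survivors["NODEB"] = 1
    print(has_quorum(survivors, 3))
```

In this model, adding or rebooting a node with votes raises the vote total until it reaches the quorum value again.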

C.6.2 Inaccessible Cluster Resource

Access to shared cluster resources is coordinated by the distributed lock manager. If a particular process is granted a lock on a resource (for example, a shared data file), other processes in the cluster that request incompatible locks on that resource must wait until the original lock is released. If the original process retains its lock for an extended period, other processes waiting for the lock to be released may appear to hang.

Occasionally, a system activity must acquire a restrictive lock on a resource for an extended period. For example, to perform a volume rebuild, system software takes out an exclusive lock on the volume being rebuilt. While this lock is held, no processes can allocate space on the disk volume. If they attempt to do so, they may appear to hang.

Access to files that contain data necessary for the operation of the system itself is coordinated by the distributed lock manager. For this reason, a process that acquires a lock on one of these resources and is then unable to proceed may cause the cluster to appear to hang.

For example, this condition may occur if a process locks a portion of the system authorization file (SYS$SYSTEM:SYSUAF.DAT) for write access. Any activity that requires access to that portion of the file, such as logging in to an account with the same or similar user name or sending mail to that user name, is blocked until the original lock is released. Normally, this lock is released quickly, and users do not notice the locking operation.

However, if the process holding the lock is unable to proceed, other processes could enter a wait state. Because the authorization file is used during login and for most process creation operations (for example, batch and network jobs), blocked processes could rapidly accumulate in the cluster. Because the distributed lock manager is functioning normally under these conditions, users are not notified by broadcast messages or other means that a problem has occurred.
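
The way blocked processes accumulate behind a single lock holder can be sketched with a small Python model. The six mode names (NL, CR, CW, PR, PW, EX) and their compatibility are the standard OpenVMS lock manager modes, which this section does not list; the `Resource` class, the process names, and the first-in, first-out queue are simplifying assumptions for illustration and do not represent the lock manager's programming interface.

```python
from collections import deque

# For each held mode, the set of requested modes that can be granted with it.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


class Resource:
    """Toy model of one lockable resource (hypothetical, for illustration)."""

    def __init__(self, name: str):
        self.name = name
        self.granted = []        # list of (process, mode) currently holding locks
        self.waiting = deque()   # FIFO queue of (process, mode) requests

    def _grantable(self, mode: str) -> bool:
        return all(mode in COMPATIBLE[held] for _, held in self.granted)

    def request(self, process: str, mode: str) -> bool:
        """Grant the lock if compatible and nobody is queued; otherwise wait."""
        if not self.waiting and self._grantable(mode):
            self.granted.append((process, mode))
            return True
        self.waiting.append((process, mode))
        return False

    def release(self, process: str) -> None:
        """Drop the locks held by a process, then grant waiters in order."""
        self.granted = [(p, m) for p, m in self.granted if p != process]
        while self.waiting and self._grantable(self.waiting[0][1]):
            self.granted.append(self.waiting.popleft())


if __name__ == "__main__":
    uaf = Resource("SYSUAF record")
    uaf.request("STUCK_PROCESS", "EX")      # holder that never proceeds
    for n in range(3):
        uaf.request(f"LOGIN_{n}", "PR")     # each login attempt queues up
    print(len(uaf.waiting), "processes waiting")
    uaf.release("STUCK_PROCESS")
    print(len(uaf.waiting), "processes waiting after release")
```

Nothing in the model signals an error while the queue grows, which mirrors why users receive no notification: from the lock manager's point of view, waiting is normal behavior.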

C.7 Diagnosing CLUEXIT Bugchecks

The operating system performs bugcheck operations only when it detects conditions that could compromise normal system activity or endanger data integrity. A CLUEXIT bugcheck is a type of bugcheck initiated by the connection manager, the OpenVMS Cluster software component that manages the interaction of cooperating OpenVMS Cluster computers. Most such bugchecks are triggered by conditions resulting from hardware failures (particularly failures in communications paths), configuration errors, or system management errors.

C.7.1 Conditions Causing Bugchecks

The most common conditions that result in CLUEXIT bugchecks are as follows:

Possible Bugcheck Causes	Recommendations
The cluster connection between two computers is broken for longer than RECNXINTERVAL seconds. Thereafter, the connection is declared irrevocably broken. If the connection is later reestablished, one of the computers shuts down with a CLUEXIT bugcheck. This condition can occur upon recovery with battery backup after a power failure; after the repair of an SCS communication link; or after the computer was halted for a period longer than the number of seconds specified for the RECNXINTERVAL parameter and was restarted with a CONTINUE command entered at the operator console.	Determine the cause of the interrupted connection and correct the problem. For example, if recovery from a power failure is longer than RECNXINTERVAL seconds, you may want to increase the value of the RECNXINTERVAL parameter on all computers.
Cluster partitioning occurs. A member of a cluster discovers or establishes connection to a member of another cluster, or a foreign cluster is detected in the quorum file.	Review the setting of EXPECTED_VOTES on all computers.
The value specified for the SCSMAXMSG system parameter on a computer is too small.	Verify that the value of SCSMAXMSG on all OpenVMS Cluster computers is set to a value that is at least the default value.
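
The RECNXINTERVAL rule in the first row can be expressed as a simple comparison. The following Python sketch is only a planning aid under stated assumptions: the function names, the outage names, and the durations are hypothetical, and the sketch does not read or set any system parameter.

```python
def connection_outcome(outage_seconds: float, recnxinterval: int) -> str:
    """Classify an interrupted cluster connection.

    A connection broken for longer than RECNXINTERVAL seconds is treated
    as irrevocably broken; a shorter interruption can still be recovered.
    """
    if outage_seconds > recnxinterval:
        return "irrevocably broken"
    return "recoverable"


def outages_exceeding(outages: dict, recnxinterval: int) -> list:
    """Return the names of outages that last longer than RECNXINTERVAL."""
    return sorted(name for name, secs in outages.items() if secs > recnxinterval)


if __name__ == "__main__":
    # Hypothetical measured recovery times, in seconds.
    measured = {"battery-backup recovery": 45, "link repair": 10}
    for value in (20, 60):
        print(value, outages_exceeding(measured, value))
```

Comparing measured recovery times against a candidate value in this way can help decide whether raising RECNXINTERVAL on all computers would avoid the bugcheck.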

C.8 Port Communications

These sections provide detailed information about port communications to assist in diagnosing port communication problems.

C.8.1 LAN Communications

For clusters that include Ethernet or FDDI interconnects, a multicast scheme is used to locate computers on the LAN. Approximately every 3 seconds, the port emulator driver (PEDRIVER) sends a HELLO datagram message through each LAN adapter to a cluster-specific multicast address that is derived from the cluster group number. The driver also enables the reception of these messages from other computers. When the driver receives a HELLO datagram message from a computer with which it does not currently share an open virtual circuit, it attempts to create a circuit. HELLO datagram messages received from a computer with a currently open virtual circuit indicate that the remote computer is operational.

A standard, three-message exchange handshake is used to create a virtual circuit. The handshake messages contain information about the transmitting computer and its record of the cluster password. These parameters are verified at the receiving computer, which continues the handshake only if its verification is successful. Thus, each computer authenticates the other. After the final message, the virtual circuit is opened for use by both computers.
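
The HELLO processing and the three-message handshake described above can be modeled in a few lines of Python. This is a sketch under stated assumptions: the `Node` class, the message contents, and the plain comparison of a password value are stand-ins chosen for illustration, and they do not reproduce the actual NISCA message formats or the way PEDRIVER stores and checks the cluster password.

```python
import hmac


class Node:
    """Toy cluster member; not the PEDRIVER implementation."""

    def __init__(self, name: str, group: int, password: bytes):
        self.name = name
        self.group = group
        self.password = password
        self.open_circuits = set()   # names of nodes with an open virtual circuit
        self.last_heard = {}         # node name -> time of the last HELLO

    def credentials(self) -> tuple:
        # Stand-in for the identifying data carried in a handshake message.
        return (self.name, self.group, self.password)

    def verify(self, creds: tuple) -> bool:
        _, group, password = creds
        return group == self.group and hmac.compare_digest(password, self.password)

    def on_hello(self, sender: "Node", now: float) -> str:
        """Handle a HELLO datagram received from `sender`."""
        self.last_heard[sender.name] = now
        if sender.name in self.open_circuits:
            return "alive"           # circuit already open: remote is operational
        return "opened" if handshake(self, sender) else "rejected"


def handshake(initiator: Node, responder: Node) -> bool:
    """Three-message exchange in which each side verifies the other."""
    # Message 1: initiator -> responder.
    if not responder.verify(initiator.credentials()):
        return False
    # Message 2: responder -> initiator.
    if not initiator.verify(responder.credentials()):
        return False
    # Message 3: initiator -> responder; the circuit opens on both sides.
    initiator.open_circuits.add(responder.name)
    responder.open_circuits.add(initiator.name)
    return True


if __name__ == "__main__":
    a = Node("NODEA", 1, b"secret")
    b = Node("NODEB", 1, b"secret")
    intruder = Node("NODEX", 1, b"wrong")
    print(a.on_hello(b, 0.0))         # first HELLO triggers the handshake
    print(a.on_hello(b, 3.0))         # later HELLOs confirm the node is up
    print(a.on_hello(intruder, 4.0))  # verification fails, no circuit
```

The point of the model is the ordering: verification happens at each receiver before the handshake continues, so a circuit opens only after both sides have checked the other.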

C.8.2 System Communications Services (SCS) Connections

System services such as the disk class driver, connection manager, and the MSCP and TMSCP servers communicate between computers with a protocol called System Communications Services (SCS). SCS is responsible primarily for forming and breaking intersystem process connections and for controlling flow of message traffic over those connections. SCS is implemented in the port driver (for example, PADRIVER, PBDRIVER, PEDRIVER, PIDRIVER), and in a loadable piece of the operating system called SCSLOA.EXE (loaded automatically during system initialization).

When a virtual circuit has been opened, a computer periodically probes a remote computer for system services that the remote computer may be offering. The SCS directory service, which makes known services that a computer is offering, is always present both on computers and HSC subsystems. As system services discover their counterparts on other computers and HSC subsystems, they establish SCS connections to each other. These connections are full duplex and are associated with a particular virtual circuit. Multiple connections are typically associated with a virtual circuit.

C.9 Diagnosing Port Failures

This section describes the hierarchy of communication paths and describes where failures can occur.

C.9.1 Hierarchy of Communication Paths

Taken together, SCS, the port drivers, and the port itself support a hierarchy of communication paths. Starting with the most fundamental level, these are as follows:

The physical wires. The Ethernet is a single coaxial cable. The port chooses the free path or, if both are free, an arbitrary path (implemented in the cables and managed by the port).
The virtual circuit (implemented partly in the LAN port emulator driver (PEDRIVER) and partly in SCS software).
The SCS connections (implemented in system software).

C.9.2 Where Failures Occur

Failures can occur at each communication level and in each component. Failures at one level translate into failures elsewhere, as described in Table C-3.

**Table C-3 Port Failures**
Communication Level	Failures
Wires	If the LAN fails or is disconnected, LAN traffic stops or is interrupted, depending on the nature of the failure. All traffic is directed over the remaining good path. When the wire is repaired, the repair is detected automatically by port polling, and normal operations resume on all ports.
Virtual circuit	If no path works between a pair of ports, the virtual circuit fails and is closed. For the LAN, a path failure is discovered when no multicast HELLO datagram message or incoming traffic is received from another computer. When a virtual circuit fails, every SCS connection on it is closed. The software automatically reestablishes connections when the virtual circuit is reestablished. Normally, reestablishing a virtual circuit takes several seconds after the problem is corrected.
LAN adapter	If a LAN adapter device fails, attempts are made to restart it. If repeated attempts fail, all channels using that adapter are broken. A channel is a pair of LAN addresses, one local and one remote. If the last open channel for a virtual circuit fails, the virtual circuit is closed and the connections are broken.
SCS connection	When the software protocols fail or, in some instances, when the software detects a hardware malfunction, a connection is terminated. Other connections are usually unaffected, as is the virtual circuit. Breaking of connections is also used under certain conditions as an error recovery mechanism---most commonly when there is insufficient nonpaged pool available on the computer.
Computer	If a computer fails because of operator shutdown, bugcheck, or halt, all other computers in the cluster record the shutdown as failures of their virtual circuits to the port on the computer that was shut down.
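
The way a failure at one level propagates to the levels above it can be sketched as follows. The Python model below is a simplification under stated assumptions: the `VirtualCircuit` class, the adapter names, and the connection names are hypothetical, and restart attempts and automatic reconnection are left out.

```python
class VirtualCircuit:
    """Toy model of one virtual circuit and the channels beneath it."""

    def __init__(self, remote: str, channels):
        self.remote = remote
        # A channel is a pair of LAN addresses: (local adapter, remote adapter).
        self.open_channels = set(channels)
        self.connections = set()         # SCS connections carried by the circuit
        self.is_open = bool(self.open_channels)

    def connect(self, name: str) -> None:
        if not self.is_open:
            raise RuntimeError("virtual circuit is closed")
        self.connections.add(name)

    def adapter_failed(self, local_adapter: str) -> list:
        """Break every channel on a failed adapter.

        Returns the SCS connections that were closed as a result, which is
        an empty list as long as at least one channel remains open.
        """
        self.open_channels = {c for c in self.open_channels if c[0] != local_adapter}
        if self.is_open and not self.open_channels:
            self.is_open = False
            closed = sorted(self.connections)
            self.connections.clear()
            return closed
        return []


if __name__ == "__main__":
    vc = VirtualCircuit("NODEB", [("EIA0", "EIB0"), ("EIA1", "EIB1")])
    vc.connect("connection manager")
    vc.connect("disk class driver")
    print(vc.adapter_failed("EIA0"))   # one channel left: nothing is closed
    print(vc.adapter_failed("EIA1"))   # last channel gone: circuit and connections close
```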

C.9.3 Verifying Virtual Circuits

To diagnose communication problems, you can invoke the Show Cluster utility using the instructions in Table C-4.

**Table C-4 How to Verify Virtual Circuit States**
Step	Action	What to Look for
1	Tailor the SHOW CLUSTER report by entering the SHOW CLUSTER command ADD CIRCUIT,CABLE_STATUS. This command adds a class of information about all the virtual circuits as seen from the computer on which you are running SHOW CLUSTER. CABLE_STATUS indicates the status of the path for the circuit from the CI interface on the local system to the CI interface on the remote system.	Primarily, you are checking whether there is a virtual circuit in the OPEN state to the failing computer. Common causes of failure to open a virtual circuit and keep it open are the following: port errors on one side or the other; cabling errors; a port set off line because of software problems; insufficient nonpaged pool on both sides; failure to set correct values for the SCSNODE, SCSSYSTEMID, PAMAXPORT, PANOPOLL, PASTIMOUT, and PAPOLLINTERVAL system parameters.
2	Run SHOW CLUSTER from each active computer in the cluster to verify whether each computer's view of the failing computer is consistent with every other computer's view.	When all the active computers have a consistent view of the failing computer, the problem may be in the failing computer. When only one of several active computers detects that the newcomer is failing, that particular computer may have a problem.

If no virtual circuit is open to the failing computer, check the bottom of the SHOW CLUSTER display:

For information about circuits to the port of the failing computer. Virtual circuits in partially open states are shown at the bottom of the display. If the circuit is shown in a state other than OPEN, communications between the local and remote ports are taking place, and the failure is probably at a higher level than in port or cable hardware.
To see whether both path A and path B to the failing port are good. The loss of one path should not prevent a computer from participating in a cluster.
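
The comparison of views in step 2 can be summarized in a short Python sketch. It assumes that you have already noted, by hand, the circuit state that each active computer reports for the failing computer; the function name, the node names, and every state value other than OPEN are hypothetical.

```python
def locate_problem(views: dict) -> str:
    """Suggest where to look, given each observer's circuit state.

    `views` maps the name of an active computer to the state it reports
    for its virtual circuit to the failing computer.
    """
    not_open = sorted(name for name, state in views.items() if state != "OPEN")
    if not not_open:
        return "no observer reports a problem"
    if len(not_open) == len(views):
        # Every active computer agrees: the views are consistent.
        return "suspect the failing computer"
    if len(not_open) == 1:
        return "suspect observer " + not_open[0]
    return "inconclusive: compare paths and error logs"


if __name__ == "__main__":
    print(locate_problem({"NODEA": "CLOSED", "NODEB": "CLOSED", "NODEC": "CLOSED"}))
    print(locate_problem({"NODEA": "OPEN", "NODEB": "CLOSED", "NODEC": "OPEN"}))
```

With a single observer the first rule applies trivially, so the result is only meaningful when several active computers have been checked.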

C.9.4 Verifying LAN Connections

The Local Area OpenVMS Cluster Network Failure Analysis Program described in Section D.4 uses the HELLO datagram messages to verify continuously the network paths (channels) used by PEDRIVER. This verification process, combined with physical description of the network, can:

Isolate failing network components
Group failing channels together and map them onto the physical network description
Call out the common components related to the channel failures
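
The idea of mapping failing channels onto a physical network description and calling out their common components can be illustrated with set operations. The following Python sketch is not the Network Failure Analysis Program. The topology, the component names, and the rule that a component on a working channel is presumed good are all assumptions made for the example.

```python
def suspect_components(network_map: dict, failing) -> set:
    """Return components shared by all failing channels and by no working one.

    `network_map` maps each channel (a pair of node names) to the list of
    physical components on its path.
    """
    failing = set(failing)
    if not failing:
        return set()
    # Components that every failing channel has in common.
    common = set.intersection(*(set(network_map[ch]) for ch in failing))
    # Components that carry at least one working channel are presumed good.
    working = set()
    for channel, path in network_map.items():
        if channel not in failing:
            working.update(path)
    return common - working


if __name__ == "__main__":
    # Hypothetical network: three segments joined by two bridges.
    network_map = {
        ("NODEA", "NODEB"): ["seg1", "bridge1", "seg2"],
        ("NODEA", "NODEC"): ["seg1", "bridge2", "seg3"],
        ("NODEB", "NODEC"): ["seg2", "bridge1", "seg1", "bridge2", "seg3"],
    }
    failing = [("NODEA", "NODEB"), ("NODEB", "NODEC")]
    print(sorted(suspect_components(network_map, failing)))
```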

C.10 Analyzing Error-Log Entries for Port Devices

Monitoring events recorded in the error log can help you anticipate and avoid potential problems. From the total error count (displayed by the DCL command SHOW DEVICES device-name), you can determine whether errors are increasing. If so, you should examine the error log.
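
Deciding whether errors are increasing amounts to comparing successive readings of the error count. The Python sketch below assumes that you record the count for a device by hand at regular intervals; the function names and the sample values are hypothetical.

```python
def error_deltas(samples: list) -> list:
    """Return the change in error count between consecutive readings."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def errors_increasing(samples: list) -> bool:
    """Return True if the error count rose between any two readings."""
    return any(delta > 0 for delta in error_deltas(samples))


if __name__ == "__main__":
    readings = [2, 2, 5, 9]   # hypothetical error counts noted over time
    print(error_deltas(readings))
    print(errors_increasing(readings))
```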

C.10.1 Examine the Error Log

The DCL command ANALYZE/ERROR_LOG invokes the Error Log utility to report the contents of an error-log file.

Reference: For more information about the Error Log utility, see the HP OpenVMS System Management Utilities Reference Manual.

Some error-log entries are informational only while others require action.

**Table C-5 Informational and Other Error-Log Entries**
Error Type	Action Required?	Purpose
Informational error-log entries require no action. For example, if you shut down a computer in the cluster, all other active computers that have open virtual circuits between themselves and the computer that has been shut down make entries in their error logs. Such computers record up to three errors for the event: Path A received no response. Path B received no response. The virtual circuit is being closed.	No	These messages are normal and reflect the change of state in the circuits to the computer that has been shut down.
Other error-log entries indicate problems that degrade operation or nonfatal hardware problems. The operating system might continue to run satisfactorily under these conditions.	Yes	Detecting these problems early is important to preventing nonfatal problems (such as loss of a single CI path) from becoming serious problems (such as loss of both paths).

C.10.2 Formats

Errors and other events on the LAN cause port drivers to enter information in the system error log in one of two formats:

Device attention
For the LAN, device-attention entries typically record errors on a LAN adapter device.
Logged message
Logged-message entries record the receipt of a message packet that contains erroneous data or that signals an error condition.

Sections C.10.3 and C.10.4 describe those formats.

C.10.3 LAN Device-Attention Entries

Example C-1 shows device-attention entries for the LAN.

Example C-1 LAN Device-Attention Entry

    **** V3.4  ********************* ENTRY  337 ********************************  (1)

    Logging OS                        1. OpenVMS
    System Architecture               2. Alpha
    OS version                           XC56-BL2
    Event sequence number            96.
    Timestamp of occurrence              16-SEP-2009 16:33:03  (2)
    Time since reboot                    0 Day(s) 0:50:08
    Host name                            PERK

    System Model                         AlphaServer ES45 Model 2  (3)

    Entry Type                       98. Asynchronous Device Attention

    ---- Device Profile ----
    Unit                                 PERK$PEA0  (4)
    Product Name                         NI-SCA Port

    ---- NISCA Port Data ----
    Error Type and SubType        x0700  Device Error, Fatal Error Detected by Datalink  (5)
    Status            x0000120100000001  (6)
    Datalink Device Name                 EIA2:  (7)
    Remote Node Name                     (8)
    Remote Address    x0000000000000000  (9)
    Local Address     x000063B4000400AA  (10)
    Error Count                       1. Error Occurrences This Entry  (11)

    ----- Software Info -----
    UCB$x_ERRCNT                      2. Errors This Unit

The following table describes the LAN device-attention entries in Example C-1.

Entry	Description
(1)	The four lines are the entry heading. These lines contain the number of the entry in this error log file, the architecture, the OS version and the sequence number of this error. Each entry in the log file contains such a heading.
(2)	This line contains the date and time.
(3)	The next two lines contain the system model and the entry type.
(4)	This line shows the name of the subsystem and component that caused the entry.
(5)	This line shows the reason for the entry. The LAN driver has shut down the data link because of a fatal error. The data link will be restarted automatically, if possible.
(6)	The first longword shows the I/O completion status returned by the LAN driver. The second longword is the VCI event code delivered to PEDRIVER by the LAN driver.
(7)	DATALINK NAME is the name of the LAN device on which the error occurred.
(8)	REMOTE NODE is the name of the remote node to which the packet was being sent. If zeros are displayed, either no remote node was available or no packet was associated with the error.
(9)	REMOTE ADDR is the LAN address of the remote node to which the packet was being sent. If zeros are displayed, no packet was associated with the error.
(10)	LOCAL ADDR is the LAN address of the local node.
(11)	ERROR CNT. Because some errors can occur at extremely high rates, some error log entries represent more than one occurrence of an error. This field indicates how many. The errors counted occurred in the 3 seconds preceding the timestamp on the entry.
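
When reading such an entry, it can help to split the hexadecimal fields mechanically. The Python sketch below makes two assumptions that the table does not state: that the "first longword" of the Status field is the low-order 32 bits of the displayed value, and that the address fields are displayed with the first byte of the LAN address in the low-order position. The function names are hypothetical.

```python
def _to_int(field: str) -> int:
    """Convert a field such as 'x0000120100000001' to an integer."""
    text = field.strip().lower()
    if text.startswith("x"):
        text = text[1:]
    return int(text, 16)


def split_status(status_field: str) -> tuple:
    """Split the Status quadword into (first longword, second longword).

    Assumption: the first longword is the low-order half of the value.
    """
    value = _to_int(status_field)
    return value & 0xFFFFFFFF, (value >> 32) & 0xFFFFFFFF


def format_lan_address(address_field: str) -> str:
    """Format an address field as six hyphen-separated bytes.

    Assumption: the low-order byte of the displayed value comes first.
    """
    raw = _to_int(address_field).to_bytes(8, "little")
    return "-".join(f"{byte:02X}" for byte in raw[:6])


if __name__ == "__main__":
    completion_status, event_code = split_status("x0000120100000001")
    print(hex(completion_status), hex(event_code))
    print(format_lan_address("x000063B4000400AA"))
```

Under these assumptions, the Status value in Example C-1 separates into a completion status of 1 and an event code of x1201, and the Local Address field reads as AA-00-04-00-B4-63.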
