HP OpenVMS Systems Documentation
OpenVMS Cluster Systems
C.12 OPA0 Error-Message Logging and Broadcasting

Port drivers detect certain error conditions and attempt to log them. When such a condition occurs, the port driver attempts both OPA0 error broadcasting and standard error logging.
Note the implicit assumption that the system and error-logging devices are one and the same. The following table describes error-logging methods and their reliability.
Note: Certain error conditions are always broadcast to
OPA0, regardless of whether the error-logging device is accessible. In
general, these are errors that cause the port to shut down either
permanently or temporarily.
One OPA0 error message for each error condition is always logged. The text of each error message is similar to the text in the summary displayed by formatting the corresponding standard error-log entry using the Error Log utility. (See Section C.11.7 for a list of Error Log utility summary messages and their explanations.) Table C-8 lists the OPA0 error messages. The table is divided into units by error type. Many of the OPA0 error messages contain some optional information, such as the remote port number, CI packet information (flags, port operation code, response status, and port number fields), or specific CI port registers. The codes specify whether the message is always logged on OPA0 or is logged only when the system device is inaccessible.
1 If the port driver can identify the remote SCS node name of the affected computer, the driver replaces the "REMOTE PORT xxx" text with "REMOTE SYSTEM X...", where X... is the value of the system parameter SCSNODE on the remote computer. If the remote SCS node name is not available, the port driver uses the existing message format.

Key to CI Port Registers: CNF---configuration register. See also the CI hardware documentation for a detailed description of the CI port registers.

C.12.2 CI Port Recovery

Two other messages concerning the CI port appear on OPA0:
The first message indicates that a previous error requiring the port to shut down is recoverable and that the port will be reinitialized. The "xxx retries left" specifies how many more reinitializations are allowed before the port must be left permanently off line. Each reinitialization of the port (for reasons other than power fail recovery) causes approximately 2 KB of nonpaged pool to be lost. The second message indicates that a previous error is not recoverable and that the port will be left off line. In this case, the only way to recover the port is to reboot the computer.
Appendix D
Program | Description |
---|---|
LAVC$START_BUS.MAR | Starts the NISCA protocol on a specified LAN adapter. |
LAVC$STOP_BUS.MAR | Stops the NISCA protocol on a specified LAN adapter. |
LAVC$FAILURE_ANALYSIS.MAR | Enables LAN network failure analysis. |
LAVC$BUILD.COM | Assembles and links the sample programs. |
Reference: The NISCA protocol, responsible for
carrying messages across Ethernet and FDDI LANs to other nodes in the
cluster, is described in Appendix F.
D.1 Purpose of Programs
The port emulator driver, PEDRIVER, starts the NISCA protocol on all of the LAN adapters in the cluster. LAVC$START_BUS.MAR and LAVC$STOP_BUS.MAR are provided for cluster managers who want to split the network load according to protocol type and therefore do not want the NISCA protocol running on all of the LAN adapters.
Reference: See Section D.5 for information about
editing and using the network failure analysis program.
D.2 Starting the NISCA Protocol
The sample program LAVC$START_BUS.MAR, provided in SYS$EXAMPLES, starts the NISCA protocol on a specific LAN adapter.
To build the program, perform the following steps:
Step | Action |
---|---|
1 | Copy the files LAVC$START_BUS.MAR and LAVC$BUILD.COM from SYS$EXAMPLES to your local directory. |
2 | Assemble and link the sample program using the following command: $ @LAVC$BUILD.COM LAVC$START_BUS.MAR |
To start the protocol on a LAN adapter, perform the following steps:
Step | Action |
---|---|
1 | Use an account that has the PHY_IO privilege---you need this to run LAVC$START_BUS.EXE. |
2 | Define the foreign command (DCL symbol). |
3 | Execute the foreign command (LAVC$START_BUS.EXE), followed by the name of the LAN adapter on which you want to start the protocol. |
Example: The following example shows how to start the NISCA protocol on LAN adapter ETA0:
$ START_BUS:==$SYS$DISK:[]LAVC$START_BUS.EXE
$ START_BUS ETA
D.3 Stopping the NISCA Protocol
The sample program LAVC$STOP_BUS.MAR, provided in SYS$EXAMPLES, stops the NISCA protocol on a specific LAN adapter.
Caution: Stopping the NISCA protocol on all LAN adapters causes satellites to hang and could cause cluster systems to fail with a CLUEXIT bugcheck.
Follow the steps below to build the program:
Step | Action |
---|---|
1 | Copy the files LAVC$STOP_BUS.MAR and LAVC$BUILD.COM from SYS$EXAMPLES to your local directory. |
2 | Assemble and link the sample program using the following command: $ @LAVC$BUILD.COM LAVC$STOP_BUS.MAR |
D.3.1 Stop the Protocol
To stop the NISCA protocol on a LAN adapter, perform the following
steps:
Step | Action |
---|---|
1 | Use an account that has the PHY_IO privilege---you need this to run LAVC$STOP_BUS.EXE. |
2 | Define the foreign command (DCL symbol). |
3 | Execute the foreign command (LAVC$STOP_BUS.EXE), followed by the name of the LAN adapter on which you want to stop the protocol. |
Example: The following example shows how to stop the NISCA protocol on LAN adapter ETA0:
$ STOP_BUS:==$SYS$DISK:[]LAVC$STOP_BUS.EXE
$ STOP_BUS ETA
When the LAVC$STOP_BUS module executes successfully, the following device-attention entry is written to the system error log:
DEVICE ATTENTION... NI-SCS SUB-SYSTEM... FATAL ERROR DETECTED BY DATALINK... |
In addition, the following hexadecimal values are written to the STATUS field of the entry:
First longword (00000001)
Second longword (00001201)
The error-log entry indicates expected behavior and can be ignored.
However, if the first longword of the STATUS field contains a value
other than hexadecimal value 00000001, an error has occurred and
further investigation may be necessary.
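As a quick illustration (not part of the OpenVMS kit), a script that post-processes a decoded error-log entry could apply this check as follows. The entry layout here is a simplifying assumption; only the expected longword values come from the text above.

```python
# Illustrative check of the STATUS field of the device-attention entry
# written when LAVC$STOP_BUS runs. Assumes the STATUS field has already
# been extracted as two 32-bit longwords (hypothetical helper data; this
# is not a real OpenVMS API).

EXPECTED_FIRST_LONGWORD = 0x00000001   # normal result of stopping the protocol

def status_is_expected(first_longword: int) -> bool:
    """Return True when the entry reflects a normal LAVC$STOP_BUS run."""
    return first_longword == EXPECTED_FIRST_LONGWORD

# Example: the values the manual says a successful stop produces.
status = (0x00000001, 0x00001201)
if status_is_expected(status[0]):
    print("expected entry; can be ignored")
else:
    print("unexpected status; investigate further")
```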
D.4 Analyzing Network Failures
LAVC$FAILURE_ANALYSIS.MAR is a sample program, located in SYS$EXAMPLES,
that you can edit and use to help detect and isolate a failed network
component. When the program executes, it provides the physical
description of your cluster communications network to the set of
routines that perform the failure analysis.
D.4.1 Failure Analysis
Using the network failure analysis program can help reduce the time
necessary for detection and isolation of a failing network component
and, therefore, significantly increase cluster availability.
D.4.2 How the LAVC$FAILURE_ANALYSIS Program Works
The following table describes how the LAVC$FAILURE_ANALYSIS program works.
Step | Program Action |
---|---|
1 | The program groups channels that fail and compares them with the physical description of the cluster network. |
2 | The program then develops a list of nonworking network components related to the failed channels and uses OPCOM messages to display the names of components with a probability of causing one or more channel failures. If the network failure analysis cannot verify that a portion of a path (containing multiple components) works, the program reports the components in that portion of the path as suspects in its OPCOM output. |
3 | When the component works again, OPCOM displays the message %LAVC-S-WORKING. |
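The grouping idea in the table above can be sketched in ordinary code. The following Python fragment is an illustration of the concept only: the real analysis runs inside PEDRIVER at elevated IPL against the MACRO-32 description you edit, and every name and structure below is hypothetical.

```python
# Illustrative sketch of the failure-analysis idea: each channel maps to
# the set of network components on its path. Components shared by every
# failed channel, but absent from all working channels, become the
# suspects. Names and structures here are hypothetical.

def find_suspects(paths, failed, working):
    """paths: channel name -> set of component labels on that channel's path."""
    in_all_failed = set.intersection(*(paths[ch] for ch in failed))
    in_any_working = set().union(*(paths[ch] for ch in working)) if working else set()
    return in_all_failed - in_any_working

paths = {
    "A1-B1": {"A", "A1", "seg1", "B1", "B"},
    "A1-C1": {"A", "A1", "seg1", "bridge", "seg2", "C1", "C"},
    "A2-C1": {"A", "A2", "seg2", "C1", "C"},
}
# Channels crossing seg1 fail; the one staying on seg2 still works.
suspects = find_suspects(paths, failed=["A1-B1", "A1-C1"], working=["A2-C1"])
print(sorted(suspects))   # → ['A1', 'seg1']
```

Node A is cleared even though it lies on both failed paths, because a working channel also passes through it; the adapter A1 and segment seg1 remain implicated.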
D.5 Using the Network Failure Analysis Program

Table D-1 describes the steps you perform to edit and use the network failure analysis program.
Step | Action | Reference |
---|---|---|
1 | Collect and record information specific to your cluster communications network. | Section D.5.1 |
2 | Edit a copy of LAVC$FAILURE_ANALYSIS.MAR to include the information you collected. | Section D.5.2 |
3 | Assemble, link, and debug the program. | Section D.5.3 |
4 | Modify startup files to run the program only on the node for which you supplied data. | Section D.5.4 |
5 | Execute the program on one or more of the nodes where you plan to perform the network failure analysis. | Section D.5.5 |
6 | Modify MODPARAMS.DAT to increase the values of nonpaged pool parameters. | Section D.5.6 |
7 | Test the Local Area OpenVMS Cluster Network Failure Analysis Program. | Section D.5.7 |
D.5.1 Create a Network Diagram
Follow the steps in Table D-2 to create a physical description of
the network configuration and include it in electronic form in the
LAVC$FAILURE_ANALYSIS.MAR program.
Step | Action | Comments |
---|---|---|
1 | Draw a diagram of your OpenVMS Cluster communications network. | When you edit LAVC$FAILURE_ANALYSIS.MAR, you include this drawing (in electronic form) in the program. Your drawing should show the physical layout of the cluster and each of the components that make up its communications network. For large clusters, you may need to verify the configuration by tracing the cables. |
2 | Give each component in the drawing a unique label. | If your OpenVMS Cluster contains a large number of nodes, you may want to replace each node name with a shorter abbreviation. Abbreviating node names can help save space in the electronic form of the drawing when you include it in LAVC$FAILURE_ANALYSIS.MAR. For example, you can replace the node name ASTRA with A and call node ASTRA's two LAN adapters A1 and A2. |
3 | List the identifying information for each component, including its LAN address where applicable. | Devices such as DELNI interconnects, DEMPR repeaters, and cables do not have LAN addresses. |
4 | Classify each component into one of the categories the program recognizes (for example, node, LAN adapter, generic component, or cloud). | The cloud component is necessary only when multiple paths exist between two points within the network, such as with redundant bridging between LAN segments. At a high level, multiple paths can exist; however, during operation, this bridge configuration allows only one path to exist at one time. In general, this bridge example is probably better handled by representing the active bridge in the description as a component and ignoring the standby bridge. (You can identify the active bridge with such network monitoring software as RBMS or DECelms.) With the default bridge parameters, failure of the active bridge will be called out. |
5 | Use the component labels from step 2 to describe each of the connections in the OpenVMS Cluster communications network. | |
6 | Choose a node or group of nodes to run the network failure analysis program. | You should run the program only on a node that you included in the physical description when you edited LAVC$FAILURE_ANALYSIS.MAR. The network failure analysis program on one node operates independently of the other systems in the OpenVMS Cluster, so choose systems that are not normally shut down or that are otherwise continuously available. Note: The physical description is loaded into nonpaged pool, and all processing is performed at IPL 8. CPU use increases as the average number of network components in the network path increases. CPU use also increases as the total number of network paths increases. |
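The information you collect in Table D-2 amounts to a labeled component list plus a connection list. Purely as an illustration of that data (not the MACRO-32 syntax that LAVC$FAILURE_ANALYSIS.MAR actually uses), the description might look like this; all labels, names, and addresses below are invented examples.

```python
# Hypothetical sketch of the data gathered in Table D-2: labeled
# components classified by type, plus the connections between them.
# The real program encodes this with MACRO-32 macros in
# LAVC$FAILURE_ANALYSIS.MAR; everything below is illustrative.

components = {
    "A":  {"type": "node",      "name": "ASTRA"},
    "A1": {"type": "adapter",   "address": "08-00-2B-12-34-56"},
    "A2": {"type": "adapter",   "address": "08-00-2B-12-34-57"},
    "D1": {"type": "component", "name": "DELNI"},  # no LAN address
}

# Each connection pairs two component labels from the table above
# (step 5 of Table D-2).
connections = [("A", "A1"), ("A", "A2"), ("A1", "D1"), ("A2", "D1")]

# Basic sanity check: every connection refers to a labeled component.
for a, b in connections:
    assert a in components and b in components
print("description is internally consistent")
```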