HP OpenVMS Systems Documentation
OpenVMS Cluster Systems
C.12 OPA0 Error-Message Logging and Broadcasting

Port drivers detect certain error conditions and attempt to log them. When such a condition occurs, the port driver attempts both OPA0 error broadcasting and standard error logging.
Note the implicit assumption that the system and error-logging devices are one and the same. The following table describes error-logging methods and their reliability.
Note: Certain error conditions are always broadcast to
OPA0, regardless of whether the error-logging device is accessible. In
general, these are errors that cause the port to shut down either
permanently or temporarily.
One OPA0 error message for each error condition is always logged. The text of each error message is similar to the text in the summary displayed by formatting the corresponding standard error-log entry using the Error Log utility. (See Section C.11.7 for a list of Error Log utility summary messages and their explanations.) Table C-8 lists the OPA0 error messages. The table is divided into units by error type. Many of the OPA0 error messages contain some optional information, such as the remote port number, CI packet information (flags, port operation code, response status, and port number fields), or specific CI port registers. The codes specify whether the message is always logged on OPA0 or is logged only when the system device is inaccessible.
1 If the port driver can identify the remote SCS node name of the affected computer, the driver replaces the "REMOTE PORT xxx" text with "REMOTE SYSTEM X...", where X... is the value of the system parameter SCSNODE on the remote computer. If the remote SCS node name is not available, the port driver uses the existing message format.

Key to CI Port Registers: CNF---configuration register. See also the CI hardware documentation for a detailed description of the CI port registers.

C.12.2 CI Port Recovery

Two other messages concerning the CI port appear on OPA0:
The first message indicates that a previous error requiring the port to shut down is recoverable and that the port will be reinitialized. The "xxx retries left" specifies how many more reinitializations are allowed before the port must be left permanently off line. Each reinitialization of the port (for reasons other than power fail recovery) causes approximately 2 KB of nonpaged pool to be lost. The second message indicates that a previous error is not recoverable and that the port will be left off line. In this case, the only way to recover the port is to reboot the computer.
Appendix D
Program | Description |
---|---|
LAVC$START_BUS.MAR | Starts the NISCA protocol on a specified LAN adapter. |
LAVC$STOP_BUS.MAR | Stops the NISCA protocol on a specified LAN adapter. |
LAVC$FAILURE_ANALYSIS.MAR | Enables LAN network failure analysis. |
LAVC$BUILD.COM | Assembles and links the sample programs. |
Reference: The NISCA protocol, responsible for
carrying messages across Ethernet and FDDI LANs to other nodes in the
cluster, is described in Appendix F.
D.1 Purpose of Programs
The port emulator driver, PEDRIVER, starts the NISCA protocol on all of the LAN adapters in the cluster. LAVC$START_BUS.MAR and LAVC$STOP_BUS.MAR are provided for cluster managers who want to split the network load according to protocol type and therefore do not want the NISCA protocol running on all of the LAN adapters.
Reference: See Section D.5 for information about
editing and using the network failure analysis program.
D.2 Starting the NISCA Protocol
The sample program LAVC$START_BUS.MAR, provided in SYS$EXAMPLES, starts the NISCA protocol on a specific LAN adapter.
To build the program, perform the following steps:
Step | Action |
---|---|
1 | Copy the files LAVC$START_BUS.MAR and LAVC$BUILD.COM from SYS$EXAMPLES to your local directory. |
2 | Assemble and link the sample program using the following command: $ @LAVC$BUILD.COM LAVC$START_BUS.MAR |
To start the protocol on a LAN adapter, perform the following steps:
Step | Action |
---|---|
1 | Use an account that has the PHY_IO privilege---you need this to run LAVC$START_BUS.EXE. |
2 | Define the foreign command (DCL symbol). |
3 | Execute the foreign command (LAVC$START_BUS.EXE), followed by the name of the LAN adapter on which you want to start the protocol. |
Example: The following example shows how to start the NISCA protocol on LAN adapter ETA0:
$ START_BUS:==$SYS$DISK:[]LAVC$START_BUS.EXE
$ START_BUS ETA
D.3 Stopping the NISCA Protocol
The sample program LAVC$STOP_BUS.MAR, provided in SYS$EXAMPLES, stops the NISCA protocol on a specific LAN adapter.
Caution: Stopping the NISCA protocol on all LAN adapters causes satellites to hang and could cause cluster systems to fail with a CLUEXIT bugcheck.
Follow the steps below to build the program:
Step | Action |
---|---|
1 | Copy the files LAVC$STOP_BUS.MAR and LAVC$BUILD.COM from SYS$EXAMPLES to your local directory. |
2 | Assemble and link the sample program using the following command: $ @LAVC$BUILD.COM LAVC$STOP_BUS.MAR |
D.3.1 Stop the Protocol
To stop the NISCA protocol on a LAN adapter, perform the following
steps:
Step | Action |
---|---|
1 | Use an account that has the PHY_IO privilege---you need this to run LAVC$STOP_BUS.EXE. |
2 | Define the foreign command (DCL symbol). |
3 | Execute the foreign command (LAVC$STOP_BUS.EXE), followed by the name of the LAN adapter on which you want to stop the protocol. |
Example: The following example shows how to stop the NISCA protocol on LAN adapter ETA0:
$ STOP_BUS:==$SYS$DISK:[]LAVC$STOP_BUS.EXE
$ STOP_BUS ETA
When the LAVC$STOP_BUS module executes successfully, the following device-attention entry is written to the system error log:
DEVICE ATTENTION... NI-SCS SUB-SYSTEM... FATAL ERROR DETECTED BY DATALINK... |
In addition, the following hexadecimal values are written to the STATUS field of the entry:
First longword (00000001)
Second longword (00001201)
The error-log entry indicates expected behavior and can be ignored.
However, if the first longword of the STATUS field contains a value
other than hexadecimal value 00000001, an error has occurred and
further investigation may be necessary.
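As a quick illustration (not part of the OpenVMS kit), a script that post-processes a decoded error-log entry could apply this check as follows. The entry layout here is a simplifying assumption; only the expected longword values come from the text above.

```python
# Illustrative check of the STATUS field of the device-attention entry
# written when LAVC$STOP_BUS runs. Assumes the STATUS field has already
# been extracted as two 32-bit longwords (hypothetical helper data; this
# is not a real OpenVMS API).

EXPECTED_FIRST_LONGWORD = 0x00000001   # normal result of stopping the protocol

def status_is_expected(first_longword: int) -> bool:
    """Return True when the entry reflects a normal LAVC$STOP_BUS run."""
    return first_longword == EXPECTED_FIRST_LONGWORD

# Example: the values the manual says a successful stop produces.
status = (0x00000001, 0x00001201)
if status_is_expected(status[0]):
    print("expected entry; can be ignored")
else:
    print("unexpected status; investigate further")
```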
D.4 Analyzing Network Failures
LAVC$FAILURE_ANALYSIS.MAR is a sample program, located in SYS$EXAMPLES,
that you can edit and use to help detect and isolate a failed network
component. When the program executes, it provides the physical
description of your cluster communications network to the set of
routines that perform the failure analysis.
D.4.1 Failure Analysis
Using the network failure analysis program can help reduce the time
necessary for detection and isolation of a failing network component
and, therefore, significantly increase cluster availability.
D.4.2 How the LAVC$FAILURE_ANALYSIS Program Works
The following table describes how the LAVC$FAILURE_ANALYSIS program works.
Step | Program Action |
---|---|
1 | The program groups channels that fail and compares them with the physical description of the cluster network. |
2 | The program then develops a list of nonworking network components related to the failed channels and uses OPCOM messages to display the names of components with a probability of causing one or more channel failures. If the network failure analysis cannot verify that a portion of a path (containing multiple components) works, the program reports the components in that portion of the path as suspects in its OPCOM output. |
3 | When the component works again, OPCOM displays the message %LAVC-S-WORKING. |
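The grouping idea in the table above can be sketched in ordinary code. The following Python fragment is an illustration of the concept only: the real analysis runs inside PEDRIVER at elevated IPL against the MACRO-32 description you edit, and every name and structure below is hypothetical.

```python
# Illustrative sketch of the failure-analysis idea: each channel maps to
# the set of network components on its path. Components shared by every
# failed channel, but absent from all working channels, become the
# suspects. Names and structures here are hypothetical.

def find_suspects(paths, failed, working):
    """paths: channel name -> set of component labels on that channel's path."""
    in_all_failed = set.intersection(*(paths[ch] for ch in failed))
    in_any_working = set().union(*(paths[ch] for ch in working)) if working else set()
    return in_all_failed - in_any_working

paths = {
    "A1-B1": {"A", "A1", "seg1", "B1", "B"},
    "A1-C1": {"A", "A1", "seg1", "bridge", "seg2", "C1", "C"},
    "A2-C1": {"A", "A2", "seg2", "C1", "C"},
}
# Channels crossing seg1 fail; the one staying on seg2 still works.
suspects = find_suspects(paths, failed=["A1-B1", "A1-C1"], working=["A2-C1"])
print(sorted(suspects))   # → ['A1', 'seg1']
```

Node A is cleared even though it lies on both failed paths, because a working channel also passes through it; the adapter A1 and segment seg1 remain implicated.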
D.5 Using the Network Failure Analysis Program

Table D-1 describes the steps you perform to edit and use the network failure analysis program.
Step | Action | Reference |
---|---|---|
1 | Collect and record information specific to your cluster communications network. | Section D.5.1 |
2 | Edit a copy of LAVC$FAILURE_ANALYSIS.MAR to include the information you collected. | Section D.5.2 |
3 | Assemble, link, and debug the program. | Section D.5.3 |
4 | Modify startup files to run the program only on the node for which you supplied data. | Section D.5.4 |
5 | Execute the program on one or more of the nodes where you plan to perform the network failure analysis. | Section D.5.5 |
6 | Modify MODPARAMS.DAT to increase the values of nonpaged pool parameters. | Section D.5.6 |
7 | Test the Local Area OpenVMS Cluster Network Failure Analysis Program. | Section D.5.7 |
D.5.1 Create a Network Diagram
Follow the steps in Table D-2 to create a physical description of
the network configuration and include it in electronic form in the
LAVC$FAILURE_ANALYSIS.MAR program.
Step | Action | Comments |
---|---|---|
1 | Draw a diagram of your OpenVMS Cluster communications network. | When you edit LAVC$FAILURE_ANALYSIS.MAR, you include this drawing (in electronic form) in the program. Your drawing should show the physical layout of the cluster and each of the components that make up its communications network. For large clusters, you may need to verify the configuration by tracing the cables. |
2 | Give each component in the drawing a unique label. | If your OpenVMS Cluster contains a large number of nodes, you may want to replace each node name with a shorter abbreviation. Abbreviating node names can help save space in the electronic form of the drawing when you include it in LAVC$FAILURE_ANALYSIS.MAR. For example, you can replace the node name ASTRA with A and call node ASTRA's two LAN adapters A1 and A2. |
3 | List the identifying information for each component, including its LAN address where applicable. | Devices such as DELNI interconnects, DEMPR repeaters, and cables do not have LAN addresses. |
4 | Classify each component into one of the categories the program recognizes (for example, node, LAN adapter, generic component, or cloud). | The cloud component is necessary only when multiple paths exist between two points within the network, such as with redundant bridging between LAN segments. At a high level, multiple paths can exist; however, during operation, this bridge configuration allows only one path to exist at one time. In general, this bridge example is probably better handled by representing the active bridge in the description as a component and ignoring the standby bridge. (You can identify the active bridge with such network monitoring software as RBMS or DECelms.) With the default bridge parameters, failure of the active bridge will be called out. |
5 | Use the component labels from step 2 to describe each of the connections in the OpenVMS Cluster communications network. | |
6 | Choose a node or group of nodes to run the network failure analysis program. | You should run the program only on a node that you included in the physical description when you edited LAVC$FAILURE_ANALYSIS.MAR. The network failure analysis program on one node operates independently of the other systems in the OpenVMS Cluster, so choose systems that are not normally shut down or that are otherwise continuously available. Note: The physical description is loaded into nonpaged pool, and all processing is performed at IPL 8. CPU use increases as the average number of network components in the network path increases. CPU use also increases as the total number of network paths increases. |
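The information you collect in Table D-2 amounts to a labeled component list plus a connection list. Purely as an illustration of that data (not the MACRO-32 syntax that LAVC$FAILURE_ANALYSIS.MAR actually uses), the description might look like this; all labels, names, and addresses below are invented examples.

```python
# Hypothetical sketch of the data gathered in Table D-2: labeled
# components classified by type, plus the connections between them.
# The real program encodes this with MACRO-32 macros in
# LAVC$FAILURE_ANALYSIS.MAR; everything below is illustrative.

components = {
    "A":  {"type": "node",      "name": "ASTRA"},
    "A1": {"type": "adapter",   "address": "08-00-2B-12-34-56"},
    "A2": {"type": "adapter",   "address": "08-00-2B-12-34-57"},
    "D1": {"type": "component", "name": "DELNI"},  # no LAN address
}

# Each connection pairs two component labels from the table above
# (step 5 of Table D-2).
connections = [("A", "A1"), ("A", "A2"), ("A1", "D1"), ("A2", "D1")]

# Basic sanity check: every connection refers to a labeled component.
for a, b in connections:
    assert a in components and b in components
print("description is internally consistent")
```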