OpenVMS Cluster Systems
C.10.4 Verifying Virtual Circuits
To diagnose communication problems, you can invoke the Show Cluster
utility using the instructions in Table C-4.
Table C-4 How to Verify Virtual Circuit States
Step 1
Action: Tailor the SHOW CLUSTER report by entering the SHOW CLUSTER command ADD
CIRCUIT,CABLE_STATUS. This command adds a class of information about
all the virtual circuits as seen from the computer on which you are
running SHOW CLUSTER. CABLE_STATUS indicates the status of the path for
the circuit from the CI interface on the local system to the CI
interface on the remote system.
What to look for: Primarily, you are checking whether there is a virtual circuit in the
OPEN state to the failing computer. Common causes of failure to open a
virtual circuit and keep it open are the following:
- Port errors on one side or the other
- Cabling errors
- A port set off line because of software problems
- Insufficient nonpaged pool on both sides
- Failure to set correct values for the SCSNODE, SCSSYSTEMID,
PAMAXPORT, PANOPOLL, PASTIMOUT, and PAPOLLINTERVAL system parameters
Step 2
Action: Run SHOW CLUSTER from each active computer in the cluster to verify
whether each computer's view of the failing computer is consistent with
every other computer's view.
- WHEN all the active computers have a consistent view of the failing
computer, THEN the problem may be in the failing computer.
- WHEN only one of several active computers detects that the newcomer is
failing, THEN that particular computer may have a problem.
What to look for: If no virtual circuit is open to the failing computer, check the bottom
of the SHOW CLUSTER display:
- For information about circuits to the port of the failing computer.
Virtual circuits in partially open states are shown at the bottom of
the display. If the circuit is shown in a state other than OPEN,
communications between the local and remote ports are taking place, and
the failure is probably at a higher level than in port or cable
hardware.
- To see whether both path A and path B to the failing port are good.
The loss of one path should not prevent a computer from participating
in a cluster.
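The comparison in step 2 can be sketched in a few lines of Python. This is only an illustration of the WHEN/THEN reasoning: the function name, the dictionary of views, and the node names are hypothetical, and the circuit states would have to be collected by hand from SHOW CLUSTER on each computer.

```python
def locate_suspect(views, failing):
    """Apply the step 2 reasoning to a set of collected views.

    views maps the name of each active computer to the virtual circuit
    state it reports for the failing computer (for example "OPEN").
    Both arguments are hypothetical inputs gathered manually.
    """
    bad = sorted(name for name, state in views.items() if state != "OPEN")
    if not bad:
        return "no computer reports a problem"
    if len(bad) == len(views):
        # Consistent view everywhere: the failing computer is the suspect.
        return f"views agree: suspect {failing}"
    if len(bad) == 1:
        # Only one observer sees the failure: that observer is the suspect.
        return f"only {bad[0]} sees a failure: suspect {bad[0]}"
    # The table does not cover this case; report it for manual analysis.
    return "views are mixed: " + ", ".join(bad)


print(locate_suspect({"MARS": "CLOSED", "VENUS": "OPEN", "PLUTO": "OPEN"}, "PHOBOS"))
```

The mixed case is not covered by the table, so the sketch only reports it.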
C.10.5 Verifying CI Cable Connections
Whenever the configuration poller finds that no virtual circuits are
open and that no handshake procedures are currently opening virtual
circuits, the poller analyzes its environment. It does so by using the
send-loopback-datagram facility of the CI port in the following fashion:
- The send-loopback-datagram facility tests the connections between
the CI port and the star coupler by routing messages across them. The
messages are called loopback datagrams. (The port processes other
self-directed messages without using the star coupler or external
cables.)
- The configuration poller makes entries in the error log whenever
it detects a change in the state of a circuit. Note, however, that two
changed-to-failed-state messages can be entered in the log without an
intervening changed-to-succeeded-state message. Such a series of
entries means that the circuit state continues to be faulty.
C.10.6 Diagnosing CI Cabling Problems
The following paragraphs discuss various incorrect CI cabling
configurations and the entries made in the error log when these
configurations exist. Figure C-1 shows a two-computer configuration
with all cables correctly connected. Figure C-2 shows a CI cluster
with a pair of crossed cables.
Figure C-1 Correctly Connected Two-Computer CI Cluster
Figure C-2 Crossed CI Cable Pair
If a pair of transmitting cables or a pair of receiving cables is
crossed, a message sent on TA is received on RB, and a message sent on
TB is received on RA. This is a hardware error condition from which the
port cannot recover. An entry is made in the error log indicating that
a single pair of crossed cables exists. The entry contains the
following lines:
DATA CABLE(S) CHANGE OF STATE
PATH 1. LOOPBACK HAS GONE FROM GOOD TO BAD
If this situation exists, you can correct it by reconnecting the cables
properly. The cables could be misconnected in several places. The
coaxial cables that connect the port boards to the bulkhead cable
connectors can be crossed, or the cables can be misconnected to the
bulkhead or the star coupler.
Configuration 1: The information illustrated in
Figure C-2 is represented more simply in Example C-1. It shows the
cables positioned as in Figure C-2, but it does not show the star
coupler or the computers. The labels LOC (local) and REM (remote)
indicate the pairs of transmitting (T) and receiving (R) cables on the
local and remote computers, respectively.
Example C-1 Crossed Cables: Configuration 1
The pair of crossed cables causes loopback datagrams to fail on the
local computer but to succeed on the remote computer. Crossed pairs of
transmitting cables and crossed pairs of receiving cables cause the
same behavior.
Note that only an odd number of crossed cable pairs causes these
problems. If an even number of cable pairs is crossed, communications
succeed. An error log entry is made in some cases, however, and the
contents of the entry depend on which pairs of cables are crossed.
Configuration 2: Example C-2 shows two-computer
clusters with the combinations of two crossed cable pairs. These
crossed pairs cause the following entry to be made in the error log of
the computer that has the cables crossed:
DATA CABLE(S) CHANGE OF STATE
CABLES HAVE GONE FROM UNCROSSED TO CROSSED
Loopback datagrams succeed on both computers, and communications are
possible.
Example C-2 Crossed Cables: Configuration 2
T x = R      T = x R
R x = T      R = x T
LOC REM      LOC REM
Configuration 3: Example C-3 shows the possible
combinations of two pairs of crossed cables that cause loopback
datagrams to fail on both computers in the cluster. Communications can
still take place between the computers. An entry stating that cables
are crossed is made in the error log of each computer.
Example C-3 Crossed Cables: Configuration 3
T x = R      T = x R
R = x T      R x = T
LOC REM      LOC REM
Configuration 4: Example C-4 shows the possible
combinations of two pairs of crossed cables that cause loopback
datagrams to fail on both computers in the cluster but that allow
communications. No entry stating that cables are crossed is made in the
error log of either computer.
Example C-4 Crossed Cables: Configuration 4
T x x R      T = = R
R = = T      R x x T
LOC REM      LOC REM
Configuration 5: Example C-5 shows the possible
combinations of three pairs of crossed cables. In each case, loopback
datagrams fail on the computer that has only one crossed pair of
cables. Loopback datagrams succeed on the computer with both pairs
crossed. No communications are possible.
Example C-5 Crossed Cables: Configuration 5
T x x R      T x = R      T = x R      T x x R
R x = T      R x x T      R x x T      R = x T
LOC REM      LOC REM      LOC REM      LOC REM
If all four cable pairs between two computers are crossed,
communications succeed, loopback datagrams succeed, and no
crossed-cable message entries are made in the error log. You might
detect such a condition by noting error log entries made by a third
computer in the cluster, but this occurs only if the third computer has
one of the crossed-cable cases described.
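The behavior described for configurations 1 through 5 follows a parity pattern, which the following Python sketch makes explicit. It assumes that in each diagram row the first marker belongs to the local computer and the second to the remote computer, with "x" meaning crossed. The per-computer loopback rule is inferred from the cases described in this section rather than stated by them, and the function names are hypothetical.

```python
def count_crossed(diagram):
    """Return (local, remote) counts of crossed pairs in a two-row diagram.

    Each row looks like "T x = R". Assumption: the first marker is the
    local computer's cable pair and the second is the remote computer's.
    """
    local = remote = 0
    for row in diagram:
        _, loc_mark, rem_mark, _ = row.split()
        local += loc_mark == "x"
        remote += rem_mark == "x"
    return local, remote


def predict(diagram):
    """Predict behavior from the parity of crossed pairs (inferred rule)."""
    local, remote = count_crossed(diagram)
    return {
        # An even total number of crossed pairs allows communications.
        "communications": (local + remote) % 2 == 0,
        # Loopback succeeds on a computer with zero or two crossed pairs.
        "local_loopback": local % 2 == 0,
        "remote_loopback": remote % 2 == 0,
    }


# First diagram of Example C-5: three crossed pairs in total.
print(predict(["T x x R", "R x = T"]))
```

The sketch does not predict which error log entries are made, because the text gives no single rule for them.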
C.10.7 Repairing CI Cables
This section describes some ways in which Compaq support
representatives can make repairs on a running computer. This
information is provided to aid system managers in scheduling repairs.
For cluster software to survive cable-checking or cable-replacement
activities, you must be sure that either path A or path B is intact at
all times between each port and every other port in the cluster.
For example, you can remove path A and path B in turn from a particular
port to the star coupler. To make sure that the configuration poller
finds a path that was previously faulty but is now operational, follow
these steps:
Step 1: Remove path B.
Step 2: After the poller has discovered that path B is faulty, reconnect path B.
Step 3: Wait two poller intervals [1], and then take either of the following actions:
- Enter the DCL command SHOW CLUSTER to make sure that the poller has
reestablished path B.
- Enter the DCL command SHOW CLUSTER/CONTINUOUS followed by the SHOW
CLUSTER command ADD CIRCUITS, CABLE_ST.
Step 4: Wait for SHOW CLUSTER to tell you that path B has been reestablished.
Step 5: Remove path A.
Step 6: After the poller has discovered that path A is faulty, reconnect path A.
Step 7: Wait two poller intervals [1] to make sure that the poller has reestablished path A.
[1] Approximately 10 seconds at the default system parameter settings.
If both paths are lost at the same time, the virtual circuits are lost
between the port with the broken cables and all other ports in the
cluster. This condition will in turn result in loss of SCS connections
over the broken virtual circuits. However, recovery from this situation
is automatic after an interruption in service on the affected computer.
The length of the interruption varies, but it is approximately two
poller intervals at the default system parameter settings.
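When you schedule a repair, it can help to check the planned sequence of cable operations against the rule that path A or path B must remain intact at all times. The following Python sketch is one way to do that. The plan format and the function name are hypothetical, and the check covers a single port only.

```python
def first_unsafe_step(steps):
    """Return the 1-based number of the first step that leaves no path
    intact, or None if at least one path stays connected throughout.

    steps is a list of (action, path) tuples, where action is "remove"
    or "reconnect" and path is "A" or "B" (hypothetical plan format).
    """
    intact = {"A", "B"}
    for number, (action, path) in enumerate(steps, start=1):
        if action == "remove":
            intact.discard(path)
        elif action == "reconnect":
            intact.add(path)
        else:
            raise ValueError(f"unknown action: {action}")
        if not intact:
            return number
    return None


# The sequence described in this section: B first, then A.
plan = [("remove", "B"), ("reconnect", "B"), ("remove", "A"), ("reconnect", "A")]
print(first_unsafe_step(plan))
```

The sketch checks ordering only. It does not model the wait of two poller intervals that is required before a reconnected path can be relied on.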
C.10.8 Verifying LAN Connections
The Local Area OpenVMS Cluster Network Failure Analysis Program
described in Section D.4 uses the HELLO datagram messages to
continuously verify the network paths (channels) used by PEDRIVER. This
verification process, combined with a physical description of the
network, can:
- Isolate failing network components
- Group failing channels together and map them onto the physical
network description
- Call out the common components related to the channel failures
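The idea of calling out common components can be illustrated with a set intersection. The following Python sketch is not the Failure Analysis Program itself. It assumes a hypothetical physical description in which each channel is listed with the components it traverses, and all names are invented for the illustration.

```python
def common_components(channel_paths, failing_channels):
    """Return the components shared by every failing channel.

    channel_paths maps a channel name to the list of network components
    it traverses (a hypothetical form of the physical description).
    """
    sets = [set(channel_paths[name]) for name in failing_channels]
    return set.intersection(*sets) if sets else set()


# Hypothetical network: two channels share one bridge.
paths = {
    "A-B": ["adapter_A", "segment_1", "bridge_1", "segment_2", "adapter_B"],
    "A-C": ["adapter_A", "segment_1", "bridge_1", "segment_3", "adapter_C"],
    "D-B": ["adapter_D", "segment_4", "bridge_1", "segment_2", "adapter_B"],
}
print(sorted(common_components(paths, ["A-C", "D-B"])))
```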
C.11 Analyzing Error-Log Entries for Port Devices
Monitoring events recorded in the error log can help you anticipate and
avoid potential problems. From the total error count (displayed by the
DCL command SHOW DEVICES device-name), you can determine
whether errors are increasing. If so, you should examine the error log.
C.11.1 Examine the Error Log
The DCL command ANALYZE/ERROR_LOG invokes the Error Log utility to
report the contents of an error-log file.
Reference: For more information about the Error Log
utility, see the OpenVMS System Management Utilities Reference Manual.
Some error-log entries are informational only while others require
action.
Table C-5 Informational and Other Error-Log Entries
Error type: Informational error-log entries require no action. For
example, if you shut down a computer in the cluster, all other active
computers that have open virtual circuits between themselves and the
computer that has been shut down make entries in their error logs. Such
computers record up to three errors for the event:
- Path A received no response.
- Path B received no response.
- The virtual circuit is being closed.
Action required? No
Purpose: These messages are normal and reflect the change of state in the
circuits to the computer that has been shut down.
Error type: Other error-log entries indicate problems that degrade
operation or nonfatal hardware problems. The operating system might
continue to run satisfactorily under these conditions.
Action required? Yes
Purpose: Detecting these problems early is important to preventing nonfatal
problems (such as loss of a single CI path) from becoming serious
problems (such as loss of both paths).
C.11.2 Formats
Errors and other events on the CI, DSSI, or LAN cause port drivers to
enter information in the system error log in one of two formats:
- Device attention
Device-attention entries for the CI record
events that, in general, are indicated by the setting of a bit in a
hardware register. For the LAN, device-attention entries typically
record errors on a LAN adapter device.
- Logged message
Logged-message entries record the receipt of a
message packet that contains erroneous data or that signals an error
condition.
Sections C.11.3 and C.11.6 describe those formats.
C.11.3 CI Device-Attention Entries
Example C-6 shows device-attention entries for the CI. The left
column gives the name of a device register or a memory location. The
center column gives the value contained in that register or location,
and the right column gives an interpretation of that value.
Example C-6 CI Device-Attention Entries
************************* ENTRY 83. **************************** (1)
ERROR SEQUENCE 10. LOGGED ON: SID 0150400A
DATE/TIME 15-JAN-1994 11:45:27.61 SYS_TYPE 01010000 (2)
DEVICE ATTENTION KA780 (3)
SCS NODE: MARS
CI SUB-SYSTEM, MARS$PAA0: - PORT POWER DOWN (4)
CNFGR 00800038
ADAPTER IS CI
ADAPTER POWER-DOWN
PMCSR 000000CE
MAINTENANCE TIMER DISABLE
MAINTENANCE INTERRUPT ENABLE
MAINTENANCE INTERRUPT FLAG
PROGRAMMABLE STARTING ADDRESS
UNINITIALIZED STATE
PSR 80000001
RESPONSE QUEUE AVAILABLE
MAINTENANCE ERROR
PFAR 00000000
PESR 00000000
PPR 03F80001
UCB$B_ERTCNT 32 (5)
50. RETRIES REMAINING
UCB$B_ERTMAX 32 (6)
50. RETRIES ALLOWABLE
UCB$L_CHAR 0C450000
SHAREABLE
AVAILABLE
ERROR LOGGING
CAPABLE OF INPUT
CAPABLE OF OUTPUT
UCB$W_STS 0010
ONLINE
UCB$W_ERRCNT 000B (7)
11. ERRORS THIS UNIT
The following table describes the device-attention entries in
Example C-6.
(1) The first two lines are the entry heading. These lines contain the
number of the entry in this error log file, the sequence number of this
error, and the identification number (SID) of this computer. Each entry
in the log file contains such a heading.
(2) This line contains the date, the time, and the computer type.
(3) The next two lines contain the entry type, the processor type (KA780),
and the computer's SCS node name.
(4) This line shows the name of the subsystem and the device that caused
the entry and the reason for the entry. The CI subsystem's device PAA0
on MARS was powered down.
The next 15 lines contain the names of hardware registers in the
port, their contents, and interpretations of those contents. See the
appropriate CI hardware manual for a description of all the CI port
registers.
(5) The UCB$B_ERTCNT field contains the number of reinitializations that
the port driver can still attempt. The difference between this value
and UCB$B_ERTMAX is the number of reinitializations already attempted.
(6) The UCB$B_ERTMAX field contains the maximum number of times the port
can be reinitialized by the port driver.
(7) The UCB$W_ERRCNT field contains the total number of errors that have
occurred on this port since it was booted. This total includes both
errors that caused reinitialization of the port and errors that did not.
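The counter fields in Example C-6 are displayed in hexadecimal, with the decimal value on the following line. The following Python sketch shows the arithmetic described for callouts (5) through (7). The function name is hypothetical, and the values are copied by hand from the example.

```python
def reinitializations_attempted(ertcnt_hex, ertmax_hex):
    """Return how many port reinitializations have already been attempted.

    Both arguments are the hexadecimal strings shown for UCB$B_ERTCNT
    (retries remaining) and UCB$B_ERTMAX (retries allowable).
    """
    remaining = int(ertcnt_hex, 16)
    allowable = int(ertmax_hex, 16)
    return allowable - remaining


# Values from Example C-6: hexadecimal 32 is decimal 50.
print(int("32", 16))                            # retries remaining
print(reinitializations_attempted("32", "32"))  # attempts already made
print(int("000B", 16))                          # errors this unit
```

In the example, no reinitializations have been attempted yet, although the port has logged 11 errors since it was booted.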
C.11.4 Error Recovery
The CI port can recover from many errors, but not all. When an error
occurs from which the CI cannot recover, the following process occurs:
Step 1: The port notifies the port driver.
Step 2: The port driver logs the error and attempts to reinitialize the port.
Step 3: If the port fails after 50 such initialization attempts, the driver
takes it off line, unless the system disk is connected to the failing
port or unless this computer is supposed to be a cluster member.
Step 4: If the CI port is required for system disk access or cluster
participation and all 50 reinitialization attempts have been used, then
the computer bugchecks with a CIPORT-type bugcheck.
Once a CI port is off line, you can put the port back on line only by
rebooting the computer.
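The recovery process can be summarized as a small decision function. The following Python sketch models only the outcomes described in this section. It is not the port driver's logic, and the function and parameter names are hypothetical.

```python
MAX_REINIT_ATTEMPTS = 50  # limit stated in this section


def next_action(failed_attempts, serves_system_disk, is_cluster_member):
    """Return the driver's next action after an unrecoverable port error.

    failed_attempts is the number of reinitialization attempts that have
    already failed (hypothetical parameter names throughout).
    """
    if failed_attempts < MAX_REINIT_ATTEMPTS:
        return "log error and reinitialize port"
    if serves_system_disk or is_cluster_member:
        # The port is required, so the computer cannot continue without it.
        return "CIPORT-type bugcheck"
    return "take port off line"


print(next_action(3, False, True))
print(next_action(50, False, False))
print(next_action(50, True, False))
```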
C.11.5 LAN Device-Attention Entries
Example C-7 shows device-attention entries for the LAN. The left
column gives the name of a device register or a memory location. The
center column gives the value contained in that register or location,
and the right column gives an interpretation of that value.
Example C-7 LAN Device-Attention Entry
************************* ENTRY 80. **************************** (1)
ERROR SEQUENCE 26. LOGGED ON: SID 08000000
DATE/TIME 15-JAN-1994 11:30:53.07 SYS_TYPE 01010000 (2)
DEVICE ATTENTION KA630 (3)
SCS NODE: PHOBOS
NI-SCS SUB-SYSTEM, PHOBOS$PEA0: (4)
FATAL ERROR DETECTED BY DATALINK (5)
STATUS1 0000002C (6)
STATUS2 00000000
DATALINK UNIT 0001 (7)
DATALINK NAME 41515803 (8)
00000000
00000000
00000000
DATALINK NAME = XQA1:
REMOTE NODE 00000000 (9)
00000000
00000000
00000000
REMOTE ADDR 00000000 (10)
0000
LOCAL ADDR 000400AA (11)
4C07
ETHERNET ADDR = AA-00-04-00-07-4C
ERROR CNT 0001 (12)
1. ERROR OCCURRENCES THIS ENTRY
UCB$W_ERRCNT 0007
7. ERRORS THIS UNIT
The following table describes the LAN device-attention entries in
Example C-7.
(1) The first two lines are the entry heading. These lines contain the
number of the entry in this error log file, the sequence number of this
error, and the identification number (SID) of this computer. Each entry
in the log file contains such a heading.
(2) This line contains the date and time and the computer type.
(3) The next two lines contain the entry type, the processor type (KA630),
and the computer's SCS node name.
(4) This line shows the name of the subsystem and component that caused the
entry.
(5) This line shows the reason for the entry. The LAN driver has shut down
the data link because of a fatal error. The data link will be restarted
automatically, if possible.
(6) STATUS1 shows the I/O completion status returned by the LAN driver.
STATUS2 is the VCI event code delivered to PEDRIVER by the LAN driver.
The event values and meanings are as follows:
- Event code 1200: Port usable
- Event code 1201: Port unusable
- Event code 1202: Change address
If a message transmit was involved, the status applies to that transmit.
(7) DATALINK UNIT shows the unit number of the LAN device on which the
error occurred.
(8) DATALINK NAME is the name of the LAN device on which the error occurred.
(9) REMOTE NODE is the name of the remote node to which the packet was
being sent. If zeros are displayed, either no remote node was available
or no packet was associated with the error.
(10) REMOTE ADDR is the LAN address of the remote node to which the packet
was being sent. If zeros are displayed, no packet was associated with
the error.
(11) LOCAL ADDR is the LAN address of the local node.
(12) ERROR CNT. Because some errors can occur at extremely high rates, some
error log entries represent more than one occurrence of an error. This
field indicates how many. The errors counted occurred in the 3 seconds
preceding the timestamp on the entry.
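The interpreted lines in Example C-7 (DATALINK NAME = XQA1: and ETHERNET ADDR = AA-00-04-00-07-4C) can be reproduced from the raw values by reading them in little-endian byte order. The following Python sketch shows that decoding. The assumption that DATALINK NAME is a counted ASCII string is inferred from the example values, and the function names are hypothetical.

```python
def datalink_name(longwords, unit):
    """Decode the DATALINK NAME longwords into a device name.

    Assumption inferred from Example C-7: the first byte is a character
    count and the following bytes are ASCII characters.
    """
    raw = b"".join(value.to_bytes(4, "little") for value in longwords)
    length = raw[0]
    return f"{raw[1:1 + length].decode('ascii')}{unit}:"


def lan_address(longword, word):
    """Format a LOCAL ADDR or REMOTE ADDR value as a LAN address."""
    raw = longword.to_bytes(4, "little") + word.to_bytes(2, "little")
    return "-".join(f"{byte:02X}" for byte in raw)


# Values copied from Example C-7.
print(datalink_name([0x41515803, 0, 0, 0], 0x0001))
print(lan_address(0x000400AA, 0x4C07))
```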