F.11.3 Setting Up the Distributed Enable Filter
Use the values shown in Table F-15 to set up a filter, named
Distrib_Enable, for the distributed enable packet received event. Use
this filter to troubleshoot multiple LAN segments.
F.11.4 Setting Up the Distributed Trigger Filter
Use the values shown in Table F-16 to set up a filter, named
Distrib_Trigger, for the distributed trigger packet received event. Use
this filter to troubleshoot multiple LAN segments.
F.12 Messages
This section describes how to set up the distributed enable and
distributed trigger messages.
F.12.1 Distributed Enable Message
Table F-17 shows how to define the distributed enable message
(Distrib_Enable) by creating a new message. You must replace the source
address (nn nn nn nn nn nn) with the LAN address of the LAN
analyzer.
Table F-17 Setting Up the Distributed Enable Message (Distrib_Enable)
Field       | Byte Number | Value                         | ASCII
Destination | 1           | 01 4C 41 56 63 45             | .LAVcE
Source      | 7           | nn nn nn nn nn nn             |
Protocol    | 13          | 60 07                         | `.
Text        | 15          | 44 69 73 74 72 69 62 75 74 65 | Distribute
            | 25          | 64 20 65 6E 61 62 6C 65 20 66 | d enable f
            | 35          | 6F 72 20 74 72 6F 75 62 6C 65 | or trouble
            | 45          | 73 68 6F 6F 74 69 6E 67 20 74 | shooting t
            | 55          | 68 65 20 4C 6F 63 61 6C 20 41 | he Local A
            | 65          | 72 65 61 20 56 4D 53 63 6C 75 | rea VMSclu
            | 75          | 73 74 65 72 20 50 72 6F 74 6F | ster Proto
            | 85          | 63 6F 6C 3A 20 4E 49 53 43 41 | col: NISCA
F.12.2 Distributed Trigger Message
Table F-18 shows how to define the distributed trigger message
(Distrib_Trigger) by creating a new message. You must replace the
source address (nn nn nn nn nn nn) with the LAN address of the
LAN analyzer.
Table F-18 Setting Up the Distributed Trigger Message (Distrib_Trigger)
Field       | Byte Number | Value                         | ASCII
Destination | 1           | 01 4C 41 56 63 54             | .LAVcT
Source      | 7           | nn nn nn nn nn nn             |
Protocol    | 13          | 60 07                         | `.
Text        | 15          | 44 69 73 74 72 69 62 75 74 65 | Distribute
            | 25          | 64 20 74 72 69 67 67 65 72 20 | d trigger
            | 35          | 66 6F 72 20 74 72 6F 75 62 6C | for troubl
            | 45          | 65 73 68 6F 6F 74 69 6E 67 20 | eshooting
            | 55          | 74 68 65 20 4C 6F 63 61 6C 20 | the Local
            | 65          | 41 72 65 61 20 56 4D 53 63 6C | Area VMScl
            | 75          | 75 73 74 65 72 20 50 72 6F 74 | uster Prot
            | 85          | 6F 63 6F 6C 3A 20 4E 49 53 43 | ocol: NISC
            | 95          | 41                            | A
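The following Python sketch (not part of the HP 4972 environment) shows how the byte layouts in Tables F-17 and F-18 fit together. The all-zero source address is a placeholder that stands in for the LAN analyzer's own hardware address, just as nn nn nn nn nn nn does above; the function and variable names are illustrative only.

# Illustrative only: assembles the Distrib_Enable and Distrib_Trigger frames
# from the values in Tables F-17 and F-18. Replace SOURCE with the LAN
# analyzer's hardware address. (An actual transmission is also subject to
# the LAN's minimum frame length and framing rules.)

def build_message(dest: bytes, source: bytes, text: str) -> bytes:
    protocol = bytes([0x60, 0x07])                 # protocol field, bytes 13-14
    return dest + source + protocol + text.encode("ascii")

SOURCE = bytes(6)                                  # placeholder for nn nn nn nn nn nn

distrib_enable = build_message(
    dest=bytes([0x01, 0x4C, 0x41, 0x56, 0x63, 0x45]),   # .LAVcE
    source=SOURCE,
    text="Distributed enable for troubleshooting the Local Area VMScluster Protocol: NISCA",
)

distrib_trigger = build_message(
    dest=bytes([0x01, 0x4C, 0x41, 0x56, 0x63, 0x54]),   # .LAVcT
    source=SOURCE,
    text="Distributed trigger for troubleshooting the Local Area VMScluster Protocol: NISCA",
)

print(distrib_enable.hex(" "))                     # matches the Value column byte for byte
print(distrib_trigger.hex(" "))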
F.13 Programs That Capture Retransmission Errors
You can program the HP 4972 LAN Protocol Analyzer, as shown in the
following source code, to capture retransmission errors. The starter
program initiates the capture across all of the LAN analyzers. Only one
LAN analyzer should run a copy of the starter program. Other LAN
analyzers should run either the partner program or the scribe program.
The partner program is used when the initial location of the error is
unknown and when all analyzers should cooperate in the detection of the
error. Use the scribe program to trigger on a specific LAN segment as
well as to capture data from other LAN segments.
F.13.1 Starter Program
The starter program first sends the distributed enable message to the other
LAN analyzers. It then captures all of the LAN traffic and terminates either
when this LAN analyzer detects a retransmitted packet or when it receives the
distributed trigger sent by another LAN analyzer running the partner program.
The starter program shown in the following example is used to initiate
data capture on multiple LAN segments using multiple LAN analyzers. The
goal is to capture the data during the same time interval on all of the
LAN segments so that the reason for the retransmission can be located.
Store: frames matching LAVc_all
or Distrib_Enable
or Distrib_Trigger
ending with LAVc_TR_ReXMT
or Distrib_Trigger
Log file: not used
Block 1: Enable_the_other_analyzers
Send message Distrib_Enable
and then
Go to block 2
Block 2: Wait_for_the_event
When frame matches LAVc_TR_ReXMT then go to block 3
Block 3: Send the distributed trigger
Mark frame
and then
Send message Distrib_Trigger
F.13.2 Partner Program
The partner program waits for the distributed enable; then it captures
all of the LAN traffic and terminates as a result of either a
retransmission or the distributed trigger. Upon termination, this
program transmits the distributed trigger to make sure that other LAN
analyzers also capture the data at about the same time as when the
retransmitted packet was detected on this segment or another segment.
After the data capture completes, the data from multiple LAN segments
can be reviewed to locate the initial copy of the data that was
retransmitted. The partner program is shown in the following example:
Store: frames matching LAVc_all
or Distrib_Enable
or Distrib_Trigger
ending with Distrib_Trigger
Log file: not used
Block 1: Wait_for_distributed_enable
When frame matches Distrib_Enable then go to block 2
Block 2: Wait_for_the_event
When frame matches LAVc_TR_ReXMT then go to block 3
Block 3: Send the distributed trigger
Mark frame
and then
Send message Distrib_Trigger
F.13.3 Scribe Program
The scribe program waits for the distributed enable and then captures
all of the LAN traffic and terminates as a result of the distributed
trigger. The scribe program allows a network manager to capture data at
about the same time as when the retransmitted packet was detected on
another segment. After the data capture has completed, the data from
multiple LAN segments can be reviewed to locate the initial copy of the
data that was retransmitted. The scribe program is shown in the
following example:
Store: frames matching LAVc_all
or Distrib_Enable
or Distrib_Trigger
ending with Distrib_Trigger
Log file: not used
Block 1: Wait_for_distributed_enable
When frame matches Distrib_Enable then go to block 2
Block 2: Wait_for_the_event
When frame matches LAVc_TR_ReXMT then go to block 3
Block 3: Mark_the_frames
Mark frame
and then
Go to block 2
Appendix G NISCA Transport Protocol Channel Selection and Congestion Control
G.1 NISCA Transmit Channel Selection
This appendix describes PEDRIVER running on OpenVMS Version 7.3 (Alpha
and VAX) and PEDRIVER running on earlier versions of OpenVMS Alpha and
VAX.
G.1.1 Multiple-Channel Load Distribution on OpenVMS Version 7.3 (Alpha and VAX) or Later
While all available channels to a node can be used to receive
datagrams from that node, not all channels are necessarily used to
transmit datagrams to that node. The NISCA protocol chooses a set of
equally desirable channels to be used for datagram transmission, from
the set of all available channels to a remote node. This set of
transmit channels is called the equivalent channel set
(ECS). Datagram transmissions are distributed in round-robin
fashion across all the ECS members, thus maximizing internode cluster
communications throughput.
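As an illustration of the round-robin distribution only, here is a minimal Python sketch; the class and channel labels are hypothetical, not PEDRIVER data structures.

# Minimal sketch of round-robin transmit distribution across ECS members.
from itertools import cycle

class EquivalentChannelSet:
    def __init__(self, channels):
        self._rotation = cycle(channels)           # endless round-robin rotation

    def next_transmit_channel(self):
        return next(self._rotation)

ecs = EquivalentChannelSet(["EWA0->EWA0", "EWB0->EWA0", "EWA0->EWB0"])
for _ in range(6):
    print(ecs.next_transmit_channel())             # each member used in turn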
G.1.1.1 Equivalent Channel Set Selection
When multiple node-to-node channels are available, the OpenVMS Cluster
software bases the choice of which set of channels to use on the
following criteria, which are evaluated in strict precedence order:
- Packet loss history
Channels that have recently been losing LAN
packets at a high rate are termed lossy and will be
excluded from consideration. Channels that have an acceptable loss
history are termed tight and will be further
considered for use.
- Capacity
Next, capacity criteria for the current set of tight
channels are evaluated. The capacity criteria are:
- Priority
Management priority values can be assigned both to
individual channels and to local LAN devices. A channel's priority
value is the sum of these management-assigned priority values. Only
tight channels with a priority value equal to, or one less than, the
highest priority value of any tight channel will be further considered
for use.
- Packet size
Tight, equivalent-priority channels whose maximum usable packet size
equals the largest maximum usable packet size of any tight,
equivalent-priority channel will be further considered for use.
A channel that satisfies all of these capacity criteria is
classified as a peer. A channel that is deficient with
respect to any capacity criterion is classified as
inferior. A channel that exceeds one or more of the
current capacity criteria and meets the others is
classified as superior. Note that detection of a
superior channel will immediately result in recalculation of the
capacity criteria for membership. This recalculation will result in the
superior channel's capacity criteria becoming the ECS's capacity
criteria, against which all tight channels will be evaluated.
Similarly, if the last peer channel becomes unavailable or lossy,
the capacity criteria for ECS membership will be recalculated. This
will likely result in previously inferior channels becoming classified
as peers. Channels whose capacity values have not been evaluated
against the current ECS membership capacity criteria will sometimes be
classified as ungraded. Since they cannot affect the
current ECS membership criteria, lossy channels are marked as ungraded
as a computational expedient when a complete recalculation of ECS
membership is being performed.
- Delay
Channels that meet the preceding ECS membership criteria
will be used if their average round-trip delays are closely matched to
that of the fastest such channel---that is, they are
fast. A channel that does not meet the ECS membership
delay criteria is considered slow. The delay of
each channel currently in the ECS is measured using cluster
communications traffic sent using that channel. If a channel has not
been used to send a datagram for a few seconds, its delay will be
measured using a round-trip handshake. Thus, a lossy or slow channel
will be measured at intervals of a few seconds to determine whether its
delay, or datagram loss rate, has improved enough so that it meets the
ECS membership criteria.
Using the terminology introduced in this section, the ECS members are
the current set of tight, peer, and fast channels.
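The classification rules above can be summarized in the following Python sketch. The field names, the delay tolerance, and the data structure are assumptions made for illustration; they are not PEDRIVER's internal representation.

# Illustrative sketch of ECS selection: loss history, then priority, then
# packet size, then delay. Thresholds and field names are assumptions.
from dataclasses import dataclass

@dataclass
class Channel:
    name: str
    lossy: bool            # recent packet loss above the acceptable rate
    priority: int          # management priority of channel + local LAN device
    max_packet_size: int   # maximum usable packet size
    delay: float           # average round-trip delay (seconds)

def select_ecs(channels, delay_tolerance=1.5):
    tight = [c for c in channels if not c.lossy]             # packet loss history
    if not tight:
        return []
    top = max(c.priority for c in tight)                     # priority: highest, or
    eligible = [c for c in tight if c.priority >= top - 1]   # one less than highest
    best_size = max(c.max_packet_size for c in eligible)     # packet size
    peers = [c for c in eligible if c.max_packet_size == best_size]
    fastest = min(c.delay for c in peers)                    # delay: close to fastest
    return [c for c in peers if c.delay <= fastest * delay_tolerance]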
G.1.1.2 Local and Remote LAN Adapter Load Distribution
Once the ECS member channels are selected, they are ordered using an
algorithm that attempts to arrange them so as to use all local adapters
for packet transmissions before returning to reuse a local adapter.
Also, the ordering algorithm attempts to do the same with all remote
LAN adapters. Once the order is established, it is used in round-robin
fashion for packet transmissions.
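The ordering step can be pictured with the sketch below. It interleaves channels by local adapter only, whereas the algorithm described here also attempts the same rotation across remote adapters; the adapter names are hypothetical.

# Sketch of ordering ECS members so that consecutive transmissions use
# different local LAN adapters before any local adapter is reused.
# (The full algorithm also attempts to rotate across remote adapters.)
from collections import defaultdict, deque

def order_for_round_robin(channels):
    """channels: list of (local_adapter, remote_adapter) pairs."""
    by_local = defaultdict(deque)
    for ch in channels:
        by_local[ch[0]].append(ch)
    ordered, queues = [], deque(by_local.values())
    while queues:                        # one channel from each local adapter in turn
        q = queues.popleft()
        ordered.append(q.popleft())
        if q:
            queues.append(q)
    return ordered

print(order_for_round_robin(
    [("EWA", "XQA"), ("EWA", "XQB"), ("EWB", "XQA"), ("EWB", "XQB")]))
# [('EWA', 'XQA'), ('EWB', 'XQA'), ('EWA', 'XQB'), ('EWB', 'XQB')]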
With these algorithms, PEDRIVER will make a best effort at utilizing
multiple LAN adapters on a server node that communicates continuously
with a client that also has multiple LAN adapters, as well as with a
number of clients. In a two-node cluster, PEDRIVER will actively
attempt to use all available LAN adapters that have usable LAN paths to
the other node's LAN adapters, and that have comparable capacity
values. Thus, additional adapters provide both higher availability and
alternative paths that can be used to avoid network congestion.
G.1.2 Preferred Channel (OpenVMS Version 7.2 and Earlier)
This section describes the transmit-channel selection algorithm used by
OpenVMS VAX and Alpha prior to OpenVMS Version 7.3.
All available channels with a node can be used to receive datagrams
from that node. PEDRIVER chooses a single channel on which to transmit
datagrams, from the set of available channels to a remote node.
The driver software chooses a transmission channel to each remote node.
A selection algorithm for the transmission channel makes a best effort
to ensure that messages are sent in the order they are expected to be
received. Sending the messages in this way also maintains compatibility
with previous versions of the operating system. The currently selected
transmission channel is called the preferred channel.
At any point in time, the TR level of the NISCA protocol can modify its
choice of a preferred channel based on the following:
- Minimum measured incoming delay
The NISCA protocol routinely
measures HELLO message delays and uses these measurements to pick the
most lightly loaded channel on which to send messages.
- Maximum datagram size
PEDRIVER favors channels with large
datagram sizes. For example, an FDDI-to-FDDI channel is favored over an
FDDI-to-Ethernet channel or an Ethernet-to-Ethernet channel. If your
configuration uses FDDI to Ethernet bridges, the PPC level of the NISCA
protocol segments messages into the smaller Ethernet datagram sizes
before transmitting them.
PEDRIVER continually uses received HELLO messages to compute the
incoming network delay value for each channel. Thus each channel's
incoming delay is recalculated at intervals of approximately 2 to 3
seconds. PEDRIVER assumes that the network is a broadcast medium (for
example, an Ethernet wire or an FDDI ring) and, therefore, that incoming
and outgoing delays are symmetric.
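The two selection criteria can be pictured with this sketch. The precedence it applies (datagram size first, then measured delay) and the sizes and delays shown are assumptions made for illustration, not PEDRIVER values.

# Illustrative sketch of choosing the preferred channel: favor the largest
# usable datagram size, then the smallest measured incoming HELLO delay.
from collections import namedtuple

Channel = namedtuple("Channel", "name max_datagram_size incoming_delay")

def choose_preferred_channel(channels):
    return min(channels, key=lambda c: (-c.max_datagram_size, c.incoming_delay))

print(choose_preferred_channel([
    Channel("FDDI-to-FDDI", 4000, 0.0021),
    Channel("FDDI-to-Ethernet", 1400, 0.0012),
    Channel("Ethernet-to-Ethernet", 1400, 0.0015),
]).name)                                  # -> FDDI-to-FDDI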
PEDRIVER switches the preferred channel based on observed network
delays or network component failures. Switching to a new transmission
channel sometimes causes messages to be received out of the desired
order. PEDRIVER uses a receive resequencing cache to reorder these
messages instead of discarding them, which eliminates unnecessary
retransmissions.
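A receive resequencing cache of the kind described here can be sketched as follows; the class name and interface are hypothetical.

# Hypothetical sketch of a receive resequencing cache: messages that arrive
# out of order are held and delivered in order, rather than discarded.
class ResequencingCache:
    def __init__(self, first_expected=0):
        self.expected = first_expected
        self.held = {}                      # sequence number -> message

    def receive(self, seq, message):
        """Return any messages that can now be delivered in order."""
        if seq != self.expected:
            self.held[seq] = message        # early arrival; hold it
            return []
        deliverable = [message]
        self.expected += 1
        while self.expected in self.held:   # release messages queued behind it
            deliverable.append(self.held.pop(self.expected))
            self.expected += 1
        return deliverable

cache = ResequencingCache()
print(cache.receive(1, "B"))                # [] -- held until 0 arrives
print(cache.receive(0, "A"))                # ['A', 'B']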
With these algorithms, PEDRIVER has a greater chance of utilizing
multiple adapters on a server node that communicates continuously with
a number of clients. In a two-node cluster, PEDRIVER will actively use
at most two LAN adapters: one to transmit and one to receive.
Additional adapters provide both higher availability and alternative
paths that can be used to avoid network congestion. As more nodes are
added to the cluster, PEDRIVER is more likely to use the additional
adapters.
G.2 NISCA Congestion Control
Network congestion occurs as the result of complex interactions of
workload distribution and network topology, including the speed and
buffer capacity of individual hardware components.
Network congestion can have a negative impact on cluster performance in
several ways:
- Moderate levels of congestion can lead to increased queue lengths
in network components (such as adapters and bridges) that in turn can
lead to increased latency and slower response.
- Higher levels of congestion can result in the discarding of packets
because of queue overflow.
- Packet loss can lead to packet retransmissions and, potentially,
even more congestion. In extreme cases, packet loss can result in the
loss of OpenVMS Cluster connections.
At the cluster level, these congestion effects appear as delays in
cluster communications (for example, delays in lock transactions, served
I/Os, and ICC messages). The user-visible effects of network congestion
can be sluggish application response or loss of throughput.
Thus, although a particular network component or protocol cannot
guarantee the absence of congestion, the NISCA transport protocol
implemented in PEDRIVER incorporates several mechanisms to mitigate the
effects of congestion on OpenVMS Cluster traffic and to avoid having
cluster traffic exacerbate congestion when it occurs. These mechanisms
affect the retransmission of packets carrying user data and the
multicast HELLO datagrams used to maintain connectivity.
G.2.1 Congestion Caused by Retransmission
Associated with each virtual circuit from a given node is a
transmission window size, which indicates the number of packets that
can be outstanding to the remote node (that is, the number of packets
that can be sent to the node at the other end of the virtual circuit
before receiving an acknowledgment [ACK]).
If the window size is 8 for a particular virtual circuit, then the
sender can transmit up to 8 packets in a row but, before sending the
ninth, must wait until receiving an ACK indicating that at least the
first of the 8 has arrived.
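The window rule can be illustrated with the following sketch; the class, its cumulative-ACK behavior, and the names are assumptions for illustration only.

# Illustrative sketch of an 8-packet transmission window: at most
# window_size packets may be unacknowledged at any time.
class WindowedSender:
    def __init__(self, window_size=8):
        self.window_size = window_size
        self.unacked = []                   # sequence numbers awaiting an ACK

    def can_send(self):
        return len(self.unacked) < self.window_size

    def send(self, seq):
        if not self.can_send():
            raise RuntimeError("window full: wait for an ACK")
        self.unacked.append(seq)

    def ack(self, seq):
        self.unacked = [s for s in self.unacked if s > seq]   # ACK covers seq and earlier

sender = WindowedSender()
for seq in range(8):
    sender.send(seq)                        # a ninth send would raise until an ACK arrives
sender.ack(0)
sender.send(8)                              # now permitted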
If an ACK is not received, a timeout occurs, the packet is presumed lost
and must be retransmitted. If another timeout occurs for a retransmitted
packet, the timeout interval is significantly increased and the packet is
retransmitted again. After a large number of consecutive retransmissions
of the same packet has occurred, the virtual circuit is closed.
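This retransmission behavior can be sketched as follows; the initial timeout, backoff factor, cap, and retry limit are illustrative placeholders, not the limits PEDRIVER actually uses.

# Illustrative retransmission schedule: each timeout on the same packet
# significantly increases the wait, and the virtual circuit is closed after
# too many consecutive retransmissions. All numbers are placeholders.
class VirtualCircuitClosed(Exception):
    pass

def retransmit_timeouts(initial=1.0, backoff=2.0, cap=60.0, max_retries=30):
    """Yield successive timeout intervals to wait for an ACK of one packet."""
    timeout = initial
    for _ in range(max_retries):
        yield timeout                       # wait this long before retransmitting
        timeout = min(timeout * backoff, cap)
    raise VirtualCircuitClosed("too many consecutive retransmissions")

# Usage: iterate over retransmit_timeouts(), waiting up to each interval for
# an ACK and breaking out of the loop as soon as the ACK arrives.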