hp Reliable Transaction Router
System Manager's Manual

5.7 Application Considerations

Although applications need not be directly concerned with shadowing, certain points must be considered when implementing performance-boosting optimizations:

Anything specific to the executing node should not be stored in the database, since this would lead to diverging copies.
Any physical reference to the transaction that is unique to the executing server (for example, Channel ID, system time, or DB-key) should not be passed back to the client for future reference in its subsequent messages. This could lead to inconsistent handling when a different server is involved in shadow operations.
This consideration is also valid for recovery of non-shadowed servers.
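
The second point can be illustrated with a minimal sketch. Python is used here only for illustration; the record layout, the field names, and the build_reply helper are hypothetical and are not part of RTR. The idea is that a server reply carries only logical, application-level identifiers, so that any server (primary, shadow, or a recovering non-shadowed server) can interpret a later client message:

```python
from dataclasses import dataclass, asdict

# Hypothetical names for values that are local to the executing server
# and therefore must not be sent back to the client.
NODE_SPECIFIC_FIELDS = {"channel_id", "system_time", "db_key"}


@dataclass
class OrderRecord:
    order_number: str   # logical key, identical on every copy of the database
    status: str
    channel_id: int     # local to the executing server
    system_time: float  # local clock of the executing node
    db_key: int         # physical record address in the local database


def build_reply(record: OrderRecord) -> dict:
    """Return only the fields that mean the same thing on every server."""
    return {name: value for name, value in asdict(record).items()
            if name not in NODE_SPECIFIC_FIELDS}


reply = build_reply(OrderRecord("ORD-1001", "accepted", 131073, 1700000000.0, 42))
print(reply)  # {'order_number': 'ORD-1001', 'status': 'accepted'}
```

A client that later refers to the order by order_number can be served by whichever server handles the message, because the key does not depend on the node that executed the original transaction.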

For more information on designing applications, see the Tolerating Site Disaster section in the Reliable Transaction Router Application Design Guide.

5.8 Server States

The current state of a server can be examined using the SHOW SERVER/FULL command. For example,

RTR> show server/full
Servers:
Process-id: 13340
Facility: RTR$DEFAULT_FACILITY
Channel: 131073
Flags: SRV
State: active (1)
Low Bound: High Bound: 87 13
rcpnam: "RTR$DEFAULT_CHANNEL"
User Events: 0
RTR Events: 0
Partition-Id: 16777216

Process-id: 13340
Facility: RTR$DEFAULT_FACILITY
Channel: 196610
Flags: SRV
State: active
Low Bound: 88 13
High Bound: 0f'
rcpnam: "CHAN2"
User Events: 0
RTR Events: 0
Partition-Id: 16777217

(1) Server state

Figure 5-1 shows the backend server states that can occur and that appear in the State: field.

Figure 5-1 Backend Server States

5.9 Client States

The current state of a client process can be examined using the SHOW CLIENT/FULL command. For example,

RTR> show client/full
Clients:
Process-id: 13340
Facility: RTR$DEFAULT_FACILITY
Channel: 458755
Flags: CLI
State: declared (1)
rcpnam: "CHAN3"
User Events: 255
RTR Events: 0

(1) Client state

Figure 5-2 shows the client states that can occur and that appear in the State: field.

Figure 5-2 Frontend Client States

5.10 Partition States

The current state of a partition can be examined using the SHOW PARTITION/FULL command on the routers and the backends. Using the /ROUTER qualifier shows the states as seen from the routers, and using the /BACKEND qualifier shows the states as seen from the backends.

Router partitions:

RTR> show partition/router/full
Facility: RTR$DEFAULT_FACILITY
State: ACTIVE (1)
Low Bound: 0
High Bound: 4294967295
Failover policy: fail_to_standby
Backends: node10
States: active (2)
Primary Main: node10
Shadow Main:

(1) Router state
(2) Backend state

Backend partitions:

RTR> show partition/backend/full
Partition name: RTR$DEFAULT_PARTITION_16777217
Facility: RTR$DEFAULT_FACILITY
State: active (1)
Low Bound: "aaaa"
High Bound: "mmmm" (2)
Active Servers: 0
Free Servers: 1 (3)
Transaction presentation: active
Last Rcvy BE:
Txns Active: 0
Txns Rcvrd: 0
Failover policy: fail_to_standby
Key range ID: 16777217 (4)

Partition name: RTR$DEFAULT_PARTITION_16777218
Facility: RTR$DEFAULT_FACILITY
State: active
Low Bound: "nnnn"
High Bound: "zzzz"
Active Servers: 0
Free Servers: 1
Transaction presentation: active
Last Rcvy BE:
Txns Active: 0
Txns Rcvrd: 0
Failover policy: fail_to_standby
Key range ID: 16777218

(1) Backend server state
(2) Key range for partition
(3) Server application channels that are available
(4) Key range or partition identification

Figure 5-3 shows the partition states that can occur and that appear in the State: field.

Figure 5-3 Router Partition States
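
When SHOW PARTITION/BACKEND/FULL output has been captured to a text file, the Free Servers field can be extracted with a short script. The following is a minimal sketch, assuming the field labels shown in the example above; the exact layout of real output may differ, so the code matches on labels rather than on line positions. The function names and the sample text (shortened, with the second partition changed to zero free servers for illustration) are assumptions, not RTR output.

```python
import re

# Shortened sample modelled on the example above; the second partition
# is altered to show zero free servers.
SAMPLE = """\
Partition name: RTR$DEFAULT_PARTITION_16777217
State: active
Active Servers: 0
Free Servers: 1
Partition name: RTR$DEFAULT_PARTITION_16777218
State: active
Active Servers: 0
Free Servers: 0
"""


def free_servers_by_partition(text: str) -> dict:
    """Map each partition name to its reported number of free servers."""
    result = {}
    # Start a new chunk at every "Partition name:" label.
    for chunk in re.split(r"(?=Partition name:)", text):
        name = re.search(r"Partition name:\s*(\S+)", chunk)
        free = re.search(r"Free Servers:\s*(\d+)", chunk)
        if name and free:
            result[name.group(1)] = int(free.group(1))
    return result


def partitions_without_free_servers(text: str) -> list:
    """List partitions whose Free Servers count is zero in this capture."""
    return [name for name, free in free_servers_by_partition(text).items()
            if free == 0]


print(partitions_without_free_servers(SAMPLE))
```

A single zero reading is not conclusive; it is a partition that stays at zero across repeated captures that suggests the server pool is too small for the load.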

Chapter 6
Troubleshooting RTR Applications

This chapter contains information useful for analyzing performance aspects of RTR, especially in large configurations.

To manage remote nodes, you must have either proxy accounts or rsh access to them. Use RTR remote commands to manage remote nodes.

You should also add and grant operator privileges to the accounts used to manage the RTR network.

6.1 RTR Monitor Pictures

RTR supplies many monitor pictures to help you troubleshoot your application. To display a monitor picture, use the following command at the RTR prompt:

RTR> MONITOR picture-name

The following table provides suggested monitor pictures to display when you encounter problems:

For this type of failure:	Use these monitor pictures:
Most common problems	SYSTEM
Connection failures	ACCFAIL, CONNECTS, FRONTEND, LINK, NETSTAT, STALLS
Transaction sequence problems	CALLS
Channel problems	CALLS, CHANNEL, PARTIT
Quorum problems	QUORUM, ROLEQUOR
V2 interface API	V2CALLS
Journal problems	JCALLS, JOURNAL
API problems	APP2ACP, ACP2APP, REJECTS, REJHIST, ROUTERS
XA interface problems	XA
Application Problems	APP2ACP, ACP2APP, CALLS, CHANNEL, PARTIT, REJECTS, REJHIST, ROUTERS

See Chapter 7 for descriptions and examples of the monitor pictures, and Chapter 8 for the full syntax of the MONITOR command.
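
For operators who script their diagnostics, the table above can be held as a simple lookup. The sketch below is illustrative only: the dictionary contents are copied from the table, while the monitor_commands helper and its name are assumptions. It produces the MONITOR command text to type at the RTR prompt and does not run RTR itself.

```python
# Failure type -> suggested monitor pictures, copied from the table above.
SUGGESTED_PICTURES = {
    "most common problems": ["SYSTEM"],
    "connection failures": ["ACCFAIL", "CONNECTS", "FRONTEND", "LINK",
                            "NETSTAT", "STALLS"],
    "transaction sequence problems": ["CALLS"],
    "channel problems": ["CALLS", "CHANNEL", "PARTIT"],
    "quorum problems": ["QUORUM", "ROLEQUOR"],
    "v2 interface api": ["V2CALLS"],
    "journal problems": ["JCALLS", "JOURNAL"],
    "api problems": ["APP2ACP", "ACP2APP", "REJECTS", "REJHIST", "ROUTERS"],
    "xa interface problems": ["XA"],
    "application problems": ["APP2ACP", "ACP2APP", "CALLS", "CHANNEL",
                             "PARTIT", "REJECTS", "REJHIST", "ROUTERS"],
}


def monitor_commands(failure_type: str) -> list:
    """Return the MONITOR commands suggested for a failure type."""
    pictures = SUGGESTED_PICTURES[failure_type.strip().lower()]
    return [f"MONITOR {picture}" for picture in pictures]


for command in monitor_commands("Quorum problems"):
    print(command)
```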

6.2 Enabling RTR Logging

Many problems can be better analyzed when RTR logging has been enabled.

RTR logging output can be directed to a file, for example at RTR startup:

$ RTR SET LOG /FILE=logfile.dat

You should monitor the size of the log file; archive and purge as necessary.
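
One way to keep an eye on the log file is a small periodic check. The sketch below is an assumption-laden example, not an RTR facility: the size limit, the archive directory name, and both helper functions are hypothetical. It only makes a compressed copy of the log and leaves the original in place; when and how to purge a log file that RTR may still have open is left to site procedures.

```python
import gzip
import shutil
from pathlib import Path


def needs_archiving(log_path: Path, limit_bytes: int) -> bool:
    """True if the log file exists and is larger than the chosen limit."""
    return log_path.exists() and log_path.stat().st_size > limit_bytes


def archive_copy(log_path: Path, archive_dir: Path) -> Path:
    """Write a gzip-compressed copy of the log; the original is not modified."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / (log_path.name + ".gz")
    with open(log_path, "rb") as source, gzip.open(target, "wb") as archive:
        shutil.copyfileobj(source, archive)
    return target


if __name__ == "__main__":
    log = Path("logfile.dat")             # file name from the example above
    limit = 50 * 1024 * 1024              # hypothetical 50 MiB threshold
    if needs_archiving(log, limit):
        print("archived to", archive_copy(log, Path("rtr_log_archive")))
```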

6.3 Starting a Facility

When a facility is started or restarted and servers are declared, RTR's recovery features require it to search the journal files of the backend nodes in the facility. This allows recovery of any incomplete transactions that were in flight when the facility last existed. However, if some of the facility's recovery information exists on a backend that is not available at startup, RTR waits for access to the journal on that backend and thus appears to "hang".

This situation can be detected by using MONITOR RECOVERY; backend nodes will be waiting for access to recovery journals. If this is the case, you may follow one of these procedures to continue the startup:

Delete the facility and recreate it without the unavailable backend.
Begin the startup by creating a smaller facility and using the EXTEND FACILITY and TRIM FACILITY commands.
Force a partition to abandon recovery with a SET PARTITION command.

6.4 Analyzing RTR Application Performance

This section provides guidance for System Managers who are analyzing an RTR application that is not functioning correctly.

If an application using RTR hangs, use the following checklist to analyze the situation.

Is there a system-level problem on the node concerned, such as a full disk?

Has RTR been started? Is RTR running correctly?

$ RTR SHOW RTR
RTR running on node MYNODE in SYSTEM mode

Are the application programs running? RTR lists the processes using RTR with the following command:
$ RTR SHOW PROCESS
The user application processes should be in this list.
Has the application stopped?
Use MONITOR SYSTEM to check for problems. If it indicates a problem with a subsystem, you can get additional information by monitoring that subsystem.
Network partitioning can also be a problem; this can happen if half or fewer of the configured backends and routers are reachable. To recognize network partitioning, use the MONITOR QUORUM picture. If the number of retries keeps increasing without a corresponding increase in the reason counters (CNF/RCH/QRT), you have a partitioned network.
To check the individual links, use the MONITOR CONNECTS picture. This picture displays the link protocol for connected links, and the reason for a failed connection on any links.
Are the application programs running correctly? Use MONITOR CALLS to examine the state of the participating application processes.
- Does the number of rtr_open_channel calls match the number of rtr_mt_opened messages? If they do not match, use the MONITOR CONNECTS picture to check individual links.
- Use MONITOR CONNECTS to make sure the connection to the router is OK.
- Look in the RTR log file for error messages concerning any unconnected node.
- Look at any unconnected nodes found, and determine:
  - Is RTR running?
  - Has the RTR command CREATE FACILITY been issued?
  - Are there DECnet problems, e.g., executor maximum links too low? Are the router nodes reachable?
Is a server waiting for an rtr_mt_accepted or rtr_mt_rejected message (in other words, has it voted, but not yet received confirmation of the outcome of the transaction)? This is most likely a problem with the application logic. Also check the database for a possible deadlock situation.
Is a client channel declaration not completing? Client channels need connectivity via a router node to at least one server channel before they receive an rtr_mt_opened message. If the server is up and running, use MONITOR QUORUM and MONITOR CONNECTS to check connectivity.
Has a client called rtr_receive_message on a channel, expecting an rtr_mt_accepted message or a reply sent with rtr_reply_to_client, and not received it within a reasonable time? Check the application logic and the database for a deadlock.
Has a client channel called rtr_receive_message expecting an rtr_mt_accepted or rtr_mt_rejected message that is not forthcoming? If yes, RTR is awaiting the necessary resources for message transmission to the backend servers. Reasons could be:
- Congestion of a network link, frontend to router or router to backend
- Server application not correctly dequeuing messages
- System-level problem on router or backend node
Use MONITOR TPS to check the transaction processing rate of each process on a system. A system's capacity is generally expressed as the throughput of the servers. If the rates are low or sporadic, contention may be the cause. For systems with throughput less than 10 tps, the MONITOR TPSLO display provides greater granularity in the associated bar graph.
Adding server instances can often decrease application throughput if all transactions access common data elements. Partitioning data so that server instances do not interfere with each other is one way to resolve database contention.
Use the command SHOW PARTITION/FULL to display the backlog of transactions on a server pool (partition). If the number of free servers is continually zero, the arrival rate of transactions is greater than the processing capacity of the existing server pool.
The MONITOR QUEUES picture also shows backlogs; it displays queuing by partition. If the service time and arrival rate of transactions are large, there are not enough servers to process the load. The remedy is to start additional server instances or decrease the processing time of each transaction. A large number of queued transactions or messages can also be caused by contention that limits the efficiency of the servers.
Check the state of links with:
$ RTR SHOW FACILITY /LINK
Check if there are sufficient concurrent application server channels to handle the transaction load; messages may have to be queued for long periods before being processed.
Use MONITOR QUEUES to check the number of outstanding messages for each partition.
Check for congestion by examining the network links with the longest delays by using MONITOR TRAFFIC.
Use the command MONITOR STALLS to determine if the network needs tuning.
If there is no congestion, use MONITOR FLOW to discover if a link has credits for data traffic, or if the application requires more bandwidth than is available.
If the RTRACP dies when adding a facility (which has a backend role on the node), suspect journal file difficulties. Ensure that the journal file is not corrupted, or incompatible with the running RTR version. In the event of journal file corruption, please contact your HP support office.
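
The relationship between arrival rate, service time, and the number of servers mentioned in the checklist above can be made concrete with a small calculation. This is a generic queueing estimate, not an RTR formula: the offered load is the arrival rate multiplied by the average service time, and the server pool must be at least that large to keep up. The function names and the 0.8 target utilization used as a default are assumptions chosen for illustration.

```python
import math


def offered_load(arrival_rate_tps: float, service_time_s: float) -> float:
    """Average number of servers kept busy by the incoming transactions."""
    return arrival_rate_tps * service_time_s


def min_servers(arrival_rate_tps: float, service_time_s: float,
                target_utilization: float = 0.8) -> int:
    """Smallest pool size that keeps average utilization at or below target.

    The 0.8 default is an illustrative assumption, not an RTR recommendation.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    load = offered_load(arrival_rate_tps, service_time_s)
    return math.ceil(load / target_utilization)


# 40 transactions per second, each taking 0.25 s of server time.
print(offered_load(40, 0.25))  # 10.0
print(min_servers(40, 0.25))   # 13
```

If the pool is smaller than the offered load, the number of free servers shown by SHOW PARTITION/FULL stays at zero and the backlog grows; the remedy is more server instances or a shorter processing time per transaction, as described above.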

6.5 Server Crashes

Analyze why the server crashed before you restart it. Cascading failures could present a problem; note also that restarting the server prevents failover.

6.6 Link Connect Failures

The following table explains the meaning of link connect failure codes:

Code	Text	Implications
NOTRECOGNISED	Node not recognized	Remote node that received the connection request does not have the local node in its RTR configuration.
REFUSED	Connection refused	Indicates one of the following conditions on the remote system: either RTR is not running, or a requested network protocol is not installed.
FACNOTDEC	Facility not declared	The requested facility is not configured on the remote node.
NODENOTCFG	Node not configured	The remote node has the local node in its configuration, though not as part of the requested facility.
ROLESMISMATCH	Roles mismatch	The remote node has the local node configured in the requested facility, but in a role other than the one requested.

Any of the above errors can occur as the result of the connection request arriving at the wrong node for any of the following reasons:

You may have mistyped a name on either (or both) the local and remote nodes.
DNS at either the local or remote nodes (or both) may be giving incorrect address information for the names used.
The nodes are not in the same RTR group.
The nodes do not support a common network protocol.
There is a possible wildcarding error (on routers) in the facility definition.
Wildcarding or use of the "tunnel." prefix may be necessary due to a firewall.

6.7 Rejected Transactions

The following table explains the meaning of rejected transaction codes:

Code	Text	Implications
NODSTFND	No destination found	Primary and all alternate servers for a partition cannot be reached by the client application. This situation can be caused by network problems or services which have not been started or have crashed.
JNLFULL	Journal full	May occur when the RTR journal is full. Note that RTR reserves a percentage of the journal to ensure that in-progress transactions can be completed. The JNLFULL error is most likely to be seen with shadow servers running in remember mode, but can also be caused by many transactions being queued to an unresponsive server.
DLKTXRES	Deadlock detected transaction rescheduled	May occur during the commit cycle for multi-participant transactions or in extreme failover situations when the order of transactions must be corrected. This reject reason indicates that two transactions were interfering with each other. RTR rejects one branch of the offending transactions to clear the deadlock. Since this transaction branch is subsequently rescheduled by RTR, this reject can be considered informational.
TIMEOUT	Time out	Occurs if the `rtr_send_to_server` timeout provided by the client application expires. This reject indicates poor responsiveness by the service.
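
As a rough sketch of how the table above might be used in an operator script, the reject codes can be sorted into those that are informational and those that need investigation. The classification follows the table (only DLKTXRES is described there as informational); the RejectInfo type, the triage function, and the wording of the notes are assumptions made for this example.

```python
from typing import NamedTuple


class RejectInfo(NamedTuple):
    text: str
    informational: bool
    note: str


# Codes and texts from the table above; notes paraphrase the implications.
REJECT_CODES = {
    "NODSTFND": RejectInfo("No destination found", False,
                           "check the network and whether services are started"),
    "JNLFULL": RejectInfo("Journal full", False,
                          "check journal space and for unresponsive servers"),
    "DLKTXRES": RejectInfo("Deadlock detected transaction rescheduled", True,
                           "RTR reschedules the rejected transaction branch"),
    "TIMEOUT": RejectInfo("Time out", False,
                          "check the responsiveness of the service"),
}


def triage(code: str) -> str:
    """Return a one-line summary for a rejected transaction code."""
    code = code.strip().upper()
    info = REJECT_CODES.get(code)
    if info is None:
        return f"{code}: not in the table above"
    level = "informational" if info.informational else "investigate"
    return f"{code} ({info.text}) [{level}] {info.note}"


print(triage("DLKTXRES"))
print(triage("TIMEOUT"))
```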

6.8 Using the Snapshot Procedure

Certain difficulties can be more easily investigated if a snapshot of the problem node is made. Make a snapshot if the application hangs, causes delays, or seems to be causing other problems.

OpenVMS

On OpenVMS systems, a snapshot is made by executing a command file:

$ @SYS$MANAGER:RTR$SNAPSHOT.COM

The output is a file named nodename_RTR_DIAGS.TMP.

Information in this file can help to determine the possible causes of a fault (OpenVMS, DECnet, RTR, environment, database, application, and so on). The information includes numerous RTR monitor pictures, executable image versions, process states, and so on.

UNIX

On UNIX systems, make a snapshot by entering the following command on the problem node:

# rtr_snapshot.sh

The information displayed on the screen includes many RTR monitor pictures, executable image versions, process states, and other information.

Windows

To take a snapshot on Windows, click on the Snapshot icon on the RTR menu. A DOS-style window with the title "Snapshot" appears as the snapshot is taken. The file rtr_snapshot.log is created in the directory where RTR runs, for example, C:\Program Files\HP\RTR. You can read this file with an editor such as Notepad.

Note

If using Microsoft Windows Scripting Host, the minimum version is 5.6 for use with RTR. With an earlier version of the Scripting Host, RTR snapshot will run with reduced functionality.

To obtain the latest Scripting Host software, use the Microsoft download center at http://msdn.microsoft.com/downloads.

Sun

To take a snapshot on Sun, use fssnap. This copies the original filesystem blocks into a file as they are changed, with some performance degradation.

6.9 Generating a Process Dump

OpenVMS Systems

Certain potential difficulties can be more easily investigated by RTR Support if a dump of the RTR ACP is available. It shows diagnostics generated if unhandled exceptions occur.

The file SYS$MANAGER:RTR$STARTUP.COM can be altered to include the definition of the logical name RTR$DUMP_DIRECTORY which specifies the device and directory where the dump is to be generated.

Since an RTR dump file typically uses about 5000 blocks, enough space should be available on the chosen disk. For a very large node installation, or a large number of links, the dump file may be up to 20,000 blocks.
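
To relate these block counts to more familiar units, note that an OpenVMS disk block is 512 bytes. The short calculation below converts the figures above; the helper names are illustrative, and the free-space value passed to dump_fits is a hypothetical number that would in practice come from the chosen disk.

```python
VMS_BLOCK_BYTES = 512  # size of one OpenVMS disk block


def blocks_to_mib(blocks: int) -> float:
    """Convert a count of 512-byte blocks to mebibytes."""
    return blocks * VMS_BLOCK_BYTES / (1024 * 1024)


def dump_fits(free_blocks: int, dump_blocks: int = 20_000) -> bool:
    """True if the disk has room for a dump of the given size.

    The default uses the upper figure quoted above for large installations.
    """
    return free_blocks >= dump_blocks


print(round(blocks_to_mib(5_000), 2))   # 2.44
print(round(blocks_to_mib(20_000), 2))  # 9.77
print(dump_fits(free_blocks=15_000))    # False
```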

To prepare for dump creation, make sure that:

SYS$SHARE:IMGDMP has been installed
RTR has been started from an account having CMKRNL privilege
The account used to start RTR has write access to the directory specified by RTR$DUMP_DIRECTORY

An RTR ACP dump can be created as follows:

$ RTR
RTR> SET MODE /UNSUPPORTED
RTR> DEBUG ACP ^G
RTR> SET MODE /NOUNSUPPORTED
^Z
$

Unsupported commands should be used with care.

UNIX Systems

UNIX core files are generated with no special configuration, but their name and location may vary depending on operating system settings and how RTR is started up. The file rtr_error*.log is usually created in /rtr.

Windows Systems

On Windows systems, a process dump file can be generated by enabling the Dr. Watson post-mortem crash analyzer. This is done by entering the MS-DOS command:

(%WINDIR%\drwtsn32 -i)

The files created are %WINDIR%\DRWTSN32.LOG and %WINDIR%\USER.DMP.

These files should be included with any problem report submitted to RTR Engineering in the event of an RTR crash, along with the RTR dump file (RTR_<n>.DMP) and the RTR log file. The file rtr_error*.log is also created. If RTR is running with logging enabled, also send the *.log files when reporting an error.
