HP OpenVMS Cluster Systems

C.2.2 General OpenVMS Cluster Satellite-Boot Troubleshooting

If a satellite fails to boot, use the steps outlined in this section to diagnose and correct problems in OpenVMS Cluster systems.

Step Action

1 Verify that the boot device is available. This check is particularly important for clusters in which satellites boot from multiple system disks.

2 Verify that the DECnet network is up and running.

3 Check the cluster group code and password. The cluster group code and password are set using the CLUSTER_CONFIG.COM procedure.

4 Verify that you have installed the correct OpenVMS Integrity server, OpenVMS Alpha, and OpenVMS VAX licenses.

5 Verify system parameter values on each satellite node, as follows:
VAXCLUSTER = 2
NISCS_LOAD_PEA0 = 1
NISCS_USE_UDP = 0 (LAN interconnect) or 1 (IP interconnect)
NISCS_LAN_OVRHD = 0
NISCS_MAX_PKTSZ = 1498 ¹
SCSNODE is the name of the computer.
SCSSYSTEMID is a number that identifies the computer.
VOTES = 0

The SCS parameter values are set differently depending on your system configuration.
Reference: Appendix A describes how to set these SCS parameters.
To check system parameter values on a satellite node that cannot boot, invoke the SYSGEN utility on a running system in the OpenVMS Cluster that has access to the satellite node's local root. (Note that you must invoke the SYSGEN utility from a node that is running the same type of operating system. For example, to troubleshoot an Alpha satellite node, you must run the SYSGEN utility on an Alpha system.) Check system parameters as follows:

Step Action

A Find the local root of the satellite node on the system disk. The following example is from an Alpha system running DECnet for OpenVMS:
$ MCR NCP SHOW NODE HOME CHARACTERISTICS

Node Volatile Characteristics as of 10-JAN-1994 09:32:56
Remote node = 63.333 (HOME)

Hardware address = 08-00-2B-30-96-86
Load file = APB.EXE
Load Assist Agent = SYS$SHARE:NISCS_LAA.EXE
Load Assist Parameter = ALPHA$SYSD:[SYS17.]

The local root in this example is ALPHA$SYSD:[SYS17.].
Reference: Refer to the DECnet-Plus documentation for equivalent information using NCL commands.

B Enter the SHOW LOGICAL command at the system prompt to translate the logical name for ALPHA$SYSD.
$ SHO LOG ALPHA$SYSD
"ALPHA$SYSD" = "$69$DUA121:" (LNM$SYSTEM_TABLE)

C Invoke the SYSGEN utility on the system from which you can access the satellite's local disk. (This example invokes the SYSGEN utility on an Integrity server system or Alpha system using the parameter file IA64VMSSYS.PAR or ALPHAVMSSYS.PAR, respectively.) The following example shows how to enter the SYSGEN command USE with the system parameter file on the local root for the satellite node and then enter the SHOW command to query the parameters in question.
$ MCR SYSGEN

SYSGEN> USE $69$DUA121:[SYS17.SYSEXE]ALPHAVMSSYS.PAR
SYSGEN> SHOW VOTES
Parameter
Name            Current    Default    Min.     Max.    Unit   Dynamic
---------       -------    -------    ----     ----    ----   -------
VOTES                 0          1       0      127    Votes
SYSGEN> EXIT

¹For Ethernet adapters, the value of NISCS_MAX_PKTSZ is 1498. For Gigabit Ethernet and 10 Gb Ethernet adapters, the value is 8192.
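
The parameter check in step 5 can be partly automated once the SYSGEN SHOW output has been captured as text. The following Python sketch is illustrative only: it assumes each data row has the layout shown in step C (parameter name, then the current value), and the names `EXPECTED`, `parse_show_line`, and `check_satellite_params` are hypothetical helpers, not part of any OpenVMS utility. The second sample row is invented to show a mismatch.

```python
# Expected satellite values taken from step 5 of this section.
# NISCS_USE_UDP and NISCS_MAX_PKTSZ depend on the interconnect and adapter,
# so they are deliberately left out of this fixed table.
EXPECTED = {
    "VAXCLUSTER": 2,
    "NISCS_LOAD_PEA0": 1,
    "NISCS_LAN_OVRHD": 0,
    "VOTES": 0,
}


def parse_show_line(line):
    """Return (name, current_value) from one SYSGEN SHOW data row.

    Assumes the row layout shown in step C: the parameter name first,
    followed by the current value as an integer.
    """
    fields = line.split()
    return fields[0], int(fields[1])


def check_satellite_params(show_rows):
    """Return a list of (name, found, expected) for every mismatch."""
    problems = []
    for row in show_rows:
        name, current = parse_show_line(row)
        expected = EXPECTED.get(name)
        if expected is not None and current != expected:
            problems.append((name, current, expected))
    return problems


if __name__ == "__main__":
    rows = [
        "VOTES 0 1 0 127 Votes",            # row copied from the example above
        "VAXCLUSTER 1 1 0 2 Coded-value",   # hypothetical row with a wrong value
    ]
    for name, found, expected in check_satellite_params(rows):
        print(f"{name}: found {found}, expected {expected}")
```

Parameters that are not listed in `EXPECTED` are ignored rather than reported, so the same function can be fed the output of a broader SHOW command.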

C.2.3 MOP Server Troubleshooting

To diagnose and correct problems for MOP servers, follow the steps outlined in this section.

Step	Action
1	Perform the steps outlined in Section C.2.2.
2	Verify the NCP circuit state is on and the service is enabled. Enter the following commands to run the NCP utility and check the NCP circuit state.
$ MCR NCP
NCP> SHOW CIRCUIT ISA-0 CHARACTERISTICS

Circuit Volatile Characteristics as of 12-JAN-1994 10:08:30

Circuit = ISA-0

State = on
Service = enabled
Designated router = 63.1021
Cost = 10
Maximum routers allowed = 33
Router priority = 64
Hello timer = 15
Type = Ethernet
Adjacent node = 63.1021
Listen timer = 45

3	If service is not enabled, you can enter NCP commands like the following to enable it:
NCP> SET CIRCUIT circuit-id STATE OFF
NCP> DEFINE CIRCUIT circuit-id SERVICE ENABLED
NCP> SET CIRCUIT circuit-id SERVICE ENABLED STATE ON

The DEFINE command updates the permanent database and ensures that service is enabled the next time you start the network. Note that DECnet traffic is interrupted while the circuit is off.
4	Verify that the load assist parameter points to the system disk and the system root for the satellite.
5	Verify that the satellite's system disk is mounted on the MOP server node.
6	On Integrity server systems and Alpha systems, verify that the load file is APB.EXE.
7	For MOP booting, the satellite node's parameter file (IA64VMSSYS.PAR for Integrity server systems, ALPHAVMSSYS.PAR for Alpha computers, and VAXVMSSYS.PAR for VAX computers) must be located in the [SYSEXE] directory of the satellite system root.
8	Ensure that the file CLUSTER_AUTHORIZE.DAT is located in the [SYSCOMMON.SYSEXE] directory of the satellite system root.
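
The circuit check in step 2 can be sketched in Python as follows. This is a minimal illustration, assuming the NCP SHOW CIRCUIT display has been captured to text with one `name = value` pair per line; `parse_ncp_display` and `circuit_ready_for_mop` are hypothetical names introduced here and are not NCP features.

```python
def parse_ncp_display(text):
    """Collect 'name = value' pairs from captured NCP output into a dict.

    Lines without an equal sign (for example, the display title) are skipped.
    Field names are lowercased so lookups do not depend on capitalization.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            fields[name.strip().lower()] = value.strip()
    return fields


def circuit_ready_for_mop(text):
    """Return True only if the circuit state is on and service is enabled."""
    fields = parse_ncp_display(text)
    return fields.get("state") == "on" and fields.get("service") == "enabled"


if __name__ == "__main__":
    captured = """Circuit Volatile Characteristics as of 12-JAN-1994 10:08:30
Circuit = ISA-0
State = on
Service = enabled
"""
    print(circuit_ready_for_mop(captured))
```

A result of False corresponds to the situation in step 3, where service must be enabled on the circuit before the MOP server can respond to load requests.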

C.2.4 Disk Server Troubleshooting

To diagnose and correct problems for disk servers, follow the steps outlined in this section.

Step	Action
1	Perform the steps in Section C.2.2.
2	For each satellite node, verify the following system parameter values:
MSCP_LOAD = 1
MSCP_SERVE_ALL = 1
3	The disk servers for the system disk must be connected directly to the disk.

C.2.5 Satellite Booting Troubleshooting

To diagnose and correct problems for satellite booting, follow the steps outlined in this section.

Step Action

1 Perform the steps in Sections C.2.2, C.2.3, and C.2.4.

2 For each satellite node, verify that the VOTES system parameter is set to 0.

3 Verify the DECnet network database on the MOP servers by running the NCP utility and entering the following commands to display node characteristics. The following example displays information about an Alpha node named UTAH:
$ MCR NCP
NCP> SHOW NODE UTAH CHARACTERISTICS
Node Volatile Characteristics as of 15-JAN-1994 10:28:09
Remote node = 63.227 (UTAH)
Hardware address = 08-00-2B-2C-CE-E3
Load file = APB.EXE
Load Assist Agent = SYS$SHARE:NISCS_LAA.EXE
Load Assist Parameter = $69$DUA100:[SYS17.]

The load file must be APB.EXE. In addition, when booting Alpha nodes, for each LAN adapter specified on the boot command line, the load assist parameter must point to the same system disk and root number.

5 Verify the following information in the NCP display:

Step Action

A Verify the DECnet address for the node.

B Verify the load assist agent is SYS$SHARE:NISCS_LAA.EXE.

C Verify the load assist parameter points to the satellite system disk and correct root.

D Verify that the hardware address matches the satellite's Ethernet address. At the satellite's console prompt, use the information shown in Table 8-3 to obtain the satellite's current LAN hardware address.
Compare the hardware address values displayed by NCP and at the satellite's console. The values should be identical and should also match the value shown in the SYS$MANAGER:NETNODE_UPDATE.COM file. If the values do not match, you must make appropriate adjustments. For example, if you have recently replaced the satellite's LAN adapter, you must execute the CLUSTER_CONFIG.COM CHANGE function to update the network database and NETNODE_UPDATE.COM on the appropriate MOP server.

6 Perform a conversational boot to determine more precisely why the satellite is having trouble booting. The conversational boot procedure displays messages that can help you solve network booting problems. The messages provide information about the state of the network and the communications process between the satellite and the system disk server.
Reference: Section C.2.6 describes booting messages for Alpha systems.
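
The comparisons in step 5 (load file, load assist agent, load assist parameter, and hardware address) can be sketched as a small Python check. This is an illustration under stated assumptions: the NCP node display has already been turned into a dictionary keyed by the field names it prints, and the console address may use different separators or letter case. The helper names `normalize_lan_address` and `verify_node_entry` are hypothetical.

```python
HEX_DIGITS = set("0123456789ABCDEF")


def normalize_lan_address(address):
    """Reduce a LAN address to 12 uppercase hex digits for comparison."""
    digits = "".join(ch for ch in address.upper() if ch not in "-:. ")
    if len(digits) != 12 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a 48-bit LAN address: {address!r}")
    return digits


def verify_node_entry(entry, console_address, expected_root):
    """Return a list of problems found in one NCP node entry.

    entry is a dict keyed by the field names shown in the NCP display.
    """
    problems = []
    if entry.get("Load file") != "APB.EXE":
        problems.append("load file is not APB.EXE")
    if entry.get("Load Assist Agent") != "SYS$SHARE:NISCS_LAA.EXE":
        problems.append("load assist agent is not SYS$SHARE:NISCS_LAA.EXE")
    if entry.get("Load Assist Parameter") != expected_root:
        problems.append("load assist parameter does not match the expected root")
    try:
        same = (normalize_lan_address(entry.get("Hardware address", ""))
                == normalize_lan_address(console_address))
    except ValueError:
        # A missing or malformed address is treated as a mismatch.
        same = False
    if not same:
        problems.append("hardware address does not match the console value")
    return problems


if __name__ == "__main__":
    utah = {
        "Hardware address": "08-00-2B-2C-CE-E3",
        "Load file": "APB.EXE",
        "Load Assist Agent": "SYS$SHARE:NISCS_LAA.EXE",
        "Load Assist Parameter": "$69$DUA100:[SYS17.]",
    }
    print(verify_node_entry(utah, "08-00-2B-2C-CE-E3", "$69$DUA100:[SYS17.]"))
```

An empty list means the entry passed all four checks. Any hardware address mismatch still has to be resolved with the CLUSTER_CONFIG.COM CHANGE function as described in step D.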

C.2.6 Alpha Booting Messages (Alpha Only)

On Alpha systems, the messages are displayed as shown in Table C-2.

Table C-2 Alpha Booting Messages (Alpha Only)
Message Comments

%VMScluster-I-MOPSERVER, MOP server for downline load was node UTAH

This message displays the name of the system providing the DECnet MOP downline load. This message acknowledges that control was properly transferred from the console performing the MOP load to the image that was loaded. If this message is not displayed, either the MOP load failed or the wrong file was MOP downline loaded.

%VMScluster-I-BUSONLINE, LAN adapter is now running 08-00-2B-2C-CE-E3

This message displays the LAN address of the Ethernet or FDDI adapter specified in the boot command. Multiple lines can be displayed if multiple LAN devices were specified in the boot command line. The booting satellite can now attempt to locate the system disk by sending a message to the cluster multicast address. If this message is not displayed, the LAN adapter is not initialized properly. Check the physical network connection. For FDDI, the adapter must be on the ring.

%VMScluster-I-VOLUNTEER, System disk service volunteered by node EUROPA AA-00-04-00-4C-FD

This message displays the name of a system claiming to serve the satellite system disk. This system has responded to the multicast message sent by the booting satellite to locate the servers of the system disk. If this message is not displayed, one or more of the following situations may be causing the problem:

The network path between the satellite and the boot server either is broken or is filtering the local area OpenVMS Cluster multicast messages.
The system disk is not being served.
The CLUSTER_AUTHORIZE.DAT file on the system disk does not match the other cluster members.

%VMScluster-I-CREATECH, Creating channel to node EUROPA 08-00-2B-2C-CE-E2 08-00-2B-12-AE-A2

This message displays the LAN address of the local LAN adapter (first address) and of the remote LAN adapter (second address) that form a communications path through the network. These adapters can be used to support a NISCA virtual circuit for booting. Multiple messages can be displayed if either multiple LAN adapters were specified on the boot command line or the system serving the system disk has multiple LAN adapters. If you do not see as many of these messages as you expect, there may be network problems related to the LAN adapters whose addresses are not displayed. Use the Local Area OpenVMS Cluster Network Failure Analysis Program for better troubleshooting (see Section D.5).

%VMScluster-I-OPENVC, Opening virtual circuit to node EUROPA

This message displays the name of a system that has established a NISCA virtual circuit to be used for communications during the boot process. Booting uses this virtual circuit to connect to the remote MSCP server.

%VMScluster-I-MSCPCONN, Connected to a MSCP server for the system disk, node EUROPA

This message displays the name of a system that is actually serving the satellite system disk. If this message is not displayed, the system that claimed to serve the system disk could not serve the disk. Check the OpenVMS Cluster configuration.

%VMScluster-W-SHUTDOWNCH, Shutting down channel to node EUROPA 08-00-2B-2C-CE-E3 08-00-2B-12-AE-A2

This message displays the LAN address of the local LAN adapter (first address) and of the remote LAN adapter (second address) that have just lost communications. Depending on the type of failure, multiple messages may be displayed if either the booting system or the system serving the system disk has multiple LAN adapters.

%VMScluster-W-CLOSEVC, Closing virtual circuit to node EUROPA

This message indicates that NISCA communications with the system whose name is displayed have failed.

%VMScluster-I-RETRY, Attempting to reconnect to a system disk server

This message indicates that an attempt will be made to locate another system serving the system disk. The LAN adapters will be reinitialized and all communications will be restarted.

%VMScluster-W-PROTOCOL_TIMEOUT, NISCA protocol timeout

Either the booting node has lost connections to the remote system or the remote system is no longer responding to requests made by the booting system. In either case, the booting system has declared a failure and will reestablish communications to a boot server.
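
When a conversational boot produces a long console log, it can help to find the first expected message that never appeared, because Table C-2 ties each missing message to a likely cause. The following Python sketch is one way to do that. It assumes the informational messages of a successful boot appear in the order the table lists them, and the names `BOOT_SEQUENCE`, `scan_boot_log`, and `first_missing_message` are hypothetical.

```python
import re

# Informational messages of a successful boot, in the order used by Table C-2.
BOOT_SEQUENCE = [
    "MOPSERVER",
    "BUSONLINE",
    "VOLUNTEER",
    "CREATECH",
    "OPENVC",
    "MSCPCONN",
]

# Matches the facility-severity-identifier prefix, e.g. %VMScluster-I-OPENVC,
MESSAGE_RE = re.compile(r"%VMScluster-([A-Z])-([A-Z_]+),")


def scan_boot_log(console_log):
    """Return a list of (severity, identifier) pairs in order of appearance."""
    return MESSAGE_RE.findall(console_log)


def first_missing_message(console_log):
    """Return the first expected identifier not present in the log, or None."""
    seen = {ident for _, ident in scan_boot_log(console_log)}
    for ident in BOOT_SEQUENCE:
        if ident not in seen:
            return ident
    return None


if __name__ == "__main__":
    log = (
        "%VMScluster-I-MOPSERVER, MOP server for downline load was node UTAH\n"
        "%VMScluster-I-BUSONLINE, LAN adapter is now running 08-00-2B-2C-CE-E3\n"
    )
    print(first_missing_message(log))
```

For the sample log, the function reports that no VOLUNTEER message was seen, which points to the causes listed for that message: a broken or filtered network path, an unserved system disk, or a mismatched CLUSTER_AUTHORIZE.DAT file.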

C.3 Computer Fails to Join the Cluster

If a computer fails to join the cluster, follow the procedures in this section to determine the cause.

C.3.1 Verifying OpenVMS Cluster Software Load

To verify that OpenVMS Cluster software has been loaded, follow these instructions:

Step Action

1 Look for connection manager (%CNXMAN) messages like those shown in Section C.1.2.

2 If no such messages are displayed, OpenVMS Cluster software probably was not loaded at boot time. Reboot the computer in conversational mode. At the SYSBOOT> prompt, set the VAXCLUSTER parameter to 2.

3 For OpenVMS Cluster systems communicating over the LAN or mixed interconnects, set NISCS_LOAD_PEA0 to 1 and VAXCLUSTER to 2. These parameters should also be set in the computer's MODPARAMS.DAT file. (For more information about booting a computer in conversational mode, consult your installation and operations guide.)

4 For OpenVMS Cluster systems on the LAN, verify that the cluster security database file (SYS$COMMON:CLUSTER_AUTHORIZE.DAT) exists and that you have specified the correct group number for this cluster (see Section 10.8.1).
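
Step 3 asks that the same values also be recorded in MODPARAMS.DAT. The following Python sketch shows one way to confirm that from a copy of the file's text. It is deliberately simplified: it assumes plain `NAME = value` assignments with `!` starting a comment, and it does not handle quoted strings or the prefixed forms (such as MIN_ or ADD_) that AUTOGEN also accepts. The names `REQUIRED`, `read_modparams`, and `missing_cluster_settings` are hypothetical.

```python
# Values that step 3 asks for on LAN or mixed-interconnect clusters.
REQUIRED = {"VAXCLUSTER": "2", "NISCS_LOAD_PEA0": "1"}


def read_modparams(text):
    """Return a dict of simple NAME = value assignments from MODPARAMS.DAT text.

    Later assignments override earlier ones, and anything after '!' is
    treated as a comment (a simplification that ignores quoted strings).
    """
    values = {}
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        values[name.strip().upper()] = value.strip()
    return values


def missing_cluster_settings(text):
    """Return the sorted names whose value is absent or differs from REQUIRED."""
    values = read_modparams(text)
    return sorted(name for name, want in REQUIRED.items()
                  if values.get(name) != want)


if __name__ == "__main__":
    sample = "VAXCLUSTER = 2   ! always load cluster software\nVOTES = 0\n"
    print(missing_cluster_settings(sample))
```

Any name the function returns would need to be added or corrected in MODPARAMS.DAT so that the setting survives the next AUTOGEN run.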

C.3.2 Verifying Boot Disk and Root

To verify that the computer has booted from the correct disk and system root, follow these instructions:

Step Action

1 If %CNXMAN messages are displayed, and if, after the conversational reboot, the computer still does not join the cluster, check the console output on all active computers and look for messages indicating that one or more computers found a remote computer that conflicted with a known or local computer. Such messages suggest that two computers have booted from the same system root.

3 If you find it necessary to modify the computer's bootstrap command procedure (console media), you may be able to do so on another processor that is already running in the cluster.
Replace the running processor's console media with the media to be modified, and use the Exchange utility and a text editor to make the required changes. Consult the appropriate processor-specific installation and operations guide for information about examining and editing boot command files.

C.3.3 Verifying SCSNODE and SCSSYSTEMID Parameters

To be eligible to join a cluster, a computer must have unique SCSNODE and SCSSYSTEMID parameter values.

Step Action

1 Check that the current values do not duplicate any values set for existing OpenVMS Cluster computers. To check values, you can perform a conversational bootstrap operation.

2 If the values of SCSNODE or SCSSYSTEMID are not unique, do either of the following:

Alter both values.
Reboot all other computers.

Note: To modify values, you can perform a conversational bootstrap operation. However, for reliable future bootstrap operations, specify appropriate values for these parameters in the computer's MODPARAMS.DAT file.

WHEN you change...	THEN...
The SCSNODE parameter	Change the DECnet node name too, because both names must be the same.
Either the SCSNODE parameter or the SCSSYSTEMID parameter on a node that was previously an OpenVMS Cluster member	Change the DECnet node number, too, because both numbers must be the same. Reboot the entire cluster.
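
The uniqueness requirement can be checked against a list of the values already in use. The Python sketch below is illustrative and rests on two assumptions that this section does not state: node names are compared without regard to case, and SCSSYSTEMID follows the usual DECnet Phase IV convention of (area × 1024) + node number. Confirm both against your own configuration. The names `scssystemid_from_decnet` and `find_duplicates` are hypothetical.

```python
from collections import Counter


def scssystemid_from_decnet(area, node):
    """Compute SCSSYSTEMID from a DECnet Phase IV address (assumed convention)."""
    return area * 1024 + node


def find_duplicates(nodes):
    """Return (duplicate_names, duplicate_ids) for (SCSNODE, SCSSYSTEMID) pairs."""
    nodes = list(nodes)
    name_counts = Counter(name.upper() for name, _ in nodes)
    id_counts = Counter(sysid for _, sysid in nodes)
    dup_names = sorted(name for name, count in name_counts.items() if count > 1)
    dup_ids = sorted(sysid for sysid, count in id_counts.items() if count > 1)
    return dup_names, dup_ids


if __name__ == "__main__":
    members = [
        ("UTAH", scssystemid_from_decnet(63, 227)),
        ("HOME", scssystemid_from_decnet(63, 333)),
        ("UTAH", scssystemid_from_decnet(63, 400)),  # hypothetical conflicting node
    ]
    print(find_duplicates(members))
```

The DECnet addresses 63.227 and 63.333 come from the NCP examples earlier in this appendix; the third entry is invented to show how a duplicate name is reported.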

C.3.4 Verifying Cluster Security Information

To verify the cluster group code and password, follow these instructions:

Step Action

1 Verify that the database file SYS$COMMON:CLUSTER_AUTHORIZE.DAT exists.

2 For clusters with multiple system disks, ensure that the correct (same) group number and password were specified for each.
Reference: See Section 10.8 to view the group number and to reset the password in the CLUSTER_AUTHORIZE.DAT file using the SYSMAN utility.

C.4 Startup Procedures Fail to Complete

If a computer boots and joins the cluster but appears to hang before startup procedures complete (that is, before you are able to log in to the system), be sure that you have allowed sufficient time for the startup procedures to execute.

IF...	THEN...
The startup procedures fail to complete after a period that is normal for your site.	Try to access the procedures from another OpenVMS Cluster computer and make appropriate adjustments. For example, verify that all required devices are configured and available. One cause of such a failure could be the lack of some system resource, such as NPAGEDYN or page file space.
You suspect that the value for the NPAGEDYN parameter is set too low.	Perform a conversational bootstrap operation to increase it. Use SYSBOOT to check the current value, and then double the value.
You suspect a shortage of page file space, and another OpenVMS Cluster computer is available.	Log in on that computer and use the System Generation utility (SYSGEN) to provide adequate page file space for the problem computer. Note: Insufficient page-file space on the booting computer might cause other computers to hang.
The computer still cannot complete the startup procedures.	Contact your HP support representative.
