HP OpenVMS Systems Documentation

OpenVMS Performance Management

Chapter 9
Evaluating the CPU Resource

The CPU is the central resource in your system and it is the most costly to augment. Good CPU performance is vital to that of the system as a whole, because the CPU performs the two most basic system functions: it allocates and initiates the demand for all the other resources, and it provides instruction execution service to user processes.

This chapter discusses the following topics:

Evaluating CPU responsiveness
Improving CPU responsiveness

9.1 Evaluating CPU Responsiveness

Only one process can execute on a CPU at a time, so the CPU resource must be shared sequentially. Because several processes can be ready to use the CPU at any given time, the system maintains a queue of processes waiting for the CPU.

These processes are in the compute (COM) or compute outswapped (COMO) scheduling states.

9.1.1 Quantum

The system allocates the CPU resource for a period of time known as a quantum to each process that is not waiting for other resources.

During its quantum, a process can execute until any of the following events occurs:

The process is preempted by a higher priority process.
The process voluntarily yields the CPU by requesting a wait state for some purpose (for example, to wait for the completion of a user I/O request).
The process enters an involuntary wait state, such as when it triggers a hard page fault (one that must be satisfied by reading from disk).

9.1.2 CPU Response Time

A good measure of the CPU response is the average number of processes in the COM and COMO states over time---that is, the average length of the compute queue.

If the number of processes in the compute queue is close to zero, unblocked processes will rarely need to wait for the CPU.

Several factors affect how long any given process must wait to be granted its quantum of CPU time:

Interrupt state
Computing requirements of the processes in the compute queue
CPU type
Scheduling priority

The worst-case scenario involves a large compute queue of compute-bound processes. Each compute-bound process can retain the CPU for the entire quantum period.

Compute-Bound Processes

Assuming no interrupt time and a default quantum of 200 milliseconds, each process in a group of five compute-bound processes of the same priority (one in CUR state and the others in COM state) acquires the CPU once every second (5 × 200 milliseconds).

As the number of such processes increases, there is a proportional increase in the waiting time.
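
The arithmetic behind this worst case can be sketched in a few lines of Python. This is an illustration only, not part of OpenVMS: the function names are hypothetical, and the model assumes a single CPU, no interrupt time, equal priorities, and processes that always consume their full quantum.

```python
def worst_case_cycle_ms(num_processes: int, quantum_ms: int = 200) -> int:
    """Time for the CPU to cycle once through all compute-bound processes.

    Assumption: one CPU, no interrupt time, equal priorities, and every
    process uses its whole quantum before giving up the CPU.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return num_processes * quantum_ms


def worst_case_wait_ms(num_processes: int, quantum_ms: int = 200) -> int:
    """Time one process spends in COM between two of its own quanta."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return (num_processes - 1) * quantum_ms


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, worst_case_cycle_ms(n), worst_case_wait_ms(n))
```

With five processes the cycle is 1000 milliseconds, matching the once-per-second figure; doubling the number of processes doubles the cycle time.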

If the processes are not compute bound, they can relinquish the CPU before having consumed their quantum period, thus reducing waiting time for the CPU.

Because of MONITOR's sampling nature, the utility rarely detects processes that remain only briefly in the COM state. Thus, if MONITOR shows COM processes, you can assume they are the compute-bound type.

9.1.3 Determining Optimal Queue Length

The best way to determine a reasonable length for the compute queue at your site is to note its length during periods when all the system resources are performing adequately and when users perceive response time to be satisfactory.

Then, watch for deviations from this value and try to develop a sense for acceptable ranges.
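
One possible way to turn such observations into an acceptable range is sketched below in Python. The sample values are invented, the function names are hypothetical, and the rule of "mean plus two standard deviations" is an assumption made for illustration; the right range for your site is a matter of judgment.

```python
from statistics import mean, pstdev


def upper_bound(baseline_samples, k=2.0):
    """Upper edge of the acceptable range: mean + k standard deviations.

    Assumption: k = 2.0 is an arbitrary illustrative choice, not a
    recommendation from the OpenVMS documentation.
    """
    return mean(baseline_samples) + k * pstdev(baseline_samples)


def exceeds_baseline(queue_length, baseline_samples, k=2.0):
    """True if an observed compute queue length is above the range."""
    return queue_length > upper_bound(baseline_samples, k)


if __name__ == "__main__":
    # Hypothetical COM + COMO averages noted while response time was good.
    good_period = [1.0, 2.0, 1.5, 2.5, 2.0, 1.0]
    print(exceeds_baseline(2.0, good_period))
    print(exceeds_baseline(6.0, good_period))
```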

9.1.4 Estimating Available CPU Capacity

To estimate available CPU capacity, observe the average amount of idle time and the average number of processes in the various scheduling wait states.

While idle time is a measure of the percentage of unused CPU time, the wait states indicate the reasons that the CPU was idle and might point to utilization problems with other resources.

Overcommitted Resources

Before using idle time to estimate growth potential or as an aid to balancing the CPU resource among processes in an OpenVMS Cluster, ensure that the other resources are not overcommitted, thereby causing the CPU to be underutilized.

Scheduling Wait States

Whenever a process enters a scheduling wait state---a state other than CUR (process currently using the CPU) and COM---it is said to be blocked from using the CPU.

Most of the time, a process enters a wait state as part of the normal synchronization that takes place between the CPU and the other resources.

But certain wait states can indicate problems with those other resources that could block viable processes from using the CPU.

MONITOR data on the scheduling wait states provides clues about potential problems with the memory and disk I/O resources.

9.1.5 Types of Scheduling Wait States

There are two types of scheduling wait states---voluntary and involuntary. Processes enter voluntary wait states directly; they are placed in involuntary wait states by the system.

9.1.5.1 Voluntary Wait States

Processes in the local event flag wait (LEF) state are said to be voluntarily blocked from using the CPU; that is, they are temporarily requesting to wait before continuing with CPU service. Since the LEF state can indicate conditions ranging from normal waiting for terminal command input to waiting for I/O completion or locks, you can obtain no useful information about potentially harmful blockage simply by observing the number of processes in that state. You can usually assume, though, that most of them are waiting for terminal command input (at the DCL prompt).

Disk I/O Completion

Some processes might enter the LEF state because they are awaiting I/O completion on a disk or other peripheral device. If the I/O subsystem is not overloaded, this type of waiting is temporary and inconsequential. If, on the other hand, the I/O resource, particularly disk I/O, is approaching capacity, it could be causing the CPU to be seriously underutilized.

Long disk response times are the clue that certain processes are in the LEF state because they are experiencing long delays in acquiring disk service. If your system exhibits unusually long disk response times, refer to Section 7.2.1 and try to correct that problem before attempting to improve CPU responsiveness.

Waiting for a Lock

Other processes in the LEF state might be waiting for a lock to be granted. This situation can arise in environments where extensive file sharing is the norm---particularly in OpenVMS Clusters. Check the ENQs Forced to Wait Rate. (This is the rate of $ENQ lock requests forced to wait before the lock was granted.) Since the statistic gives no indication of the duration of lock waits, it does not directly show how long processes wait for locks. A value significantly larger than your system's normal value, however, can indicate that users will start to notice delays.

On large SMP systems, it might improve performance to give one CPU all lock manager work. If you have a high CPU count and a large amount of time is spent synchronizing multiple CPUs, consider implementing a dedicated lock manager as described in Section 13.2.

If you suspect...	Then...
The lock waiting is caused by file sharing ¹	Attempt to reduce the level of sharing.
The lock waiting results from user or third-party application locks	Attempt to influence the redesign of such applications.
A high amount of locking activity in an SMP environment	Assign a CPU to perform dedicated lock management.

¹RMS and the XQP use locks to synchronize record and file access.

Process Synchronization

Processes can also enter the LEF state or the other voluntary wait states (common event flag wait [CEF], hibernate [HIB], and suspended [SUSP]) when system services are used to synchronize applications. Such processes have temporarily abdicated use of the CPU; they do not indicate problems with other resources.

9.1.5.2 Involuntary Wait States

Involuntary wait states are not requested by processes but are invoked by the system to achieve process synchronization in certain circumstances:

The free page wait (FPG), page fault wait (PFW), and collided page wait (COLPG) states are associated with memory management and are discussed in Section 7.2.1.
The Mutex wait state (indicated by the state keyword MUTEX in the MONITOR PROCESSES display) is a temporary wait state and is not discussed here.
The miscellaneous resource wait (MWAIT) state is discussed in the following section.

MWAIT State

The presence of processes in the MWAIT state indicates that there might be a shortage of a systemwide resource (usually page or swapping file capacity) and that the shortage is blocking these processes from the CPU.

If you see processes in this state, do the following:

Check the type of resource wait by examining the MONITOR PROCESSES data available in the collected recording files.
Check the resource wait states by playing back the data files and examining each PROCESSES display. Note that a standard summary report contains only the last PROCESSES display and the multifile summary report contains no PROCESSES data.
Issue a MONITOR command like the following:
$ MONITOR /INPUT=SYS$MONITOR:file-spec /VIEWING_TIME=1 PROCESSES
This command will display all the PROCESSES data available in the input file.
Look for RWxxx scheduling states, where xxx is a three-character code indicating the depleted resource for which the process is waiting. (The codes are listed in the OpenVMS System Management Utilities Reference Manual under the description of the STATES class in the MONITOR section.)

Types of Resource Wait States

The most common types of resource waits are those signifying depletion of the page and swapping files as shown in the following table:

State	Description
RWSWP	Indicates a swapping file of deficient size.
RWMBP, RWMPE, RWPGF	Indicates a paging file that is too small.
RWAST	Indicates that the process is waiting for a resource whose availability will be signaled by delivery of an asynchronous system trap (AST). In most instances, either an I/O operation is outstanding (incomplete), or a process quota has been exhausted.
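
The table above can be expressed as a simple lookup, sketched here in Python. The function name and the sample state strings are hypothetical; the states are assumed to have been copied by hand from a PROCESSES display, and the sketch makes no attempt to parse MONITOR output.

```python
# Descriptions taken from the table of common resource wait states.
RESOURCE_WAITS = {
    "RWSWP": "swapping file of deficient size",
    "RWMBP": "paging file that is too small",
    "RWMPE": "paging file that is too small",
    "RWPGF": "paging file that is too small",
    "RWAST": "waiting for a resource signaled by AST delivery "
             "(outstanding I/O or exhausted process quota)",
}


def explain_state(state):
    """Return a description for an RWxxx state, or None for other states."""
    code = state.strip().upper()
    if code in RESOURCE_WAITS:
        return RESOURCE_WAITS[code]
    if code.startswith("RW"):
        # Not in the table above; the full list is in the MONITOR
        # documentation for the STATES class.
        return "resource wait not listed in this table"
    return None


if __name__ == "__main__":
    for s in ("LEF", "RWSWP", "rwast", "HIB"):
        print(s, "->", explain_state(s))
```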

You can determine paging and swapping file sizes and the amount of available space they contain by entering the SHOW MEMORY/FILES/FULL command.

The AUTOGEN feedback report provides detailed information about paging and swapping file use. AUTOGEN uses the data in the feedback report to resize or to recommend resizing the paging and swapping files.

9.2 Detecting CPU Limitations

The surest way to determine whether a CPU limitation could be degrading performance is to check for a state queue with the MONITOR STATES command. See Figure A-16. If any processes appear to be in the COM or COMO state, a CPU limitation may be at work. However, if no processes are in the COM or COMO state, you need not investigate the CPU limitation any further.

If processes are in the COM or COMO state, they are being denied access to the CPU. One or more of the following conditions is occurring:

Processes are blocked by the execution of another process at higher priority.
Processes are time slicing with other processes at the same priority.
Processes are blocked by excessive activity in interrupt state.
Processes are blocked by some other resource. (Note that this last possibility means the limitation is not a CPU limitation but is instead a memory or I/O limitation.)

9.2.1 Higher Priority Blocking Processes

If you suspect the system is performing suboptimally because processes are blocked by a process running at higher priority, do the following:

Gain access to an account that is already running.
Ensure you have the ALTPRI privilege.
Set your priority to 15 with the DCL command SET PROCESS/PRIORITY=15.
Enter the DCL command MONITOR PROCESSES/TOPCPU to check for a high-priority lockout.
Enter the DCL command SHOW PROCESS/CONTINUOUS to examine the current and base priorities of those processes that you found were top users of the CPU resource. You can now conclude whether any process is responsible for blocking lower priority processes.
Restore the priority of the process you used for the investigation. Otherwise, you may find that process causes its own system performance problem.

If you find that this condition exists, your option is to adjust the process priorities. See Section 13.3 for a discussion of how to change the process priorities assigned in the UAF, define priorities in the login command procedure, or change the priorities of processes while they execute.

9.2.2 Time Slicing Between Processes

Once you rule out the possibility of preemption by higher priority processes, you need to determine if there is a serious problem with time slicing between processes at the same priority. Using the list of top CPU users, compare the priorities and assess how many processes are operating at the same priority. Refer to Section 13.3 if you conclude that the priorities are inappropriate.

However, if you decide that the priorities are correct and will not benefit from such adjustments, you are confronted with a situation that will not respond to any form of system tuning. The only appropriate solution here is to adjust the work load to decrease the demand or add CPU capacity (see Section 13.7).

9.2.3 Excessive Interrupt State Activity

If you discover that blocking is not due to contention with other processes at the same or higher priorities, you need to find out if there is too much activity in interrupt state. In other words, is the rate of interrupts so excessive that it is preventing processes from using the CPU?

You can determine how much time is spent in interrupt state from the MONITOR MODES display. A percentage of time in interrupt state less than 10 percent is moderate; 20 percent or more is excessive. (The higher the percentage, the more effort you should dedicate to solving this resource drain.)
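
These thresholds can be written as a small classification, sketched here in Python. The function name is hypothetical, and the label for the band between 10 and 20 percent is an assumption, because the text above names only the moderate and excessive ranges.

```python
def classify_interrupt_time(percent):
    """Classify the interrupt state percentage from MONITOR MODES."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent < 10:
        return "moderate"
    if percent >= 20:
        return "excessive"
    # Assumption: the source does not name the 10 to 20 percent band.
    return "elevated"


if __name__ == "__main__":
    for p in (4, 15, 27):
        print(p, classify_interrupt_time(p))
```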

If the interrupt time is excessive, you need to explore which devices cause significant numbers of interrupts on your system and how you might reduce the interrupt rate.

The decisions you make will depend on the source of heavy interrupts. Perhaps they are due to communications devices or special hardware used in real-time applications. Whatever the source, you need to find ways to reduce the number of interrupts so that the CPU can handle work from other processes. Otherwise, the solution may require you to adjust the work load or acquire CPU capacity (see Section 13.7).

9.2.4 Disguised Memory Limitation

Once you have either ruled out or resolved a CPU limitation, you need to determine which other resource limitation produces the block. Your next check should be for the amount of idle time, using the MONITOR MODES command (see Figure A-17). If there is any idle time, another resource is the problem and you may be able to tune for a solution. If you reexamine the MONITOR STATES display, you will likely observe a number of processes in the COMO state. You can conclude that this condition reflects a memory limitation, not a CPU limitation. Follow the procedures described in Chapter 7 to find the cause of the blockage, and then take the corrective action recommended in Chapter 10.

9.2.5 Operating System Overhead

If the MONITOR MODES display indicates that there is no idle time, your CPU is 100 percent busy. You will find that processes are in the COM state on the MONITOR STATES display. You must answer one more question. Is the CPU being used for real work or for nonessential operating system functions? If there is operating system overhead, you may be able to reduce it.

Analyze the MONITOR MODES display carefully. If your system exhibits excessive kernel mode activity, it is possible that the operating system is incurring overhead in the areas of memory management, I/O handling, or scheduling. Investigate the memory limitation and I/O limitation (Chapters 7 and 8), if you have not already done so.

Once you rule out the possibility of improving memory management or I/O handling, the problem of excessive kernel mode activity might be due to scheduling overhead. However, you can do practically nothing to tune the scheduling function. There is only one case that might respond to tuning. The clock-based rescheduling that can occur at quantum end is costlier than the typical rescheduling that is event driven by process state. Explore whether the value of the system parameter QUANTUM is too low and can be increased to bring about a performance improvement by reducing the frequency of this clock-based rescheduling (see Section 13.4). If not, your only other recourse is to adjust the work load or acquire CPU capacity (see Section 13.7).

9.2.6 RMS Misused

If the MONITOR MODES display indicates that a great deal of time is spent in executive mode, it is possible that RMS is being misused. If you suspect this problem, proceed to the steps described in Section 8.3.3 for RMS-induced I/O limitations, making any changes that seem indicated. You should also consult the Guide to OpenVMS File Applications.

9.2.7 CPU at Full Capacity

If at this point in your investigation the MONITOR MODES display indicates that most of the time is spent in supervisor mode or user mode, you are confronted with a situation where the CPU is performing real work and the demand exceeds the capacity. You must either make adjustments in the work load to reduce demand (by more efficient coding of applications, for example) or you must add CPU capacity (see Section 13.7).
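
The order of checks in Sections 9.2.1 through 9.2.7 can be summarized as a decision procedure, sketched here in Python. Everything about the sketch is illustrative: the class, field, and function names are hypothetical, the values are assumed to be read by hand from the MONITOR STATES and MONITOR MODES displays, and the cutoffs for "excessive" kernel mode and "a great deal" of executive mode time are placeholder assumptions, since this chapter gives no numeric values for them.

```python
from dataclasses import dataclass


@dataclass
class CpuSnapshot:
    com_como_processes: float         # MONITOR STATES: COM + COMO count
    blocked_by_higher_priority: bool  # outcome of the Section 9.2.1 check
    same_priority_time_slicing: bool  # outcome of the Section 9.2.2 check
    interrupt_pct: float              # MONITOR MODES: interrupt state
    idle_pct: float                   # MONITOR MODES: idle time
    kernel_pct: float                 # MONITOR MODES: kernel mode
    executive_pct: float              # MONITOR MODES: executive mode


def diagnose(s, kernel_cutoff=40.0, executive_cutoff=30.0):
    """Walk the checks in the order the sections present them.

    Assumption: kernel_cutoff and executive_cutoff are made-up defaults.
    """
    if s.com_como_processes <= 0:
        return "no CPU limitation"
    if s.blocked_by_higher_priority:
        return "adjust process priorities (Section 13.3)"
    if s.same_priority_time_slicing:
        return "review priorities (Section 13.3) or reduce demand (Section 13.7)"
    if s.interrupt_pct >= 20:
        return "reduce the interrupt rate"
    if s.idle_pct > 0:
        return "disguised memory limitation (Chapters 7 and 10)"
    if s.kernel_pct >= kernel_cutoff:
        return "operating system overhead (Chapters 7 and 8, Section 13.4)"
    if s.executive_pct >= executive_cutoff:
        return "possible RMS misuse (Section 8.3.3)"
    return "CPU at full capacity (Section 13.7)"


if __name__ == "__main__":
    busy = CpuSnapshot(6, False, False, 5, 0, 10, 5)
    print(diagnose(busy))
```

A snapshot with a compute queue, no priority problems, little interrupt time, no idle time, and most time in user mode falls through every check and is reported as a CPU at full capacity.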

9.3 MONITOR Statistics for the CPU Resource

Use the following MONITOR commands to obtain the appropriate statistic:

Command	Statistic
Compute Queue
STATES	Number of processes in compute (COM) and compute outswapped (COMO) scheduling states
Estimating CPU Capacity
STATES	All items
MODES	Idle time
Voluntary Wait States
STATES	Number of processes in local event flag wait (LEF), common event flag wait (CEF), hibernate (HIB), and suspended (SUSP) states
LOCK	ENQs Forced to Wait Rate
MODES	MP Synchronization
Involuntary Wait States
STATES	Number of processes in miscellaneous resource wait (MWAIT) state
PROCESSES	Types of resource waits (RWxxx)
Reducing CPU Consumption
MODES	All items
Interrupt State
IO	Direct I/O Rate, Buffered I/O Rate, Page Read I/O Rate, Page Write I/O Rate
DLOCK	All items
SCS	All items
MP Synchronization Mode
MODES	MP Synchronization
IO	Direct I/O Rate, Buffered I/O Rate
DLOCK	All items
PAGE	All items
DISK	Operation Rate
Kernel Mode
MODES	Kernel mode
IO	Page Fault Rate, Inswap Rate, Logical Name Translation Rate
LOCK	New ENQ Rate, Converted ENQ Rate, DEQ Rate
FCB	All items
PAGE	Demand Zero Fault Rate, Global Valid Fault Rate, Page Read I/O
DECNET	Sum of packet rates
CPU Load Balancing
MODES	Time spent by processors in each mode

See Table B-1 for a summary of MONITOR data items.
