HP OpenVMS Systems Documentation
OpenVMS Performance Management
8.2.1 Disk I/O Operation Rate

Disk statistics are provided in the MONITOR DISK class for mounted disks only. I/O Operation Rate is the rate of I/O operations completed on each mounted disk. It includes system I/O (paging, swapping, XQP) and user I/O. Although operation rates are influenced by the hardware components of each disk and channel and depend upon transfer size, a general rule of thumb for operations of the size typically seen on timesharing systems can be stated for older VAX systems: for most disks, an I/O rate of less than 8 per second represents a light load, 15 per second is moderate, and a disk with an operation rate of 25 or more per second is heavily loaded. For newer systems with modern SCSI controllers and disks, a light load for most disks is 20 per second; 40 per second is moderate, and 80 per second is heavy. These figures are independent of the host CPU configuration.
Though these numbers may seem slow compared to the raw capacity of the
device as specified in product literature, in the real world, even with
smart controllers, the effective capacity of disks in I/Os per second
is markedly reduced by nonsequential, odd-sized, and large I/Os.
The I/O Request Queue Length item is the average number of I/O requests outstanding at any time during the measurement period, including those being serviced and those waiting for service. For example, a queue length of 1.0 indicates that, on the average, every I/O request had to wait for a previous I/O request to complete.
As useful as these two measurements are in assessing disk performance, an even better measure is the average response time in milliseconds. For each disk, it can be estimated from these two items by using the following formula:

     average response time (milliseconds) = (average queue length / I/O operation rate) x 1000
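For example, a disk showing an average queue length of 0.5 and an operation rate of 20 I/Os per second would have an estimated average response time of (0.5 / 20) x 1000 = 25 milliseconds. (These figures are purely illustrative, not measured values.)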
Average disk response time is an important statistic because it gives you a means of ranking the relative performance of your disks with respect to each other and of comparing their observed performance with typical values. To establish benchmark values, evaluate your disks as a whole when there is little or no contention. Consider the latency of your I/O controllers as well. Situations that might decrease response time include:
Because a certain amount of disk contention is expected in a timesharing environment, response times can be expected to be longer than the achievable values. The response time measurement is especially useful because it indicates the perceived delay from the norm, independent of whether the delay was caused by seek-intensive or data-transfer-intensive operations. Disks with response time calculations significantly larger than achievable values are good candidates for improvement, as discussed later. However, it is worth checking their levels of activity before proceeding with any further analysis. The response time figure says nothing about how often the disk has been used during the measurement period. Improving disks that show a high response time but are used very infrequently may not noticeably improve overall system performance.

In most environments, a disk with a sustained queue length greater than 0.20 can be considered moderately busy and worthy of further analysis. You should try to determine whether activity on disks that show excessive response times, and that are at least moderately busy, is primarily seek intensive or data-transfer intensive. Disks exhibiting moderate-to-high operation rates are most likely seek intensive, whereas those with low operation rates and large queue lengths (greater than 0.50) tend to be data-transfer intensive. (An exception is a seek-intensive disk that is blocked by data transfer from another disk on the same channel; it can have a low operation rate and a large queue length but not itself be data-transfer intensive.)

If a problem still exists after attempting to improve disk performance using the means discussed in Section 12.1, consider upgrading your hardware resources. An upgrade to address seek-intensive disk problems usually centers on the addition of one or more spindles (disk drives), whereas data transfer problems are usually addressed with the addition of one or more data channels.
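To collect the operation-rate and queue-length data on which this analysis rests, you can record the MONITOR DISK class over a representative period and then play the recording back as a summary. The following DCL sketch shows one way to do this; the interval and file names are illustrative only:

     $ ! Record the DISK class to a file over a representative period
     $ MONITOR DISK/INTERVAL=10/NODISPLAY/RECORD=DISKSTATS.DAT
     $ ! ... press Ctrl/C when the period of interest has elapsed ...
     $ ! Play back the recording, summarizing each item over the period
     $ MONITOR/INPUT=DISKSTATS.DAT/NODISPLAY/SUMMARY=OPRATE.SUM DISK/ITEM=OPERATION_RATE
     $ MONITOR/INPUT=DISKSTATS.DAT/NODISPLAY/SUMMARY=QUELEN.SUM DISK/ITEM=QUEUE_LENGTH

The /SUMMARY output gives the average of each item over the recorded period, which is what the response time estimate requires.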
8.2.3 Disk I/O Statistics for MSCP Served Disks

In OpenVMS Cluster configurations, the MSCP server software is used to make locally attached and HSC disks available to other nodes. A node has remote access to a disk when it accesses the disk through another node using the MSCP server. A node has direct access when it directly accesses a locally attached or HSC disk.
In the MONITOR MSCP display, an "R" following the device name
indicates that the displayed statistics represent I/O operations
requested by nodes using remote access. If an "R" does not
appear after the device name, the displayed statistics represent I/O
operations issued by nodes using direct access. Such I/O operations can
include those issued by the MSCP server on behalf of remote requests.
Direct I/O problems for disks or tapes reveal themselves in long delay times for I/O completions. The easiest way to confirm a direct I/O problem is to detect a particular device with a queue of pending requests. A queue indicates contention for a device or controller. For disks, the MONITOR command MONITOR DISK/ITEM=QUEUE_LENGTH provides this information. Because direct I/O refers to direct memory access (DMA) transfers that require relatively little CPU intervention, the performance degradation implies one or both of the following device-related conditions:

- The device itself is not fast enough to keep up with the aggregate demand placed on it.
- Demand on the channel or controller serving the device is high enough that some I/O requests are blocked.
For ODS-1 performance information, see Appendix D.
For a disk or tape I/O limitation that degrades performance, the only
relatively low-cost solution available through software tuning is to
use memory to increase the sizes of the caches and buffers used in
processing the I/O operations, thereby decreasing the number of device
accesses. The other possible solutions involve purchasing additional
hardware, which is much more costly.
When you enter the MONITOR IO command and observe evidence of direct I/O, you will probably be able to determine whether the rate is normal for your site. A direct I/O rate for the entire system that is either higher or lower than what you consider normal warrants investigation. See Figures A-12 and A-13. You should proceed in this section only if you deem the operation rates of disk or tape devices to be significant among the possible sources of direct I/O on your system. If necessary, rule out any other possible devices as the primary source of the direct I/O with the lexical function F$GETDVI.
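For example, the following DCL fragment estimates a single device's operation rate from two samples of its cumulative operation count; the device name DKA100 and the one-minute sampling window are placeholders for your own values:

     $ ! Sample the cumulative operation count twice, one minute apart
     $ dev = "DKA100:"
     $ count1 = F$GETDVI(dev,"OPCNT")
     $ WAIT 00:01:00
     $ count2 = F$GETDVI(dev,"OPCNT")
     $ ! DCL arithmetic is integer, so the result is truncated
     $ rate = (count2 - count1) / 60
     $ WRITE SYS$OUTPUT "''dev' operations per second: ''rate'"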
Compare the I/O rates derived in this manner or observed on the display
produced by the MONITOR DISK command with the rated capacity of the
device. (If you do not know the rated capacity, you should find it in
literature published for the device, such as a peripherals handbook or
a marketing specifications sheet.)
An abnormally high direct I/O rate for any device, in conjunction with degraded system performance, suggests that I/O demand for that device exceeds its capacity.

First, you need to find out where the I/O operations are occurring. Enter the MONITOR PROCESSES/TOPDIO command. From this display, you can determine which processes are heavy users of I/O and, in particular, which processes are succeeding in completing their I/O operations (not which processes are waiting).

Next, you must determine which of the devices used by the heaviest users of the direct I/O resource also have the highest operations counts, so that you can identify the bottleneck area. Here, you must know your work load well enough to know which devices the various processes use. If these devices are among the ones you found queued up, you have found the bottleneck points.

Once you have identified the saturated device, you need to determine the types of I/O activities it experiences. Perhaps some of them are being mishandled and could be corrected or adjusted. Possibilities are file system caching, RMS buffering, use of explicit QIOs in user programs, and paging or swapping. After you eliminate these possibilities, you may conclude that the device is simply unable to handle the load.
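At the DCL level, the sequence of commands is simply the following; interpreting the resulting displays against your knowledge of the work load is the essential step:

     $ ! Identify the processes that are the heaviest direct I/O users
     $ MONITOR PROCESSES/TOPDIO
     $ ! Identify the mounted disks with the highest operation rates and queues
     $ MONITOR DISK/ITEM=OPERATION_RATE
     $ MONITOR DISK/ITEM=QUEUE_LENGTH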
File System Caching Is Suboptimal
To evaluate the effectiveness of caching, observe the display produced by the MONITOR FILE_SYSTEM_CACHE command. If cache hits are 70 percent or greater, caching activity is normal. A lower percentage, combined with a large number of attempts, indicates that caching is less than optimally effective.

You should be certain that your applications are designed to minimize the opening and closing of files. You should also verify that the file allocation and extent sizes are appropriate. Use the DCL command DIRECTORY/SIZE=ALL to display the space used by the files and the space allocated to them. If the proportion of space used to space allocated seems close to 90 percent, no changes are necessary. However, significantly lower utilization should prompt you to set more accurate values, either explicitly or by changing the defaults, particularly on critical files. You use the RMS_EXTEND_SIZE system parameter to define the default file extents on a systemwide basis. The DCL command SET RMS_DEFAULT/EXTEND_QUANTITY permits you to define file extents on a per-process basis (or on a systemwide basis if you also specify the /SYSTEM qualifier). For more information, see the Guide to OpenVMS File Applications. If these are standard practices at your site, see Section 12.5 for a discussion of how to adjust the following ACP system parameters: ACP_HDRCACHE, ACP_MAPCACHE, and ACP_DIRCACHE.

Misuse of RMS can also cause direct I/O limitations. If users are blocked on the disks because of multiblock counts that are unnecessarily large, instruct the users to reduce the size of their disk transfers by lowering the multiblock count with the DCL command SET RMS_DEFAULT/BLOCK_COUNT. See Section 12.4 for a discussion of how to improve RMS caching.
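The following DCL commands illustrate the checks and adjustments just described; the file specification and the extent and multiblock values are illustrative, not recommendations:

     $ ! Check file system cache hit percentages
     $ MONITOR FILE_SYSTEM_CACHE
     $ ! Compare space used with space allocated for a set of files
     $ DIRECTORY/SIZE=ALL [USER.DATA]*.DAT
     $ ! Set a larger default extent quantity for this process
     $ SET RMS_DEFAULT/EXTEND_QUANTITY=200
     $ ! Reduce the multiblock count for this process's sequential disk transfers
     $ SET RMS_DEFAULT/BLOCK_COUNT=16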
If this course is partially effective but the problem is widespread,
you could decide to take action on a systemwide basis. You can alter
one or more of the system parameters in the RMS_DFMB group with
AUTOGEN, or you can include the appropriate SET RMS_DEFAULT command in
the systemwide login command procedure. See the Guide to OpenVMS File Applications.
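For example, a command like the following, placed in the systemwide login command procedure (normally SYS$MANAGER:SYLOGIN.COM), sets process defaults for every user at login; the values shown are placeholders:

     $ SET RMS_DEFAULT/BLOCK_COUNT=16/BUFFER_COUNT=4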
If you do not detect processes running programs with explicit user-written QIOs, you should suspect that the operating system is generating disk activity due to paging or swapping activity, or both. The paging or swapping may be quite appropriate and not introduce any memory management problem. However, some aspect of the configuration is allowing this paging or swapping activity to block other I/O activity, introducing an I/O limitation. Enter the MONITOR IO command to inspect the Page Read I/O Rate and Page Write I/O Rate (for paging activity) and the Inswap Rate (for swapping activity). Note that because system I/O activity to the disk is not reflected in the direct I/O count MONITOR provides, MONITOR IO is the correct tool to use here.
If you find indications of substantial paging or swapping (or both) at
this point in the investigation, consider whether the paging and
swapping files are located on the best choice of device, controller, or
bus in the configuration. Also consider whether introducing secondary
files and separating the files would be beneficial. A later section
discusses relocating the files to bring about performance improvements.
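As a sketch of what introducing a secondary page file involves, the following SYSGEN commands create and install one on a second disk; the device, directory, and size are hypothetical, and to have the file installed at each reboot you would also add the INSTALL command to SYS$MANAGER:SYPAGSWPFILES.COM:

     $ RUN SYS$SYSTEM:SYSGEN
     SYSGEN> CREATE DKA200:[PAGEFILES]PAGEFILE1.SYS /SIZE=100000
     SYSGEN> INSTALL DKA200:[PAGEFILES]PAGEFILE1.SYS /PAGEFILE
     SYSGEN> EXIT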
The only low-cost solutions that remain require reductions in demand. You can try to shift the work load so that less demand is placed simultaneously on the direct I/O devices. Alternatively, you might reconfigure the magnetic tapes and disks on separate buses to reduce demand on the bus. (If there are no other buses configured on the system, you may want to acquire them so that you can take this action.)
If none of the above solutions improved performance, you may need to
add capacity. You probably need to acquire disks with higher transfer
rates rather than simply adding more disks. However, if you have been
employing magnetic tapes extensively, you may want to investigate ways
of shifting your applications to use disks more effectively.
Chapter 12 provides a number of suggestions for reducing demand or
adding capacity.
In many cases, the use of directly connected terminals has been replaced by network-connected terminals (using LAT, DECnet, or TCP/IP) or by a windowing system such as DECwindows. In these cases, the terminal connections are part of the network load. However, some applications still rely on terminals or similar devices connected over serial lines, either directly or through modems. This section applies to that type of terminal.
Terminal operation, when improperly handled, can present a serious
drain on system resources. However, the resource that is consumed is
the CPU, not I/O. Terminal operation is actually a case for CPU
limitation investigation but is included here because it may initially
appear to be an I/O problem.
You will first suspect a terminal I/O problem when you detect a high
buffered I/O rate on the display for the MONITOR IO command. See
Figure A-14. Next, you should enter the MONITOR STATES command to
check if processes are in the COM state. This condition, in combination
with a high buffered I/O rate, suggests that the CPU is constricted by
terminal I/O demands. If you do not observe processes in the computable
state, you should conclude that while there is substantial buffered I/O
occurring, the system is handling it well. In that case, the problem
lies elsewhere.
If you do observe processes in the COM state, you must verify that the
high buffered I/O count is actually due to terminals and not to
communications devices, line printers, graphics devices, devices or
instrumentation provided by other vendors, or devices that emulate
terminals. Examine the operations counts for all such devices with the
lexical function F$GETDVI. See Section 8.3.2 for a discussion about
determining direct I/O rates. A high operations count for any device
other than a terminal device indicates that you should explore the
possibility that the other device is consuming the CPU resource.
If you find that the operations count for terminals is a high
percentage of the total buffered I/O count, you can conclude that
terminal I/O is degrading system performance. To further investigate
this problem, enter the MONITOR MODES command. From this display, you
should expect to find much time spent either in interrupt state or in
kernel mode. Too much time in interrupt state suggests that too many
characters are being transmitted in a few very large QIOs. Too much
time in kernel mode could indicate that too many small QIOs are
occurring.
If the MONITOR MODES display shows much time spent in kernel mode, perhaps the sheer number of QIOs involved is burdening the CPU. See Figure A-15. Explore whether the application can be redesigned to group the large number of QIOs into smaller numbers of QIOs that transfer more characters at a time. Such a design change could alleviate the condition, particularly if burst output devices are in use. It is also possible that some adjustment in the work load is feasible, which would balance the demand.
If neither of these approaches is possible, you need to reduce demand
or increase the capacity of the CPU (see Section 13.7).
Use the following MONITOR statistics to obtain the appropriate information:

- MONITOR IO: Buffered I/O Rate
- MONITOR STATES: number of processes in the computable (COM) state
- MONITOR MODES: time spent in interrupt state and in kernel mode
See Table B-1 for a summary of MONITOR data items.