HP OpenVMS Systems Documentation

OpenVMS Performance Management

Chapter 7
Evaluating the Memory Resource

The key to successful performance management of an OpenVMS system is to keep the memory management activity to a minimum. You will find that memory limitations cause paging, swapping, or both, precisely the activities you want to minimize. It requires skillful balancing of the memory management mechanism to reduce one without incurring too much of the other.

7.1 Understanding the Memory Resource

The memory resource shares some similarities with the other resources, but it exhibits some notable differences. It is similar to the CPU and disk in that it is a single resource pool that must be shared, but different in the sense that it can be separated into pieces of varying size, all of which can be allocated to processes simultaneously. A process can retain its allocation of memory until memory is demanded by other processes (page faulting), at which time the sizes of the pieces are reconfigured. In some cases, certain processes must wait longer for their allocations (swapping).

7.1.1 Working Set Size

The key to good performance of the memory subsystem is to maintain working sets of appropriate size for resident processes. As a rule, the total of all resident process working set quotas should be within the amount of free memory available on the system. When there is abundant free memory available, the borrowing mechanism of the memory management subsystem allows working sets to grow to the value specified in the user authorization file by WSEXTENT. However, you should set the WSQUOTA value so that user programs can have reasonable faulting behavior even if they can grow only to WSQUOTA.
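
As a rough illustration of the rule above, the following Python sketch totals WSQUOTA across resident processes and compares the total with free memory. The function name, the sample quotas, and the free-page figure are hypothetical; on a real system the quotas would come from a display such as the one produced by WORKING_SET.COM, and the free-memory figure from SHOW MEMORY.

```python
def quota_headroom(wsquotas, free_pages):
    """Return the free pages left if every resident process grew to WSQUOTA.

    A negative result means the combined quotas exceed free memory.
    """
    return free_pages - sum(wsquotas)


# Hypothetical figures: five resident processes and 4000 free pages.
quotas = [512, 512, 750, 1024, 1024]
print(quota_headroom(quotas, free_pages=4000))
```

A negative headroom does not by itself prove a problem, but under this rule of thumb it suggests that the quotas deserve a second look.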

7.1.2 Locality of Reference

Erratic code and data reference patterns by user programs can cause memory to be used inefficiently. Locality of reference is a characteristic of a program that indicates how close or far apart the references to locations in virtual memory are over time. A program with a high degree of locality does not refer to many widely scattered virtual addresses in a short period of time. If an application has been designed with poor virtual address reference patterns, it can require an extremely large WSQUOTA value to perform satisfactorily.

In addition, applications such as AI and CAD/CAM, which perform an inordinately large amount of dynamic memory allocation, often require very large WSQUOTA values. Database programs may also benefit from larger working sets if they cache significant amounts of data or indexes in memory.

7.1.3 Obtaining Working Set Values

One way to obtain information about working set values on the running system is to use the procedure shown in Example 7-1, which produces the display shown in Example 7-2. You may want to execute it several times during some representative period of loading to gain an idea of the steady-state working set requirements for your system.

Example 7-1 Procedure to Obtain Working Set Information

$!
$! WORKING_SET.COM - Command file to display working set information.
$!                   Requires 'WORLD' privilege to display information
$!                   on processes other than your own.
$!
$! the next symbol is used to insert quotes into command strings
$! because of the way DCL processes quotes, you can't have a
$! trailing comment after the quotes on the next line.
$!
$ quote = """
$!
$ pid = ""  ! initialize to blank
$ context = ""  ! initialize to blank
$!
$! Define a format control string which will be used with
$! F$FAO to output the information.  The width of the
$! string will be set according to the width of the
$! display terminal (the image name is truncated, if needed).
$!
$ IF F$GETDVI ("SYS$OUTPUT", "DEVBUFSIZ") .LE. 80
$ THEN
$    ctrlstring = "!AS!15AS!5AS!5(6SL)!7SL !10AS"
$ ELSE
$    ctrlstring = "!AS!15AS!5AS!5(6SL)!7SL !AS"
$ ENDIF
$!
$! Check to see if this procedure was invoked with the PID of
$! one specific process to check.  If it was, use that PID.  If
$! not, the procedure will scan for all PIDs where there is
$! sufficient privilege to fetch the information.
$!
$ IF p1 .NES. "" THEN pid = p1
$!
$! write out a header.
$!
$ WRITE sys$output -
"                          Working Set Information"
$ WRITE sys$output ""
$ WRITE sys$output -
"                                 WS    WS    WS     WS   Pages  Page"
$ WRITE sys$output -
"Username    Processname   State  Extnt Quota Deflt  Size in WS  Faults Image"
$ WRITE sys$output ""
$!
$! Begin collecting information.
$!
$ collect_loop:
$!
$ IF P1  .EQS. "" THEN pid = F$PID (context) ! get the next PID
$ IF pid .EQS. "" THEN EXIT   ! if blank, no more to
$!      ! check, or no privilege
$ pid = quote + pid + quote   ! enclose in quotes
$!
$ username     = F$GETJPI ('pid, "USERNAME") ! retrieve proc. info.
$!
$ IF username .EQS. "" THEN GOTO collect_loop ! if blank, no priv.; try
$!      ! next PID
$ processname  = F$GETJPI ('pid, "PRCNAM")
$ imagename    = F$GETJPI ('pid, "IMAGNAME")
$ imagename    = F$PARSE  (imagename,,,"NAME") ! separate name from filespec
$ state        = F$GETJPI ('pid, "STATE")
$ wsdefault    = F$GETJPI ('pid, "DFWSCNT")
$ wsquota      = F$GETJPI ('pid, "WSQUOTA")
$ wsextent     = F$GETJPI ('pid, "WSEXTENT")
$ wssize       = F$GETJPI ('pid, "WSSIZE")
$ globalpages  = F$GETJPI ('pid, "GPGCNT")
$ processpages = F$GETJPI ('pid, "PPGCNT")
$ pagefaults   = F$GETJPI ('pid, "PAGEFLTS")
$!
$ pages        = globalpages + processpages ! add pages together
$!
$! format the information into a text string
$!
$ text = F$FAO (ctrlstring, -
  username, processname, state, wsextent, wsquota, wsdefault, wssize, -
  pages, pagefaults, imagename)
$!
$ WRITE sys$output text    ! display information
$!
$ IF p1 .NES. "" THEN EXIT   ! if invoked for a
$!      ! specific PID, we're done.
$ GOTO collect_loop    ! repeat for next PID

7.1.4 Displaying Working Set Values

The WORKING_SET.COM procedure produces the following display:

Example 7-2 Displaying Working Set Values

                          Working Set Information

                                   WS    WS    WS    WS  Pages  Page
Username    Processname   State  Extnt Quota Deflt  Size in WS faults  Image

SYSTEM      ERRFMT         HIB    1024   512   100    60    60    165 ERRFMT
SYSTEM      CACHE_SERVER   HIB    1024   512   100   512    75     55 FILESERV
SYSTEM      CLUSTER_SERVER HIB    1024   512   100    60    60    218 CSP

SYSTEM      OPCOM          LEF    2048   512   100   210    59   5764 OPCOM
SYSTEM      JOB_CONTROL    HIB    1024   512   100   360   238   1459 JOBCTL
SYSTEM      CONFIGURE      HIB    1024   512   100   125   121    101 CONFIGURE
SYSTEM      SYMBIONT_0001  HIB    1024   512   100   668    57  67853 PRTSMB
DECNET      NETACP         HIB    1500   750   175  1200   812  10305 NETACP
DECNET      EVL            HIB    1024   350   175   210    33  84080 EVL
SYSTEM      REMACP         HIB    1024   350   175    60    47     74 REMACP
SYSTEM      VAXsim_Monitor HIB    1024   200   100   350   210   1583 VAXSIM
SYSTEM      DBMS_MONITOR   LEF    1000   512   150    62    62    488 DBMMON
SYSTEM      TINKERBELLE    LEF    1024   350   175   325   177   1627
SYSTEM      NULF           COM    1024   350   250   350   246   1007 FAC
HALL        CFAI           COM    2400  1024   512   662   358    567 CFAI
VTXUP       VTX_SERVER     LEF    2400  1024   512   962   696    624 VTXSRV
WEINSTEIN   Jane           LEF    2400  1024   512   662   432  13132 EDT
HURWITZ     HURWITZ        LEF    2400  1024   512   512   350   4605
CARMODY     CARMODY        LEF    2400  1024   512   812   546  16822 MAIL
CAPARILLIO  CAPARILLIO     CUR    2400  1024   512   512   282  10839
STRATFORD   Kathy          LEF    2400  1024   512   512   210   9852
FREY        _VTA270:       LEF    2400  1024   512   512   163   1021
CHRISTOPHER _VTA271:       LEF    2400  1024   512   512   252    379
STANLEY     STANLEY        LEF    2048  1024   512   512   295  10369
MINSKY      MINSKY         LEF    2400  1024   512   512   143  60316
TESTGEN     TESTGEN        LEF    4100  1024   512   234    84  75753
CLAYMORE    Cluster Buster LEF    2400  1024   512  1262   932   1919 CREATOR
DINEAUX     Sally          LEF    2400  1024   512   512   330  31803
DECNET      SERVER_0848    LEF    1024   350   175   325   183    647 NETSERVER
LUZ         Lars           LEF    2400  1024   512  1024   980  95420 TEX
DECNET      MAIL_222       LEF    1024   350   175   325   234    526 MAIL

STEVENS     STEVENS        LEF    2400  1024   512   512   221   7851
ZEN         _VTA259:       LEF    2400  1024   512  1024   319   4267 SHOW
ZEN         ZEN_2          LEF    2400  1024   512   512   171   3026

Field	Description
WS Deflt	Default working set size, which is reestablished at each image activation.
WS Size	Current size of the working set. When the number of pages actually allocated (Pages in WS) reaches this threshold, subsequent page faults will cause page replacement.
Pages in WS	Number of pages currently in the working set, counting both private and global pages.
WS Extnt, WS Quota	Threshold values to which WS Size can be adjusted.
Page faults	Total number of faults that have occurred since process creation.
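
The relationship between WS Size and Pages in WS can be checked mechanically. The following Python sketch flags processes whose page count has reached WS Size, the point at which further faults cause page replacement. The three rows are copied from Example 7-2; the function name and the tuple layout are assumptions made for illustration only.

```python
# (process name, WS Size, Pages in WS), taken from three rows of Example 7-2.
rows = [
    ("ERRFMT", 60, 60),
    ("OPCOM", 210, 59),
    ("DBMS_MONITOR", 62, 62),
]


def at_ws_size(rows):
    """Return the names of processes whose page count has reached WS Size."""
    return [name for name, ws_size, pages in rows if pages >= ws_size]


print(at_ws_size(rows))
```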

7.2 Evaluating Memory Responsiveness

The key measure of responsiveness for the memory management subsystem is the amount of time required for a process to be allocated its share of memory.

Because allocation time is not measured directly, you should be concerned with the rates of the two memory management activities that extend the processing time experienced by processes in a virtual memory system---namely, page faulting and swapping. These activities not only incur overhead on the CPU and disk resources, but they also block the execution of processes during the time the system needs to allocate memory and the time the processes spend waiting for memory allocation.

Thus, your goal in evaluating the memory resource is to ensure that faulting and swapping rates are kept within reasonable bounds.

7.2.1 Page Faulting

Whenever a process references a virtual page that is not in its working set, a page fault occurs. For process execution to continue, memory management software is called to acquire and map a physical page into the working set.

7.2.1.1 Hard and Soft Page Faults

The fault can be hard or soft. A hard fault (measured by the Page Read I/O Rate item in the MONITOR PAGE class) is one that requires a read operation from a page or image file on disk. A soft fault is one that is satisfied by mapping to a page already in memory; this can be a global page or a page in the secondary page cache. (The secondary page cache consists of the free-page list and the modified-page list; the primary page cache is each process's working set.) The following categories of soft faults are measured and reported in the MONITOR PAGE class:

Free List Fault Rate---The rate of page faults satisfied by reclaiming from the free-page list a page that was previously allocated to a process. An excessive rate of free-page list faults can occur when working set quotas are too small, causing excessive page replacement.
Modified List Fault Rate---The rate of page faults satisfied by reclaiming a page from the modified-page list. An excessive rate of modified-page list faults can occur when working set quotas are too small.
Demand Zero Fault Rate---The rate of page faults satisfied by allocating a free page and initializing its contents to zero. This type of fault is typically seen during image activation and whenever the virtual address space is expanded.
Global Valid Fault Rate---The rate of page faults satisfied by mapping a shared page that is already valid (one already in another process's working set). Swapping or image activation can cause an elevated global valid fault rate.
Write in Progress Fault Rate---The rate of page faults satisfied by mapping to a page that is in the process of being written back to disk. The rate for this type of fault is typically very low.

The total Page Fault Rate is the sum of the hard fault rate (Page Read I/O Rate) and the soft fault rate, where the soft fault rate is the sum of the five categories listed above.
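
The arithmetic can be written out as a short Python sketch. The function and parameter names are hypothetical labels for the MONITOR PAGE items described above, and the sample rates are invented for illustration.

```python
def page_fault_rates(page_read_io, free_list, modified_list,
                     demand_zero, global_valid, write_in_progress):
    """Return (soft fault rate, total fault rate) from per-second rates."""
    soft = (free_list + modified_list + demand_zero
            + global_valid + write_in_progress)
    return soft, page_read_io + soft


# Invented sample rates, in faults per second.
soft, total = page_fault_rates(page_read_io=4, free_list=20, modified_list=10,
                               demand_zero=30, global_valid=15,
                               write_in_progress=1)
print(soft, total)
```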

System Fault Rate is the rate of faults for which the referenced virtual address is in system space (hex address 80000000 and above). It is not included in the overall Page Fault Rate, and is discussed separately in Section 11.1.2.

Your own judgment, based on familiarity with the data in your MONITOR summaries, is the best determinant of an acceptable Page Fault Rate for your system.

When either of the following thresholds is exceeded, you may want to consider improving memory responsiveness. (See Section 11.1.)

Hard faults (Page Read I/O Rate) should be kept as low as possible, and in any case to no more than 10% of the overall Page Fault Rate. When the hard fault rate exceeds this threshold, you can assume that the secondary page cache is not being used efficiently.
Overall Page Fault Rate begins to become excessive when more than 1--2% of the CPU is devoted to soft faulting (faulting that involves no disk I/O).
While these rules do not represent absolute upper limits, rates that exceed the suggested limits are warning signs that either the memory resource should be improved by one of the four means listed in Section 11.1, or a memory upgrade should be considered. Note, however, that more memory will not reduce the number of page faults caused by image activation.
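
The 10% guideline for hard faults can be expressed as a simple check. This is a minimal Python sketch, not part of MONITOR; the function names and the sample rates are hypothetical, and the limit is only the rule of thumb stated above.

```python
def hard_fault_share(page_read_io_rate, page_fault_rate):
    """Return hard faults as a fraction of all page faults (0.0 if no faults)."""
    if page_fault_rate == 0:
        return 0.0
    return page_read_io_rate / page_fault_rate


def exceeds_hard_fault_guideline(page_read_io_rate, page_fault_rate, limit=0.10):
    """True when hard faults make up more than `limit` of the overall rate."""
    return hard_fault_share(page_read_io_rate, page_fault_rate) > limit


# Invented sample: 12 hard faults per second out of 80 faults per second.
print(exceeds_hard_fault_guideline(12, 80))
```

Because image activation also causes hard faults, a result above the limit is a prompt to investigate rather than a diagnosis.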

7.2.1.2 Secondary Page Cache

Paging problems typically occur when the secondary page cache (free-page list and modified-page list) is too small. This systemwide cache, which is sized by AUTOGEN, should be large enough to ensure that the overall fault rate is not excessive and that most faults are soft faults.

When evaluating paging activity on your system, you should check for processes in the free page wait (FPG), collided page wait (COLPG), and page fault wait (PFW) states and note departures from normal figures. The presence of processes in the FPG state almost always indicates serious memory management problems, because it implies that the free-page list has been depleted.

Processes in the PFW and COLPG states are waiting for hard faults (from disk) to be satisfied. Note, however, that while hard fault waiting is undesirable, it is not as serious as swapping.

An average free-page list size that is between the values of the FREELIM and FREEGOAL system parameters usually indicates deficient memory and is often accompanied by a high page fault rate. If either condition exists, or if the hard fault rate exceeds the recommended percentage, you must consider enlarging the free- and modified-page lists, if possible. Enlarging the secondary page cache could reduce hard faulting, provided such faulting is not the result of image activation.
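
The free-page list test described above amounts to a range check. In the Python sketch below, the function name is hypothetical and the FREELIM and FREEGOAL values are invented; substitute the values in effect on your system.

```python
def between_freelim_and_freegoal(avg_free_pages, freelim, freegoal):
    """True when the average free-page list size lies between the two limits."""
    return freelim <= avg_free_pages <= freegoal


# Invented parameter values, for illustration only.
FREELIM = 64
FREEGOAL = 200

print(between_freelim_and_freegoal(150, FREELIM, FREEGOAL))
print(between_freelim_and_freegoal(900, FREELIM, FREEGOAL))
```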

The easiest way to increase the free page cache is to increase the value of FREEGOAL. Active reclamation will then attempt to recover more memory from idle processes. Typically, overall fault rates decrease when active reclamation is enabled because memory is more readily available to active processes.

A high rate of modified-page writing, for example, as shown in the Page Write I/O Rate field of the MONITOR PAGE display, is an indication that the modified-page list might be too small. A rate of one write every 2 seconds is fairly high. The modified-page list should be large enough to provide an equilibrium between the rate at which pages are added to the list and the modified-page list fault rate, without causing excessive list writing by reaching MPW_HILIMIT. If you do adjust the size of the modified-page list using MPW_HILIMIT, make sure you retain the relationship among MPW_HILIMIT, MPW_WAITLIMIT, and MPW_LOWAITLIMIT by using AUTOGEN.
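
One write every 2 seconds corresponds to 0.5 writes per second. The following Python sketch converts a write count observed over an interval into a rate and compares it with that figure. The names and sample counts are hypothetical, and treating 0.5 as a threshold is an assumption drawn from the "fairly high" remark above.

```python
FAIRLY_HIGH_WRITES_PER_SECOND = 0.5  # one write every 2 seconds


def page_write_rate(writes, interval_seconds):
    """Return modified-page writes per second over the observed interval."""
    return writes / interval_seconds


def write_rate_is_high(writes, interval_seconds):
    """True when the observed rate reaches the 'fairly high' figure."""
    return page_write_rate(writes, interval_seconds) >= FAIRLY_HIGH_WRITES_PER_SECOND


# Invented sample: 40 writes observed in a 60-second interval.
print(write_rate_is_high(40, 60))
```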

If you are able to increase the size of the free-page list, you can then allocate more memory to the modified-page list. Using AUTOGEN, you can increase the modified-page list by adjusting the appropriate MPW system parameters. (See the OpenVMS System Management Utilities Reference Manual for a description of MPW parameters.)

7.2.2 Swapping and Swapper Trimming

Swapping, when considered in isolation, is an expensive operation. It can place a huge transfer load on the I/O subsystem instantaneously. Swapping also can place heavy demand on CPU resources. However, when used as part of the active memory reclamation policy, swapping results in improved---that is, reduced---memory consumption and a lower page fault rate.

Good and Bad Swapping

There is good swapping and bad swapping. The latter occurs as the last step of reactive memory reclamation when the free-page list is exhausted---that is, when it is smaller than FREELIM. However, having a significant number of outswapped processes on your system when active memory reclamation is enabled is not a cause for alarm. A much more reliable indicator that harmful swapping is occurring is a high inswap rate---for example, greater than one process per second.

Artificially Induced Swapping

Before attempting to improve a system with a high inswap rate, do the following:

Check for a condition known as artificially induced swapping. This condition occurs when there are no available balance set slots.
Check the BALSETCNT system parameter. Swapping may have been artificially induced because BALSETCNT is set too low (see Section 11.14).

You can obtain information on balance slots with the DCL command SHOW MEMORY.
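
The two checks above can be combined into a small decision sketch. The Python function below is hypothetical; it assumes you have already read the inswap rate from MONITOR and the number of free balance slots from SHOW MEMORY, and it uses the one-process-per-second figure given earlier as the guideline.

```python
def classify_inswapping(inswap_rate, free_balance_slots):
    """Return a short diagnosis string for the observed inswap behavior."""
    if inswap_rate <= 1.0:
        return "inswap rate within guideline"
    if free_balance_slots == 0:
        return "possible artificially induced swapping: check BALSETCNT"
    return "high inswap rate: investigate memory limitation"


# Invented sample: 2.5 inswaps per second with no free balance slots.
print(classify_inswapping(2.5, 0))
```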

A possible, although unlikely, reason for a high inswap rate might be an overly large value for FREEGOAL when active memory reclamation is enabled. Although this policy outswaps only long-waiting processes, a very large value for FREEGOAL will cause the outswapping of many long-waiting processes over time, thus increasing the inswap rate as these processes become computable.

7.3 Analyzing the Excessive Paging Symptom

Whenever you detect paging or swapping on a system with degraded performance, you should investigate a memory limitation. If you observe a lack of free memory but no serious paging or swapping, the system may be just at the point where it will begin to experience excessive paging or swapping if demand grows any more.

In this case, you have a bit of advance warning, and you may want to examine some preventive measures.

7.3.1 What Is Excessive Paging?

There are no universally applicable scales that rank page faulting rates from moderate to excessive.

Although the only good page faulting rate is zero page faults per second, you need to think in terms of the maximum tolerable rate of page faulting for your system.

7.3.2 Guidelines

Observe the following guidelines:

You should define the maximum tolerable page fault rate. You should view any higher page fault rate as excessive.
Paging always consumes system resources (CPU and I/O); therefore, its harmfulness depends entirely on the availability of the resources consumed.
In judging what page faulting rate is the maximum tolerable rate for your system, you must consider your configuration and the type of paging that is occurring.
For example, on a system with slow disks, what might otherwise seem to be a low rate of paging to the disk could actually represent intolerable paging because of the response time through the slow disk. This is especially true if the percentage of page faults from the disk is high relative to the total number of faults.
You can judge page fault rates only in the context of your own configuration.
The statistics must be examined in the context of both the overall faulting and the apparent system performance. The system manager who knows the configuration can best evaluate the impact of page faulting.

Once you have determined that the rate of paging is excessive, you need to determine the cause. As Figure A-3 shows, you can begin by looking at the number of image activations that have been occurring.

7.3.3 Excessive Image Activations

Use ACCOUNTING to examine the total number of images started.

If...	Then...
Image-level accounting is enabled and the value is in the low-to-normal range for typical operations at your site	The problem lies elsewhere.
Image-level accounting is NOT enabled	Check the display produced by the MONITOR PAGE command for demand zero faults.
50% of all page faults are demand zero faults	Image activations are too frequent.
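
The last row of the table can be checked with a little arithmetic. This Python sketch is only an illustration: the function names and sample rates are hypothetical, and the rates are assumed to be the Demand Zero Fault Rate and the overall Page Fault Rate from MONITOR PAGE.

```python
def demand_zero_share(demand_zero_rate, page_fault_rate):
    """Return demand zero faults as a fraction of all page faults."""
    if page_fault_rate == 0:
        return 0.0
    return demand_zero_rate / page_fault_rate


def activations_look_excessive(demand_zero_rate, page_fault_rate, threshold=0.5):
    """True when demand zero faults reach the 50% share named in the table."""
    return demand_zero_share(demand_zero_rate, page_fault_rate) >= threshold


# Invented sample: 45 demand zero faults per second out of 80 faults per second.
print(activations_look_excessive(45, 80))
```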

Additional Considerations

If image activations seem to be excessive, do the following:

Enable image-level accounting (if it is not enabled) at this time and collect enough data to confirm the conclusion about the high percentage of demand zero faults.
Determine how to reduce the number of image activations by reviewing the guidelines for application design in Section 11.2.
The problem of paging induced by image activations is unlikely to respond to any attempt at system tuning. The appropriate action involves application design changes.

7.3.4 Characterizing Hard Versus Soft Faults

You should characterize your page faulting as hard or soft. Paging from disk is hard paging, and it is the less desirable of the two.

Soft paging refers to paging from the page cache in main memory. Although soft paging is undesirable when it is excessive, it is normally much less costly to overall system performance than disk paging, simply because it is faster.
