HP OpenVMS Systems Documentation

HP Fortran for OpenVMS
User Manual

Chapter 5
Performance: Making Programs Run Faster

This chapter describes:

5.1 Software Environment and Efficient Compilation

Before you attempt to analyze and improve program performance, you should:

Obtain and install the latest version of HP Fortran, along with performance products that can improve application performance, such as the Compaq Extended Mathematical Library (CXML).
If possible, obtain and install the latest version of the OpenVMS operating system and processor firmware for your system.
Use the FORTRAN command and its qualifiers in a manner that lets the HP Fortran compiler perform as many optimizations as possible to improve run-time performance.
Use certain performance capabilities provided by the OpenVMS operating system.

5.1.1 Install the Latest Version of HP Fortran and Performance Products

To ensure that your software development environment can significantly improve the run-time performance of your applications, obtain and install the following optional software products:

The latest version of HP Fortran
New releases of the HP Fortran compiler and its associated run-time libraries may provide new features that improve run-time performance. The HP Fortran run-time libraries are shipped with the OpenVMS operating system.
If your application will run on OpenVMS systems other than your program development system, be sure those systems run the same (or a later) version of the OpenVMS operating system.
You can obtain the appropriate HP Services software product maintenance contract to automatically receive new versions of HP Fortran (or the OpenVMS operating system). For information on more recent HP Fortran releases, contact the HP Customer Support Center (CSC) if you have the appropriate support contract, or contact your local HP sales representative or authorized reseller.
Compaq Extended Mathematical Library (CXML) for OpenVMS Alpha Systems
Calling the Compaq Extended Mathematical Library (CXML) routines and installing the CXML product can make certain applications run significantly faster on OpenVMS Alpha systems. Refer to Chapter 15 for information on CXML.
Performance and Coverage Analyzer (profiler part of DECset)
You can purchase the Performance and Coverage Analyzer (PCA) product, which performs code profiling. PCA is one of a group of products comprising a development environment available from HP known as DECset. Other DECset products include the Language-Sensitive Editor (LSE), Source Code Analyzer (SCA), Code Management System (CMS), and the DEC/Test Manager (DTM).
To use the Source Code Analyzer (SCA), compile with the /ANALYSIS_DATA qualifier (see Section 2.3.4), which produces an analysis data file.
Other system-wide performance products
Other products are not specific to a particular programming language or application, but can improve system-wide performance, such as minimizing disk device I/O.
Adequate process quotas and pagefile space as well as proper system tuning are especially important when running large programs, such as those accessing large arrays.

For More Information:

About system-wide tuning and suggestions for other performance enhancements on OpenVMS systems, see the HP OpenVMS System Manager's Manual, Volume 2: Tuning, Monitoring, and Complex Systems.

5.1.2 Compile Using Multiple Source Files and Appropriate FORTRAN Qualifiers

During the earlier stages of program development, you can use incremental compilation with minimal optimization. For example:

$ FORTRAN /OPTIMIZE=LEVEL=1 SUB2
$ FORTRAN /OPTIMIZE=LEVEL=1 SUB3
$ FORTRAN /OPTIMIZE=LEVEL=1 MAIN
$ LINK MAIN SUB2 SUB3

During the later stages of program development, compile multiple source files together and use an optimization level of at least /OPTIMIZE=LEVEL=4 on the FORTRAN command line to allow more interprocedural optimizations. For instance, the following command compiles all three source files together using the default level of optimization (/OPTIMIZE=LEVEL=4):

$ FORTRAN MAIN.F90+SUB2.F90+SUB3.F90
$ LINK MAIN.OBJ

Compiling multiple source files using the plus sign (+) separator lets the compiler examine more code for possible optimizations, which results in:

Inlining more procedures
More complete data flow analysis
Reducing the number of external references to be resolved during linking

When compiling all source files together is not feasible (such as for very large programs), consider compiling source files containing related routines together with multiple FORTRAN commands, rather than compiling source files individually.

HP Fortran performs certain optimizations by default unless you disable them with the appropriate FORTRAN command qualifiers. Additional optimizations can be enabled or disabled using FORTRAN command qualifiers.

Table 5-1 lists the FORTRAN qualifiers that can directly improve run-time performance. Most of these qualifiers do not affect the accuracy of the results; a few improve run-time performance but can change some numeric results.

**Table 5-1 FORTRAN Qualifiers Related to Run-Time Performance**
Qualifier Names	Description and For More Information
/ALIGNMENT= keyword	Controls whether padding bytes are added between data items within common blocks, derived-type data, and Compaq Fortran 77 record structures to make the data items naturally aligned. See Section 5.3.
/ASSUME=NOACCURACY_SENSITIVE	Allows the compiler to reorder code based on algebraic identities to improve performance, enabling certain optimizations. The numeric results can be slightly different from the default (/ASSUME=ACCURACY_SENSITIVE) because of the way intermediate results are rounded. This slight difference in numeric results is acceptable to most programs. See Section 5.8.8.
/ARCHITECTURE= keyword (Alpha only)	Specifies the type of Alpha architecture code instructions to be generated for the program unit being compiled; it uses the same options (keywords) as used by the /OPTIMIZE=TUNE qualifier (Alpha only) (which controls instruction scheduling). See Section 2.3.6.
/FAST	Sets the following performance-related qualifiers: /ALIGNMENT=(COMMONS=NATURAL, RECORDS=NATURAL, SEQUENCE), /ARCHITECTURE=HOST, /ASSUME=NOACCURACY_SENSITIVE, /MATH_LIBRARY=FAST (Alpha only), and /OPTIMIZE=TUNE=HOST (Alpha only). See Section 5.8.3.
/INTEGER_SIZE= nn	Controls the sizes of INTEGER and LOGICAL declarations without a kind parameter. See Section 2.3.26.
/MATH_LIBRARY=FAST (Alpha only)	Requests the use of certain math library routines (used by intrinsic functions) that provide faster speed. Using this option causes a slight loss of accuracy and provides less reliable arithmetic exception checking to get significant performance improvements in those functions. See Section 2.3.30.
/OPTIMIZE=INLINE= keyword	Specifies the types of procedures to be inlined. If omitted, /OPTIMIZE=LEVEL= n determines the types of procedures inlined. Certain INLINE keywords are relevant only for /OPTIMIZE=LEVEL=1 or higher. See Section 2.3.35.
/OPTIMIZE=LEVEL= n (n = 0 to 5)	Controls the optimization level and thus the types of optimization performed. The default optimization level is /OPTIMIZE=LEVEL=4. Use /OPTIMIZE=LEVEL=5 to activate loop transformation optimizations. See Section 5.7.
/OPTIMIZE=LOOPS	Activates a group of loop transformation optimizations (a subset of /OPTIMIZE=LEVEL=5). See Section 5.7.
/OPTIMIZE=PIPELINE	Activates the software pipelining optimization (a subset of /OPTIMIZE=LEVEL=4). See Section 5.7.
/OPTIMIZE=TUNE= keyword (Alpha only)	Specifies the target processor generation (chip) architecture on which the program will be run, allowing the optimizer to make decisions about instruction tuning optimizations needed to create the most efficient code. Keywords allow specifying one particular Alpha processor generation type, multiple processor generation types, or the processor generation type currently in use during compilation. Regardless of the setting of /OPTIMIZE=TUNE= xxxx, the generated code will run correctly on all implementations of the Alpha architecture. See Section 5.8.6.
/OPTIMIZE=UNROLL=n	Specifies the number of times a loop is unrolled (n) when specified with optimization level /OPTIMIZE=LEVEL=3 or higher. If you omit /OPTIMIZE=UNROLL=n, the optimizer determines how many times loops are unrolled. See Section 5.7.4.1.
/REENTRANCY	Specifies whether code generated for the main program and any Fortran procedures it calls will be relying on threaded or asynchronous reentrancy. See Section 2.3.39.
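
To illustrate why /ASSUME=NOACCURACY_SENSITIVE can change numeric results, the following sketch evaluates the same three-term sum in two algebraically equivalent orders. The module and function names (REASSOC_DEMO, SUM_LEFT, SUM_RIGHT) are illustrative only and are not part of HP Fortran; the sketch assumes a double-precision format with about 53 bits of precision, such as IEEE double precision.

```fortran
MODULE REASSOC_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: DP = KIND(1.0D0)
CONTAINS

  ! Evaluates (A + B) + C; the intermediate result is stored in T
  ! so that the order of the two additions is explicit.
  FUNCTION SUM_LEFT(A, B, C) RESULT(S)
    REAL(DP), INTENT(IN) :: A, B, C
    REAL(DP) :: S, T
    T = A + B
    S = T + C
  END FUNCTION SUM_LEFT

  ! Evaluates A + (B + C); algebraically identical to SUM_LEFT,
  ! but the intermediate result is rounded at a different point.
  FUNCTION SUM_RIGHT(A, B, C) RESULT(S)
    REAL(DP), INTENT(IN) :: A, B, C
    REAL(DP) :: S, T
    T = B + C
    S = A + T
  END FUNCTION SUM_RIGHT

END MODULE REASSOC_DEMO
```

For example, with A = 1.0D20, B = -1.0D20, and C = 1.0D0, SUM_LEFT returns 1.0 but SUM_RIGHT returns 0.0, because the 1.0 is lost to rounding when it is added to -1.0D20. A compiler that is permitted to reorder such expressions may turn one form into the other, which is the kind of small difference the /ASSUME=NOACCURACY_SENSITIVE entry describes.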

Table 5-2 lists qualifiers that can slow program performance. Some applications that require floating-point exception handling or rounding need to use the /IEEE_MODE and /ROUNDING_MODE qualifiers. Other applications might need to use the /ASSUME=DUMMY_ALIASES qualifier for compatibility reasons. Other qualifiers listed in Table 5-2 are primarily for troubleshooting or debugging purposes.

**Table 5-2 Qualifiers that Slow Run-Time Performance**
Qualifier Names	Description and For More Information
/ASSUME=DUMMY_ALIASES	Forces the compiler to assume that dummy (formal) arguments to procedures share memory locations with other dummy arguments or with variables shared through use association, host association, or common block use. These program semantics slow performance, so you should specify /ASSUME=DUMMY_ALIASES only for the called subprograms that depend on such aliases. The use of dummy aliases violates the FORTRAN-77, Fortran 90, and Fortran 95 standards but occurs in some older programs. See Section 5.8.9.
/CHECK[= keyword]	Generates extra code for various types of checking at run time. This increases the size of the executable image, but may be needed for certain programs to handle arithmetic exceptions. Avoid using /CHECK=ALL except for debugging purposes. See Section 2.3.11.
/IEEE_MODE= keyword other than /IEEE_MODE=DENORM_RESULTS (on I64) or /IEEE_MODE=FAST (on Alpha)	On Alpha systems, using /IEEE_MODE=UNDERFLOW_TO_ZERO slows program execution (like /SYNCHRONOUS_EXCEPTIONS (Alpha only)). Using /IEEE_MODE=DENORM_RESULTS slows program execution even more than /IEEE_MODE=UNDERFLOW_TO_ZERO. See Section 2.3.24.
/ROUNDING_MODE=DYNAMIC	Certain rounding modes and changing the rounding mode can slow program execution slightly. See Section 2.3.40.
/SYNCHRONOUS_EXCEPTIONS	Generates extra code to associate an arithmetic exception with the instruction that causes it, slowing program execution. Use this qualifier only when troubleshooting, such as when identifying the source of an exception. See Section 2.3.46.
/OPTIMIZE=LEVEL=0, /OPTIMIZE=LEVEL=1, /OPTIMIZE=LEVEL=2, /OPTIMIZE=LEVEL=3	Minimizes the optimization level (and types of optimizations). Use during the early stages of program development or when you will use the debugger. See Section 2.3.35 and Section 5.7.
/OPTIMIZE=INLINE=NONE, /OPTIMIZE=INLINE=MANUAL	Minimizes the types of inlining done by the optimizer. Use such qualifiers only during the early stages of program development. The types of inlining optimizations are also controlled by the /OPTIMIZE=LEVEL qualifier. See Section 2.3.35 and Section 5.7.
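
The dummy-argument aliasing that /ASSUME=DUMMY_ALIASES accommodates can be pictured with the following sketch. The module and procedure names (ALIAS_DEMO, ADD_FIRST) are hypothetical and chosen only for illustration.

```fortran
MODULE ALIAS_DEMO
  IMPLICIT NONE
CONTAINS

  ! Adds the first element of SRC to every element of DST.
  !
  ! The language standards let the compiler assume that DST and SRC
  ! do not overlap, so an optimizer may load SRC(1) once and keep it
  ! in a register for the whole loop.
  !
  ! A call such as CALL ADD_FIRST(A, A, N) breaks that assumption:
  ! DST and SRC would share memory, and SRC(1) would change after
  ! the first iteration. That call is the nonstandard "dummy alias"
  ! case; it is shown here only in this comment.
  SUBROUTINE ADD_FIRST(DST, SRC, N)
    INTEGER, INTENT(IN)    :: N
    REAL,    INTENT(INOUT) :: DST(N)
    REAL,    INTENT(IN)    :: SRC(N)
    INTEGER :: I
    DO I = 1, N
      DST(I) = DST(I) + SRC(1)
    END DO
  END SUBROUTINE ADD_FIRST

END MODULE ALIAS_DEMO
```

A standard-conforming caller passes distinct arrays, for example by copying A to a temporary array TMP and then calling ADD_FIRST(A, TMP, N). Only subprograms that genuinely depend on calls of the ADD_FIRST(A, A, N) kind would need /ASSUME=DUMMY_ALIASES.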

For More Information:

On compiling multiple files, see Section 2.2.1.
On minimizing external references, see Section 10.2.1.

5.1.3 Process Environment and Related Influences on Performance

Certain DCL commands and system tuning can improve run-time performance:

Specify adequate process limits and do system tuning.
Especially when compiling or running large programs, check to make sure that process limits are adequate. In some cases, inadequate process limits may prolong compilation or program execution. For more information, see Section 1.2.
Your system manager can tune the system for efficient use. For example, to monitor system use during program execution or compilation, a system manager can use the MONITOR command.
For more information on system tuning, see your operating system documentation.
Redirect scrolled text.
For programs that display a lot of text, consider redirecting text that is usually displayed on SYS$OUTPUT to a file. Displaying a lot of text slows down execution; scrolling text in a terminal window on a workstation can cause an I/O bottleneck (increased elapsed time) and use some CPU time.
The following commands show how to run the program more efficiently by redirecting output to a file and then displaying the program output:
$ DEFINE /USER FOR006 RESULTS.LIS
$ RUN MYPROG
$ TYPE/PAGE RESULTS.LIS

For More Information:

About system-wide tuning and suggestions for other performance enhancements on OpenVMS systems, see the HP OpenVMS System Manager's Manual, Volume 2: Tuning, Monitoring, and Complex Systems.

5.2 Analyzing Program Performance

This section describes how you can:

Analyze program performance by timing program execution with the LIB$xxxx_TIMER routines or an equivalent DCL command procedure (Section 5.2.1)
Analyze program performance using the optional Performance and Coverage Analyzer tool (Section 5.2.2)

Before you analyze program performance, make sure any errors you might have encountered during the early stages of program development have been corrected.

5.2.1 Measuring Performance Using LIB$xxxx_TIMER Routines or Command Procedures

You can use LIB$xxxx_TIMER routines or an equivalent DCL command procedure to measure program performance.

Using the LIB$xxxx_TIMER routines allows you to display timing and related statistics at various points in the program as well as at program completion, including elapsed time, actual CPU time, buffered I/O, direct I/O, and page faults. If needed, you can use other routines or system services to obtain and report other information.

You can measure performance for the entire program by using a DCL command procedure (see Section 5.2.1.2). Although using a DCL command procedure does not report statistics at various points in the program, it can provide information for the entire program similar to that provided by the LIB$xxxx_TIMER routines.

5.2.1.1 The LIB$xxxx_TIMER Routines

Use the following routines together to provide information about program performance at various points in your program:

LIB$INIT_TIMER stores the current values of specified times and counts for use by LIB$SHOW_TIMER or LIB$STAT_TIMER routines.
LIB$SHOW_TIMER returns times and counts accumulated since the last call to LIB$INIT_TIMER and displays them on SYS$OUTPUT.
LIB$STAT_TIMER returns times and counts accumulated since the last call to LIB$INIT_TIMER and stores them in memory.

Run program timings when other users are not active. Timing results can be affected by one or more CPU-intensive processes that run at the same time as your timings.

Try to run the program under the same conditions each time to provide the most accurate results, especially when comparing execution times of a previous version of the same program. Use the same CPU system (model, amount of memory, version of the operating system, and so on) if possible.

If you do need to change systems, you should measure the time using the same version of the program on both systems, so you know each system's effect on your timings.

For programs that run for less than a few seconds, repeat the timings several times to ensure that the results are not misleading. Overhead functions might influence short timings considerably.
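
One way to repeat a short measurement is sketched below. It uses the standard SYSTEM_CLOCK intrinsic rather than the LIB$xxxx_TIMER routines, so it reports elapsed time only. The module name REPEAT_TIMING, the routine TIME_REPEATED, and the dummy procedure WORK (the code being timed) are assumptions made for illustration.

```fortran
MODULE REPEAT_TIMING
  IMPLICIT NONE
CONTAINS

  ! Calls WORK NREPS times and returns the shortest (BEST) and the
  ! average (MEAN) elapsed time in seconds. Both results are set to
  ! -1.0 if no system clock is available or NREPS is less than 1.
  SUBROUTINE TIME_REPEATED(WORK, NREPS, BEST, MEAN)
    INTERFACE
      SUBROUTINE WORK()
      END SUBROUTINE WORK
    END INTERFACE
    INTEGER, INTENT(IN)  :: NREPS
    REAL,    INTENT(OUT) :: BEST, MEAN
    INTEGER :: I, T0, T1, TICKS, RATE, CMAX
    REAL    :: SECS, TOTAL

    CALL SYSTEM_CLOCK(COUNT_RATE=RATE, COUNT_MAX=CMAX)
    IF (RATE <= 0 .OR. NREPS < 1) THEN
      BEST = -1.0
      MEAN = -1.0
      RETURN
    END IF

    TOTAL = 0.0
    BEST  = HUGE(BEST)
    DO I = 1, NREPS
      CALL SYSTEM_CLOCK(COUNT=T0)
      CALL WORK()
      CALL SYSTEM_CLOCK(COUNT=T1)
      TICKS = T1 - T0
      ! The clock count wraps to zero after reaching COUNT_MAX
      IF (TICKS < 0) TICKS = TICKS + CMAX + 1
      SECS  = REAL(TICKS) / REAL(RATE)
      TOTAL = TOTAL + SECS
      BEST  = MIN(BEST, SECS)
    END DO
    MEAN = TOTAL / REAL(NREPS)
  END SUBROUTINE TIME_REPEATED

END MODULE REPEAT_TIMING
```

A large gap between BEST and MEAN suggests that overhead or other system activity influenced some of the runs, in which case more repetitions or a quieter system may give more representative timings.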

You can use the LIB$SHOW_TIMER (or LIB$STAT_TIMER) routine to return elapsed time, CPU time, buffered I/O, direct I/O, and page faults:

Elapsed time, sometimes called "wall clock" time, is greater than the total charged actual CPU time.
Charged actual CPU time is the amount of actual CPU time used by the process.
Buffered I/O occurs when an intermediate buffer from the system buffer pool is used instead of a process-specific buffer.
Direct I/O occurs when the I/O transfer takes place directly between the process buffer and the device.
A page fault occurs when a reference is made to a page that is not in the process working set.

The HP Fortran program shown in Example 5-1 reports timings for three different sections of the main program, including accumulated statistics (for a scalar program).

Example 5-1 Measuring Program Performance Using LIB$SHOW_TIMER and LIB$INIT_TIMER

!  Example use of LIB$SHOW_TIMER to time an HP Fortran program

 PROGRAM TIMER

   INTEGER TIMER_CONTEXT
   DATA    TIMER_CONTEXT /0/

!  Initialize default timer stats to 0

   CALL LIB$INIT_TIMER

!  Sample first section of code to be timed

   DO I=1,100
     CALL MOM
   ENDDO

!  Display stats

   TYPE *,'Stats for first section'
   CALL LIB$SHOW_TIMER

!  Zero second timer context

   CALL LIB$INIT_TIMER (TIMER_CONTEXT)

!  Sample second section of code to be timed

   DO I=1,1000
     CALL MOM
   ENDDO

!  Display stats

   TYPE *,'Stats for second section'
   CALL LIB$SHOW_TIMER (TIMER_CONTEXT)
   TYPE *,'Accumulated stats for two sections'
   CALL LIB$SHOW_TIMER

!  Re-Initialize second timer stats to 0

   CALL LIB$INIT_TIMER (TIMER_CONTEXT)

!  Sample Third section of code to be timed

   DO I=1,1000
     CALL MOM
   ENDDO

!  Display stats

   TYPE *,'Stats for third section'
   CALL LIB$SHOW_TIMER (TIMER_CONTEXT)
   TYPE *,'Accumulated stats for all sections'
   CALL LIB$SHOW_TIMER

 END PROGRAM TIMER

!  Sample subroutine performs enough processing so times aren't all 0.0

 SUBROUTINE MOM
   COMMON  BOO(10000)
   DOUBLE PRECISION BOO
   BOO = 0.5    ! Initialize all array elements to 0.5

   DO I=2,10000
      BOO(I)   = 4.0+(BOO(I-1)+1)*BOO(I)*COSD(BOO(I-1)+30.0)
      BOO(I-1) = SIND(BOO(I)**2)
   ENDDO

   RETURN

 END SUBROUTINE MOM

The LIB$xxxx_TIMER routines use a single default timer when called without an argument. When you call LIB$xxxx_TIMER routines with an INTEGER argument whose initial value is 0 (zero), you enable use of multiple timers.

The LIB$INIT_TIMER routine must be called at the start of the timing. It can be called again at any time to reset (set to zero) the values.

In Example 5-1, LIB$INIT_TIMER is:

Called once at the start of the program without an argument. This initializes what will become accumulated statistics and starts the collection of the statistics. You can think of this as the first timer.
Called once at the start of each section with the INTEGER context argument TIMER_CONTEXT. This resets the values for the current section to zero and starts the collection of the statistics. You can think of this as the second timer, which gets reset for each section.

The LIB$SHOW_TIMER routine displays the timer values saved by LIB$INIT_TIMER to SYS$OUTPUT (or to a specified routine). Your program must call LIB$INIT_TIMER before LIB$SHOW_TIMER at least once (to start the timing).

Like LIB$INIT_TIMER:

Calling LIB$SHOW_TIMER without any arguments displays the default accumulated statistics.
Calling LIB$SHOW_TIMER with an INTEGER context variable (TIMER_CONTEXT) displays the statistics for the current section.

The free-format source file, TIMER.F90, might be compiled and linked as follows:

$ FORTRAN/FLOAT=IEEE_FLOAT TIMER
$ LINK TIMER

When the program is run (on a low-end Alpha system), it displays timing statistics for each section of the program as well as accumulated statistics:

$ RUN TIMER 
Stats for first section
 ELAPSED:    0 00:00:02.36  CPU: 0:00:02.21  BUFIO: 1  DIRIO: 0  FAULTS: 23
Stats for second section
 ELAPSED:    0 00:00:22.31  CPU: 0:00:22.09  BUFIO: 1  DIRIO: 0  FAULTS: 0
Accumulated stats for two sections
 ELAPSED:    0 00:00:24.68  CPU: 0:00:24.30  BUFIO: 5  DIRIO: 0  FAULTS: 27
Stats for third section
 ELAPSED:    0 00:00:22.24  CPU: 0:00:21.98  BUFIO: 1  DIRIO: 0  FAULTS: 0
Accumulated stats for all sections
 ELAPSED:    0 00:00:46.92  CPU: 0:00:46.28  BUFIO: 9  DIRIO: 0  FAULTS: 27

$

You might:

Run the program multiple times and average the results.
Use different compilation qualifiers to see which combination provides the best performance.

Instead of the LIB$xxxx_TIMER routines (specific to the OpenVMS operating system), you might consider modifying the program to call other routines within the program to measure execution time (but not obtain other process information). For example, you might use HP Fortran intrinsic procedures, such as SYSTEM_CLOCK, DATE_AND_TIME, and TIME.
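
As a minimal sketch of that approach, the following module mimics the initialize-then-report pattern of Example 5-1 using only the standard SYSTEM_CLOCK intrinsic. The names PORTABLE_TIMER, TIMER_INIT, and TIMER_SECONDS are hypothetical; each INTEGER variable passed as START acts as a separate timer, loosely comparable to a timer context.

```fortran
MODULE PORTABLE_TIMER
  IMPLICIT NONE
CONTAINS

  ! Records the current clock count in START (starts a timer).
  SUBROUTINE TIMER_INIT(START)
    INTEGER, INTENT(OUT) :: START
    CALL SYSTEM_CLOCK(COUNT=START)
  END SUBROUTINE TIMER_INIT

  ! Returns elapsed seconds since TIMER_INIT was called with START,
  ! or -1.0 if the processor provides no system clock.
  FUNCTION TIMER_SECONDS(START) RESULT(SECS)
    INTEGER, INTENT(IN) :: START
    REAL    :: SECS
    INTEGER :: NOW, RATE, CMAX, TICKS

    CALL SYSTEM_CLOCK(COUNT=NOW, COUNT_RATE=RATE, COUNT_MAX=CMAX)
    IF (RATE <= 0) THEN
      SECS = -1.0
      RETURN
    END IF
    TICKS = NOW - START
    ! The clock count wraps to zero after reaching COUNT_MAX
    IF (TICKS < 0) TICKS = TICKS + CMAX + 1
    SECS = REAL(TICKS) / REAL(RATE)
  END FUNCTION TIMER_SECONDS

END MODULE PORTABLE_TIMER
```

A program might call TIMER_INIT once with one variable at the start for accumulated time, call it again with a second variable before each section, and print TIMER_SECONDS for either variable where Example 5-1 calls LIB$SHOW_TIMER. Unlike the LIB$xxxx_TIMER routines, this sketch reports elapsed time only; it does not report CPU time, I/O counts, or page faults.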

For More Information:

On the LIB$ RTL routines, see the HP OpenVMS RTL Library (LIB$) Manual.
On HP Fortran intrinsic procedures, see the HP Fortran for OpenVMS Language Reference Manual.
