This chapter contains the following topics:
To invoke the Compaq Fortran compiler, use:
This chapter uses f90 to indicate invoking Compaq Fortran on both systems; replace this command with fort if you are working on a Linux Alpha system. To invoke the Compaq C compiler, use:
This chapter uses cc to indicate invoking Compaq C on both systems; replace this command with ccc if you are working on a Linux Alpha system.
Before you attempt to analyze and improve program performance, you should:
To ensure that your software development environment can significantly improve the run-time performance of your applications, obtain and install the following optional software products:
http://www.compaq.com/fortran |
http://www.compaq.com/hpc/software/kap.html |
% kf90 -fkapargs='-lc=blas' for_cal.f90 -lcxml |
For More Information:
During the earlier stages of program development, you can use incremental compilation with minimal optimization. For example:
% f90 -c -O1 sub2.f90
% f90 -c -O1 sub3.f90
% f90 -o main.out -g -O0 main.f90 sub2.o sub3.o
During the later stages of program development, you should specify multiple source files together and use an optimization level of at least -o4 on the f90 command line to allow more interprocedure optimizations to occur. For instance, the following command compiles all three source files together using the default level of optimization ( -o4 ):
% f90 -o main.out main.f90 sub2.f90 sub3.f90 |
Compiling multiple source files together lets the compiler examine more code for possible optimizations, which allows more interprocedure optimizations to occur.
For very large programs, compiling all source files together may not be practical. In such instances, consider compiling source files containing related routines together using multiple f90 commands, rather than compiling source files individually.
Table 5-1 shows f90 options that can improve performance. Most of these options do not affect the accuracy of the results; a few improve run-time performance but can slightly change some numeric results.
Compaq Fortran performs certain optimizations unless you specify the appropriate f90 command options. Additional optimizations can be enabled or disabled using f90 command options.
Table 5-1 lists the f90 options that can directly improve run-time performance.
Option Names | Description | For More Information |
---|---|---|
-align keyword | Controls whether padding bytes are added between data items within common blocks, derived-type data, and Compaq Fortran record structures to make the data items naturally aligned. | Section 5.4 |
-architecture keyword | Determines the type of Alpha architecture code instructions to be generated for the program unit being compiled. All Alpha processors implement a core set of instructions; certain processor versions include additional instruction extensions. | Section 3.5 |
-cord and -feedback file | Uses a feedback file created during a previous compilation by specifying the -gen_feedback option. These options use the feedback file to improve run-time performance, optionally using cord to rearrange procedures. | Section 5.3.5 |
-fast | Sets several performance-related options, including -align dcommons . | See description of each option |
-fp_reorder | Allows the compiler to reorder code based on algebraic identities to improve performance, enabling certain optimizations. The numeric results can be slightly different from the default ( -no_fp_reorder ) because of the way intermediate results are rounded. This slight difference in numeric results is acceptable to most programs. | Section 5.9.7 |
-gen_feedback | Requests generated code that allows accurate feedback information for subsequent use of the -feedback file option (optionally with cord ). Using -gen_feedback changes the default optimization level from -o4 to -o0 . | Section 5.3.5 |
-hpf num and related options (TU*X ONLY) | Specifies that the code generated for this program will allow parallel execution on multiple processors | Section 3.50 |
-inline all | Inlines every call that can possibly be inlined while generating correct code. Certain recursive routines are not inlined to prevent infinite loops. | Section 5.9.3 |
-inline speed | Inlines procedures that will improve run-time performance with a likely significant increase in program size. | Section 5.9.3 |
-inline size | Inlines procedures that will improve run-time performance without a significant increase in program size. This type of inlining occurs at optimization level -o4 and -o5 . | Section 5.9.3 |
-math_library fast | Requests the use of certain math library routines (used by intrinsic functions) that provide faster speed. Using this option causes a slight loss of accuracy and provides less reliable arithmetic exception checking to get significant performance improvements in those functions. | Section 3.61 |
-mp (TU*X ONLY) | Enables parallel processing using directed decomposition (directives inserted in source code). This can improve the performance of certain programs running on shared memory multiprocessor systems | Section 3.64 |
-o n ( -o0 to -o5 ) | Controls the optimization level and thus the types of optimization performed. The default optimization level is -o4 , unless you specify -g2 , -g , or -gen_feedback , which changes the default to -o0 (no optimizations). Use -o5 to activate loop transformation optimizations. | Section 5.8 |
-om (TU*X ONLY) | Used with the -non_shared option to request certain code optimizations after linking, including nop (No Operation) removal, .lita removal, and reallocation of common symbols. This option also positions the global pointer register so the maximum addresses fall in the global-pointer window. | Section 3.73 |
-omp (TU*X ONLY) | Enables parallel processing using directed decomposition (directives inserted in source code). This can improve the performance of certain programs running on shared memory multiprocessor systems | Section 3.74 |
-p , -p1 | Requests profiling information, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance. | Section 5.3 |
-pg | Requests profiling information for the gprof tool, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance. | Section 5.3 |
-pipeline | Activates the software pipelining optimization (a subset of -o4 ). | Section 3.76 |
-speculate keyword (TU*X ONLY) | Enables the speculative execution optimization, a form of instruction scheduling for conditional expressions. | Section 3.84 |
-transform_loops | Activates a group of loop transformation optimizations (a subset of -o5 ). | Section 3.89 |
-tune keyword | Specifies the target processor generation (chip) architecture on which the program will be run, allowing the optimizer to make decisions about instruction tuning optimizations needed to create the most efficient code. Keywords allow specifying one particular Alpha processor generation type, multiple processor generation types, or the processor generation type currently in use during compilation. Regardless of the setting of -tune keyword , the generated code will run correctly on all implementations of the Alpha architecture. | Section 5.9.4 |
-unroll num | Specifies the number of times a loop is unrolled ( num) when specified with optimization level -o3 or higher. If you omit -unroll num , the optimizer determines how many times loops are unrolled. | Section 5.8.4.1 |
Table 5-2 lists options that can slow program performance. Some applications that require floating-point exception handling or rounding might need to use the -fpe n and -fprm dynamic options. Other applications might need to use the -assume dummy_aliases or -vms options for compatibility reasons. The remaining options listed in Table 5-2 are primarily for troubleshooting or debugging purposes.
Option Names | Description | For More Information |
---|---|---|
-assume dummy_aliases | Forces the compiler to assume that dummy (formal) arguments to procedures share memory locations with other dummy arguments or with variables shared through use association, host association, or common block use. These program semantics slow performance, so you should specify -assume dummy_aliases only for the called subprograms that depend on such aliases. The use of dummy aliases violates the FORTRAN-77 and Fortran 95/90 standards but occurs in some older programs. | Section 5.9.8 |
-c | If you use -c when compiling multiple source files, also specify -o output so that the source files are compiled together into one object file. Separate compilations (multiple f90 commands, or -c without the -o output option) prevent certain interprocedure optimizations. | Section 2.1.6 |
-check bounds | Generates extra code for array bounds checking at run time. | Section 3.23 |
-check omp_bindings (TU*X ONLY) | Provides run-time checking to enforce the binding rules for OpenMP Fortran API (parallel processing) compiler directives inserted in source code. | Section 3.26 |
-check overflow | Generates extra code to check integer calculations for arithmetic overflow at run time. Once the program is debugged, omit this option to reduce executable program size and slightly improve run-time performance. | Section 3.28 |
-fpe n values greater than -fpe0 | Using -fpe1 (TU*X ONLY) , -fpe2 (TU*X ONLY) , -fpe3 , or -fpe4 (TU*X ONLY) (or using the for_set_fpe routine to set equivalent exception handling) slows program execution. For programs that specify -fpe3 or -fpe4 (TU*X ONLY) , the impact on run-time performance can be significant. | Section 3.44 |
-fprm dynamic (TU*X ONLY) | Certain rounding modes and changing the rounding mode can slow program execution slightly. | Section 3.46 |
-g , -g2 , -g3 | Generates extra symbol table information in the object file. Specifying -g or -g2 also reduces the default level of optimization to -o0 . | Section 3.48 |
-inline none , -inline manual | Prevents the inlining of all procedures (except statement functions). | Section 5.9.3 |
-o0 , -o1 , -o2 , or -o3 | Minimizes the optimization level (and types of optimizations). Use during the early stages of program development or when you will use the debugger. | Section 3.72 and Section 5.8 |
-synchronous_exceptions | Generates extra code to associate an arithmetic exception with the instruction that causes it, slowing efficient instruction execution. Use this option only when troubleshooting, such as when identifying the source of an exception. | Section 3.86 |
-vms | Controls certain VMS-related run-time defaults, including alignment. If you specify the -vms option, you may need to also specify the -align records option to obtain optimal run-time performance. | Section 3.98 |
Certain shell commands and system tuning can improve run-time performance:
# myprog > results.lis
# more results.lis
Use the time command to provide information about program performance.
Run program timings when other users are not active. Your timing results can be affected by one or more CPU-intensive processes also running while doing your timings.
Try to run the program under the same conditions each time to provide the most accurate results, especially when comparing execution times of a previous version of the same program. Use the same CPU system (model, amount of memory, version of the operating system, and so on) if possible.
If you do need to change systems, you should measure the time using the same version of the program on both systems, so you know each system's effect on your timings.
For programs that run for less than a few seconds, run several timings to ensure that the results are not misleading. Overhead functions like loading shared libraries might influence short timings considerably.
Using the form of the time command that specifies the name of the executable program provides the following:
In the following example timings, the sample program being timed displays the following line:
Average of all the numbers is: 4368488960.000000 |
Using the Bourne shell, the following program timing reports that the program uses 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use) and 2.46 seconds of elapsed time:
$ time a.out
Average of all the numbers is:    4368488960.000000
real    0m2.46s
user    0m0.61s
sys     0m0.58s
Using the C shell, the following program timing reports 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use), about 4 seconds (0:04) of elapsed time, the use of 28% of available CPU time, and other information:
% time a.out
Average of all the numbers is:    4368488960.000000
0.61u 0.58s 0:04 28% 78+424k 9+5io 0pf+0w
Using the bash shell (L*X ONLY), the following program timing reports that the program uses 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use) and 2.46 seconds of elapsed time:
[user@system user]$ time ./a.out
Average of all the numbers is:    4368488960.000000
elapsed    0m2.46s
user       0m0.61s
sys        0m0.58s
Timings that show a large amount of system time may indicate a lot of time spent doing I/O, which might be worth investigating.
If your program displays a lot of text, you can redirect the output from the program on the time command line. (See Section 5.1.3.) Redirecting output from the program will change the times reported because of reduced screen I/O.
For more information, see time(1).
In addition to the time command, you might consider modifying the program to call routines within the program to measure execution time. For example:
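For instance, a minimal sketch that brackets the code being timed with the standard CPU_TIME and SYSTEM_CLOCK intrinsics (the loop shown is only a placeholder for your own computation):

PROGRAM TIMEIT
  REAL T1, T2, S
  INTEGER C1, C2, RATE, I
  CALL SYSTEM_CLOCK (C1, RATE)
  CALL CPU_TIME (T1)
  S = 0.0
  DO I = 1, 10000000            ! placeholder for the work being timed
     S = S + SQRT(REAL(I))
  END DO
  CALL CPU_TIME (T2)
  CALL SYSTEM_CLOCK (C2)
  PRINT *, 'Result:          ', S
  PRINT *, 'CPU seconds:     ', T2 - T1
  PRINT *, 'Elapsed seconds: ', REAL(C2 - C1) / REAL(RATE)
END PROGRAM TIMEIT

Printing the result prevents the optimizer from eliminating the timed computation entirely (as noted later in this chapter, computations whose results are never used may be removed).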
To generate profiling information, use the f90 compiler and the prof , gprof , and pixie (TU*X ONLY) tools.
Profiling identifies areas of code where significant program execution time is spent. Along with the f90 command, use the prof and pixie (TU*X ONLY) tools to generate the following profile information:
Once you have determined those sections of code where most of the program execution time is spent, examine these sections for coding efficiency. Suggested guidelines for improving source code efficiency are provided in Section 5.7.
Along with profiling, you can consider generating a listing file with annotations of optimizations by specifying the -V and -annotations options.
5.3.1 Program Counter Sampling (prof)
To obtain program counter sampling data, perform the following steps:
% f90 -p -O3 -o profsample profsample.f90 |
% f90 -c -O3 profsample.f90
% f90 -p -O3 -o profsample profsample.o
% profsample |
% prof profsample mon.out |
You can limit the report created by prof by using prof command options, such as -only , -exclude , or -quit .
For example, if you only want reports on procedures calc_max and calc_min , you could use the following command line to read the profile data file named mon.out :
% prof -only calc_max -only calc_min profsample |
The time spent in particular areas of code is reported by prof in the form of a percentage of the total CPU time spent by the program. To reduce the size of the report, you can either:
When you use the -only or -exclude options, the percentages are still based on all procedures of the application. To have prof calculate percentages based on only those procedures included in the report, use the -Only and -Exclude options instead (note the uppercase initial letter in the option name).
You can use the -quit option to reduce the amount of information reported. For example, the following command prints information on only the five most time-consuming procedures:
% prof -quit 5 profsample |
The following command limits information only to those procedures using 10% or more of the total execution time:
% prof -quit 10% profsample |
For More Information:
To obtain call graph information, use the gprof tool. Perform the following steps:
% f90 -pg -O3 -o profsample profsample.f90
% f90 -pg -c -O3 profsample.f90
% f90 -pg -O3 -o profsample profsample.o
% profsample |
% gprof profsample gmon.out |
The output produced by gprof includes:
For More Information:
To obtain basic block counting information, perform the following steps:
% f90 -O3 -o profsample profsample.f90 |
% atom -tool pixie profsample
% profsample.pixie |
% prof -pixie profsample |
To create multiple profile data files, run the program multiple times.
For More Information:
You use the same files created by the pixie command (see Section 5.3.3) for basic block counting to estimate the number of CPU cycles used to execute each source file line.
To view a report of the number of CPU cycles estimated for each source file line, use the following options with the prof command:
Depending on the level of optimization chosen, certain source lines might be optimized away.
The CPU cycle use estimates are based primarily on the instruction type and its operands and do not include memory effects such as cache misses or translation buffer fills.
For example, the following command sequence uses:
% f90 -o profsample profsample.f90
% atom -tool pixie profsample
% profsample.pixie
% prof -pixie -heavy -only calc_max profsample
You can create a feedback file by using a series of commands. Once created, you can specify a feedback file in a subsequent compilation with the f90 command option -feedback . You can also request that cord use the feedback file to rearrange procedures, by specifying the -cord option on the f90 command line.
To create the feedback file, complete these steps:
% f90 -o profsample -gen_feedback profsample.f90 |
% pixie profsample |
% profsample.pixie |
% prof -pixie -feedback profsample.feedback profsample |
You can use the feedback file as input to the f90 compiler:
% f90 -feedback profsample.feedback -o profsample profsample.f90 |
The feedback file provides the compiler with actual execution information, which the compiler can use to improve such optimizations as inlining function calls.
Specify the desired optimization level ( -o n option) for the f90 command along with the -feedback name option (in this example, the default -o4 is used).
You can use the feedback file as input to the f90 compiler and cord , as follows:
% f90 -cord -feedback profsample.feedback -o profsample profsample.f90 |
The -cord option invokes cord , which reorders the procedures in an executable program to improve program execution, using the information in the specified feedback file. Specify the desired optimization level ( -o n option) for the f90 command along with the -feedback name option (in this example, -o4 ).
5.3.6 Atom Toolkit
(TU*X ONLY) The Atom toolkit includes a programmable instrumentation tool and several prepackaged tools. The prepackaged tools include:
To invoke atom tools, use the following general command syntax:
% atom -tool tool-name ...
Atom does not work on programs built with the -om option.
For More Information:
For optimal performance on Alpha systems, make sure your data is aligned naturally.
A natural boundary is a memory address that is a multiple of the data item's size (data type sizes are described in Table 9-1). For example, a REAL (KIND=8) data item aligned on natural boundaries has an address that is a multiple of 8. An array is aligned on natural boundaries if all of its elements are.
All data items whose starting address is on a natural boundary are naturally aligned. Data not aligned on a natural boundary is called unaligned data.
Although the Compaq Fortran compiler naturally aligns individual data items when it can, certain Compaq Fortran statements (such as EQUIVALENCE) can cause data items to become unaligned (see Section 5.4.1).
Although you can use the f90 command -align keyword options to ensure naturally aligned data, you should check and consider reordering the data declarations of data items within common blocks and structures. Within each common block, derived type, or record structure, carefully specify the order and sizes of data declarations to ensure naturally aligned data. Start with the largest numeric items first, followed by smaller numeric items, and then nonnumeric (character) data.
5.4.1 Causes of Unaligned Data and Ensuring Natural Alignment
Common blocks (COMMON statement), derived-type data, and Compaq Fortran 77 record structures (RECORD statement) usually contain multiple items within the context of the larger structure.
The following declaration statements can force data to be unaligned:
To avoid unaligned data in a common block, derived-type data, or record structure (extension), use one or both of the following:
Other possible causes of unaligned data include unaligned actual arguments and arrays that contain a derived-type structure or Compaq Fortran record structure.
When actual arguments from outside the program unit are not naturally aligned, unaligned data access will occur. Compaq Fortran assumes all passed arguments are naturally aligned and has no information at compile time about data that will be introduced by actual arguments during program execution.
For arrays where each array element contains a derived-type structure or Compaq Fortran record structure, the size of the array elements may cause some elements (but not the first) to start on an unaligned boundary.
Even if the data items are naturally aligned within a derived-type structure (one without the SEQUENCE statement) or a record structure, the size of an array element might require the use of f90 -align options to supply the padding needed to keep some array elements from starting on unaligned boundaries.
If you specify -align norecords or specify -vms without -align records , no padding bytes are added between array elements. If array elements each contain a derived-type structure with the SEQUENCE statement, array elements are packed without padding bytes regardless of the f90 command options specified. In this case, some elements will be unaligned.
When the -align records option is in effect, the number of padding bytes added by the compiler for each array element depends on the size of the largest data item within the structure. The compiler makes the size of each array element an exact multiple of the size of the largest data item in the derived-type structure (one without the SEQUENCE statement) or record structure by adding the appropriate number of padding bytes.
For instance, if a structure contains an 8-byte floating-point number followed by a 3-byte character variable, each element contains five bytes of padding (the 11 bytes of data are padded to 16, an exact multiple of 8). However, if the structure contains one 4-byte floating-point number, one 4-byte integer, followed by a 3-byte character variable, each element would contain one byte of padding (the 11 bytes of data are padded to 12, an exact multiple of 4).
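As a sketch of the first case (hypothetical names; assumes the -align records behavior described above is in effect):

TYPE PADDED_PART
  REAL (KIND=8)    VALUE      ! 8 bytes, the largest item in the structure
  CHARACTER(LEN=3) TAG        ! 3 bytes
END TYPE PADDED_PART          ! 11 bytes of data, padded to 16 per element
TYPE (PADDED_PART) LIST(100)  ! five padding bytes in each array element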
During compilation, the Compaq Fortran compiler naturally aligns as much data as possible. Exceptions that can result in unaligned data are described in Section 5.4.1.
Because unaligned data can slow run-time performance, it is worthwhile to:
There are two ways unaligned data might be reported:
Unaligned access pid=24821 <a.out> va=140000154, pc=3ff80805d60, ra=1200017bc |
For new programs or when the source declarations of an existing program can be easily modified, plan the order of your data declarations carefully to ensure the data items in a common block, derived-type data, record structure, or data items made equivalent by an EQUIVALENCE statement will be naturally aligned.
Use the following rules to prevent unaligned data:
When declaring data, consider using explicit length declarations, such as specifying a KIND parameter. For example, specify INTEGER(KIND=4) (or INTEGER(4)) rather than INTEGER. If you do use a default length (such as INTEGER, LOGICAL, COMPLEX, and REAL), be aware that the compiler options -integer_size and -real_size can change the size of an individual field's data declaration size and thus can alter the data alignment of a carefully planned order of data declarations.
Using the suggested data declaration guidelines minimizes the need to
use the
-align keyword
options to add padding bytes to ensure naturally aligned data. In cases
where the
-align keyword
options are still needed, using the suggested data declaration
guidelines can minimize the number of padding bytes added by the
compiler.
5.4.3.1 Arranging Data Items in Common Blocks
The order of data items in a COMMON statement determines the order in which the data items are stored. Consider the following declaration of a common block named X:
LOGICAL (KIND=2) FLAG
INTEGER IARRY_I(3)
CHARACTER(LEN=5) NAME_CH
COMMON /X/ FLAG, IARRY_I, NAME_CH
As shown in Figure 5-1, if you omit the appropriate f90 command options, the common block will contain unaligned data items beginning at the first array element of IARRY_I.
Figure 5-1 Common Block with Unaligned Data
As shown in Figure 5-2, if you compile the program units that use the common block with the -align commons options, data items will be naturally aligned.
Figure 5-2 Common Block with Naturally Aligned Data
Because the common block X contains data items whose size is 32 bits or smaller, specify -align commons . If the common block contains data items whose size might be larger than 32 bits (such as REAL (KIND=8) data), use -align dcommons .
If you can easily modify the source files that use the common block data, define the numeric variables in the COMMON statement in descending order of size and place the character variable last. This provides more portability, ensures natural alignment without padding, and does not require the f90 command options -align commons or -align dcommons :
LOGICAL (KIND=2) FLAG
INTEGER IARRY_I(3)
CHARACTER(LEN=5) NAME_CH
COMMON /X/ IARRY_I, FLAG, NAME_CH
As shown in Figure 5-3, if you arrange the order of variables from largest to smallest size and place character data last, the data items will be naturally aligned.
Figure 5-3 Common Block with Naturally Aligned Reordered Data
When modifying or creating all source files that use common block data,
consider placing the common block data declarations in a module so the
declarations are consistent. If the common block is not needed for
compatibility (such as file storage or Compaq Fortran 77 use), you can
place the data declarations in a module without using a common block.
5.4.3.2 Arranging Data Items in Derived-Type Data
Like common blocks, derived-type data may contain multiple data items (members).
Data item components within derived-type data will be naturally aligned on up to 64-bit boundaries, with certain exceptions related to the use of the SEQUENCE statement and f90 options. See Section 5.4.4 for information about these exceptions.
Compaq Fortran stores a derived data type as a linear sequence of values, as follows:
Consider the following declaration of array CATALOG_SPRING of derived-type PART_DT:
MODULE DATA_DEFS
  TYPE PART_DT
    INTEGER IDENTIFIER
    REAL WEIGHT
    CHARACTER(LEN=15) DESCRIPTION
  END TYPE PART_DT
  TYPE (PART_DT) CATALOG_SPRING(30)
   .
   .
   .
END MODULE DATA_DEFS
As shown in Figure 5-4, the largest numeric data items are defined first and the character data type is defined last. There are no padding characters between data items and all items are naturally aligned. The trailing padding byte is needed because CATALOG_SPRING is an array; it is inserted by the compiler when the -align records option is in effect.
Figure 5-4 Derived-Type Naturally Aligned Data (in CATALOG_SPRING : ( ,))
Compaq Fortran supports Compaq Fortran 77 record structures, which use the RECORD statement and optionally the STRUCTURE statement; these statements are extensions to the FORTRAN-77 and Fortran 95/90 standards. The order of data items in a STRUCTURE statement determines the order in which the data items are stored.
Compaq Fortran stores a record in memory as a linear sequence of values, with the record's first element in the first storage location and its last element in the last storage location. Unless you specify -align norecords , padding bytes are added if needed to ensure data fields are naturally aligned.
The following example contains a structure declaration, a RECORD statement, and diagrams of the resulting records as they are stored in memory:
STRUCTURE /STRA/
  CHARACTER*1 CHR
  INTEGER*4 INT
END STRUCTURE
   .
   .
   .
RECORD /STRA/ REC
Figure 5-5 shows the memory diagram of record REC for naturally aligned records.
Figure 5-5 Memory Diagram of REC for Naturally Aligned Records
The following options control whether the Compaq Fortran compiler adds padding (when needed) to naturally align multiple data items in common blocks, derived-type data, and Compaq Fortran record structures:
The default behavior is that multiple data items in derived-type data and record structures will be naturally aligned; data items in common blocks will not ( -align records with -align nocommons ). In derived-type data, using the SEQUENCE statement prevents -align records from adding needed padding bytes to naturally align data items.
If your command line includes the
-std
,
-std90
, or
-std95
options, then the compiler ignores
-align dcommons
and
-align sequence
. See Section 3.85.
5.5 Using Arrays Efficiently
The following sections discuss:
On Alpha systems, many of the array access efficiency techniques described in this section are applied automatically by the Compaq Fortran loop transformation optimizations (see Section 5.8.7) or by the Compaq KAP Fortran/OpenMP for Tru64 UNIX Systems performance preprocessor (described in Section 5.1.1).
Several aspects of array use can improve run-time performance:
A = A + 1. |
REAL :: A(100,100)
A = 0.0
A = A + 1.       ! Increment all elements of A by 1
   .
   .
   .
WRITE (8) A      ! Fast whole array use
TYPE X
  INTEGER A(5)
END TYPE X
   .
   .
   .
TYPE (X) Z
WRITE (8) Z%A    ! Fast array structure component use
INTEGER X(3,5), Y(3,5), I, J
Y = 0
DO I=1,3                   ! I outer loop varies slowest
  DO J=1,5                 ! J inner loop varies fastest
    X (I,J) = Y(I,J) + 1   ! Inefficient row-major storage order
  END DO                   ! (rightmost subscript varies fastest)
END DO
   .
   .
   .
END PROGRAM
INTEGER X(3,5), Y(3,5), I, J
Y = 0
DO J=1,5                   ! J outer loop varies slowest
  DO I=1,3                 ! I inner loop varies fastest
    X (I,J) = Y(I,J) + 1   ! Efficient column-major storage order
  END DO                   ! (leftmost subscript varies fastest)
END DO
   .
   .
   .
END PROGRAM
INTEGER X(5,3), Y(5,3), I, J
Y = 0
DO I=1,3                   ! I outer loop varies slowest
  DO J=1,5                 ! J inner loop varies fastest
    X (J,I) = Y(J,I) + 1   ! Efficient column-major storage order
  END DO                   ! (leftmost subscript varies fastest)
END DO
   .
   .
   .
END PROGRAM
REAL A (512,100)
DO I = 2,511
  DO J = 2,99
    A(I,J) = (A(I+1,J-1) + A(I-1,J+1)) * 0.5
  END DO
END DO
In Fortran 95/90, there are two general types of array arguments:
When passing arrays as arguments, either the starting (base) address of the array or the address of an array descriptor is passed:
Passing an assumed-shape array or array pointer to an explicit-shape array can slow run-time performance. This is because the compiler needs to create an array temporary for the entire array. The array temporary is created because the passed array may not be contiguous and the receiving (explicit-shape) array requires a contiguous array. When an array temporary is created, the size of the passed array determines whether the impact on slowing run-time performance is slight or severe.
Table 5-3 summarizes what happens with the various combinations of array types. The amount of run-time performance inefficiency depends on the size of the array.
Input (Actual) Argument Array Types | Dummy Argument: Explicit-Shape Arrays | Dummy Argument: Deferred-Shape and Assumed-Shape Arrays |
---|---|---|
Explicit-shape arrays | Very efficient. Does not use an array temporary. Does not pass an array descriptor. Interface block optional. | Efficient. Only allowed for assumed-shape arrays (not deferred-shape arrays). Does not use an array temporary. Passes an array descriptor. Requires an interface block. |
Deferred-shape and assumed-shape arrays | When passing an allocatable array, very efficient. Does not use an array temporary. Does not pass an array descriptor. Interface block optional. When not passing an allocatable array, not efficient; use allocatable arrays whenever possible. Uses an array temporary. Does not pass an array descriptor. Interface block optional. | Efficient. Requires an assumed-shape or array pointer as dummy argument. Does not use an array temporary. Passes an array descriptor. Requires an interface block. |
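As an illustration of the efficient combinations in Table 5-3, the following sketch (hypothetical names) passes an allocatable array to an assumed-shape dummy argument through an explicit interface; only an array descriptor is passed and no array temporary is created:

MODULE SCALE_MOD
CONTAINS
  SUBROUTINE SCALE_BY_TWO (X)
    REAL, INTENT(INOUT) :: X(:,:)    ! assumed-shape dummy argument
    X = X * 2.0
  END SUBROUTINE SCALE_BY_TWO
END MODULE SCALE_MOD

PROGRAM DEMO
  USE SCALE_MOD
  REAL, ALLOCATABLE :: A(:,:)
  ALLOCATE (A(100,100))
  A = 1.0
  CALL SCALE_BY_TWO (A)              ! descriptor passed; no copy made
END PROGRAM DEMO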
Improving overall I/O performance can minimize both device I/O and actual CPU time. The techniques listed in this section can greatly improve performance in many applications.
A bottleneck limits the maximum speed of execution by being the slowest process in an executing program. In some programs, I/O is the bottleneck that prevents an improvement in run-time performance. The key to relieving I/O bottlenecks is to reduce the actual amount of CPU and I/O device time involved in I/O.
Bottlenecks can be caused by one or more of the following:
Improved coding practices can minimize actual device I/O, as well as the actual CPU time.
Compaq offers software solutions to system-wide problems like
minimizing device I/O delays (see Section 5.1.1).
5.6.1 Use Unformatted Files Instead of Formatted Files
Use unformatted files whenever possible. Unformatted I/O of numeric data is more efficient and more precise than formatted I/O. Native unformatted data does not need to be modified when transferred and will take up less space on an external file.
Conversely, when writing data to formatted files, the data must be converted to character strings for output, less data can be transferred in a single operation, and precision may be lost if the formatted data is later read back into binary form.
To write the array A(25,25) in the following statements, S1 is more efficient than S2:
S1      WRITE (7) A

S2      WRITE (7,100) A
100     FORMAT (25(' ',25F5.2))
Although formatted data files are more easily ported to other systems,
Compaq Fortran can convert unformatted data in several formats (see
Chapter 10).
5.6.2 Write Whole Arrays or Strings
The general guidelines about array use discussed in Section 5.5 also apply to reading or writing an array with an I/O statement.
To eliminate unnecessary overhead, write whole arrays or strings at one
time rather than individual elements at multiple times. Each item in an
I/O list generates its own calling sequence. This processing overhead
becomes most significant in implied-DO loops. When accessing whole
arrays, use the array name (Fortran 95/90 array syntax) instead of
using implied-DO loops.
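For example, the following sketch contrasts element-at-a-time output with a single whole-array transfer (the two forms produce different record layouts; the point is the per-item call overhead):

REAL A(100,100)
INTEGER I, J
   .
   .
   .
! Slower: one I/O statement (and one record) per element
DO J = 1, 100
   DO I = 1, 100
      WRITE (8) A(I,J)
   END DO
END DO
! Faster: one statement transfers the whole array
WRITE (8) A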
5.6.3 Write Array Data in the Natural Storage Order
Use the natural ascending storage order whenever possible. This is column-major order, with the leftmost subscript varying fastest and striding by 1. (See Section 5.5.1, Accessing Arrays Efficiently.) If a program must read or write data in any other order, efficient block moves are inhibited.
If the whole array is not being written, natural storage order is the best order possible.
If you must use an unnatural storage order, in certain
cases it might be more efficient to transfer the data to memory and
reorder the data before performing the I/O operation.
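For example, if each output record must hold one row of a column-major array, gathering the row into a contiguous buffer and writing it with one statement is usually faster than writing the strided elements individually (a sketch with hypothetical names and sizes):

REAL A(1000,100), ROWBUF(100)
INTEGER I
   .
   .
   .
DO I = 1, 1000
   ROWBUF = A(I,:)       ! reorder in memory: one row becomes contiguous
   WRITE (9) ROWBUF      ! one record per row
END DO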
5.6.4 Use Memory for Intermediate Results
Performance can improve by storing intermediate results in memory rather than storing them in a file on a peripheral device. One situation that may not benefit from using intermediate storage is when there is a disproportionately large amount of data in relation to physical memory on your system. Excessive page faults can dramatically impede virtual memory performance.
If you are primarily concerned with the CPU performance of the system,
consider using a memory file system (mfs) virtual disk to hold any
files your code reads or writes (see mfs(1)).
5.6.5 Enable Implied-DO Loop Collapsing
DO loop collapsing reduces a major overhead in I/O processing. Normally, each element in an I/O list generates a separate call to the Compaq Fortran RTL. The processing overhead of these calls can be most significant in implied-DO loops.
Compaq Fortran reduces the number of calls in implied-DO loops by replacing up to seven nested implied-DO loops with a single call to an optimized run-time library I/O routine. The routine can transmit many I/O elements at once.
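For example, the nested implied-DO loops in the following sketch are candidates for collapsing into a single call to the run-time library; when the entire array is wanted, the whole-array form is simpler still:

INTEGER A(10,20), I, J
   .
   .
   .
WRITE (8) ((A(I,J), I=1,10), J=1,20)    ! nested implied-DO loops
WRITE (8) A                             ! whole-array equivalent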
Loop collapsing can occur in formatted and unformatted I/O, but only if certain conditions are met:
Variable format expressions (a Compaq Fortran extension) are almost as flexible as run-time formatting, but they are more efficient because the compiler can eliminate run-time parsing of the I/O format. Only a small amount of processing and the actual data transfer are required during run time.
On the other hand, run-time formatting can impair performance significantly. For example, in the following statements, S1 is more efficient than S2 because the formatting is done once at compile time, not at run time:
S1 WRITE (6,400) (A(I), I=1,N) 400 FORMAT (1X, <N> F5.2) . . . S2 WRITE (CHFMT,500) '(1X,',N,'F5.2)' 500 FORMAT (A,I3,A) WRITE (6,FMT=CHFMT) (A(I), I=1,N) |
Records being read or written are transferred between the user's program buffers and one or more disk block I/O buffers, which are established when the file is opened by the Compaq Fortran RTL. Unless very large records are being read or written, multiple logical records can reside in the disk block I/O buffer when it is written to disk or read from disk, minimizing physical disk I/O.
You can specify the size of the disk block physical I/O buffer by using the OPEN statement BLOCKSIZE specifier; the default size can be obtained from fstat(2). If you omit the BLOCKSIZE specifier in the OPEN statement, it is set for optimal I/O use with the type of device the file resides on (with the exception of network access).
The OPEN statement BUFFERCOUNT specifier specifies the number of I/O buffers. The default for BUFFERCOUNT is 1. Any experiments to improve I/O performance should increase the BUFFERCOUNT value and not the BLOCKSIZE value, to increase the amount of data read by each disk I/O.
If the OPEN statement has BLOCKSIZE and BUFFERCOUNT specifiers, then the internal buffer size in bytes is the product of these specifiers. If the OPEN statement does not have these specifiers, then the default internal buffer size is 8192 bytes. This internal buffer will grow to hold the largest single record, but will never shrink.
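For example, a sketch of an OPEN statement that sets both specifiers (the unit number, file name, and values shown are examples only):

OPEN (UNIT=11, FILE='bigdata.dat', FORM='UNFORMATTED', STATUS='OLD', &
      BLOCKSIZE=8192, BUFFERCOUNT=4)   ! internal buffer: 8192*4 = 32768 bytes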
The default for the Fortran run-time system is to use unbuffered disk writes. That is, by default, records are written to disk immediately as each record is written instead of accumulating in the buffer to be written to disk later.
To enable buffered writes (that is, to allow the disk device to fill the internal buffer before the buffer is written to disk), use one of the following:
The OPEN statement BUFFERED specifier takes precedence over the -assume buffered_io option. If neither one is set (which is the default), the FORT_BUFFERED environment variable is tested at run time.
The OPEN statement BUFFERED specifier applies to a specific logical unit. In contrast, the -assume [no]buffered_io option and the FORT_BUFFERED environment variable apply to all Fortran units.
Using buffered writes usually makes disk I/O more efficient by writing larger blocks of data to the disk less often. However, a system failure when using buffered writes can cause records to be lost, since they might not yet have been written to disk. (Such records would have been written to disk with the default unbuffered writes.)
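For example, a sketch of enabling buffered writes on a single unit with the OPEN statement BUFFERED specifier (the value 'YES' and the file name are assumptions for illustration):

OPEN (UNIT=12, FILE='results.dat', FORM='UNFORMATTED', STATUS='NEW', &
      BUFFERED='YES')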
When performing I/O across a network, be aware that the size of the block of network data sent across the network can impact application efficiency. When reading network data, follow the same advice for efficient disk reads, by increasing the BUFFERCOUNT. When writing data through the network, several items should be considered:
When writing records, be aware that I/O records are written to unified buffer cache (UBC) system buffers. To request that I/O records be written from program buffers to the UBC system buffers, use the flush library routine (see flush(3f) and Chapter 12). Be aware that calling flush also discards read-ahead data in the user buffer.
To request that UBC system buffers be written to disk, use the fsync library routine (see fsync(3f) and Chapter 12).
When UBC buffers are written to disk depends on UBC characteristics on the system, such as the vm-ubcbuffers attribute (see the Compaq Tru64 UNIX System Tuning and Performance guide).
For More Information:
The sum of the record length (RECL specifier in an OPEN statement) and its overhead should be a multiple or divisor of the blocksize, which is device specific. For example, if the blocksize is 8192, then RECL might be 24576 (three times the blocksize) or 1024 (one-eighth of the blocksize).
The RECL value should fill blocks as close to capacity as possible (but not over capacity). Such values allow efficient moves, with each operation moving as much data as possible; the least amount of space in the block is wasted. Avoid using values larger than the block capacity, because they create very inefficient moves for the excess data only slightly filling a block (allocating extra memory for the buffer and writing partial blocks are inefficient).
The RECL value for formatted files is always in 1-byte units. For unformatted files, the RECL value is in 4-byte units, unless you specify the -assume byterecl option to request 1-byte units (see Section 3.7).
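For example, for a direct-access unformatted file whose records each hold 256 REAL (KIND=8) values (2048 bytes), the default 4-byte units give RECL=512; with -assume byterecl the same record would be RECL=2048. A sketch, with an example file name:

REAL (KIND=8) BUF(256)
OPEN (UNIT=13, FILE='direct.dat', FORM='UNFORMATTED', ACCESS='DIRECT', &
      RECL=512)                  ! 2048 bytes expressed in 4-byte units
WRITE (13, REC=1) BUF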
When porting unformatted data files from non-Compaq systems, see
Section 10.6.
5.6.9 Use the Optimal Record Type
Unless a certain record type is needed for portability reasons (see Section 7.4.3), choose the most efficient type, as follows:
Due to certain precautions that the Fortran run-time system takes to ensure the integrity of standard input, reads can be very slow when standard input is redirected from a file. For example, when you use a command such as myprogram.exe < myinput.dat , the data is read using the READ(*) or READ(5) statement, and performance is degraded. To avoid this problem, do one of the following:
OPEN(5, STATUS='OLD', FILE='myinput.dat') |
setenv FORT5 myinput.dat
setenv FOR_READ myinput.dat
To take advantage of these methods, be sure your program does not rely on sharing the standard input file.
Other source coding guidelines can be implemented to improve run-time performance.
The amount of improvement in run-time performance is related to the
number of times a statement is executed. For example, improving an
arithmetic expression executed within a loop many times has the
potential to improve performance, more than improving a similar
expression executed once outside a loop.
5.7.1 Avoid Small Integer and Small Logical Data Items
Avoid using integer or logical data less than 32 bits, because the smallest unit of efficient access on Alpha systems is 32 bits.
Accessing a 16-bit (or 8-bit) data type can result in a sequence of machine instructions to access the data, rather than a single, efficient machine instruction for a 32-bit data item.
To minimize data storage and memory cache misses with arrays, use
32-bit data rather than 64-bit data, unless you require the greater
numeric range of 8-byte integers or the greater range and precision of
double precision floating-point numbers.
5.7.2 Avoid Mixed Data Type Arithmetic Expressions
Avoid mixing integer and floating-point (REAL) data in the same computation. Expressing all numbers in a floating-point arithmetic expression (assignment statement) as floating-point values eliminates the need to convert data between fixed and floating-point formats. Expressing all numbers in an integer arithmetic expression as integer values also achieves this. This improves run-time performance.
For example, assuming that I and J are both INTEGER variables, expressing a constant number (2.) as an integer value (2) eliminates the need to convert the data:
Original Code: |
INTEGER I, J
I = J / 2. |
Efficient Code: |
INTEGER I, J
I = J / 2 |
For applications with numerous floating-point operations, consider using the -fp_reorder option (see Section 5.9.7) if a small difference in the result is acceptable.
You can use different sizes of the same general data type in an
expression with minimal or no effect on run-time performance. For
example, using REAL, DOUBLE PRECISION, and COMPLEX floating-point
numbers in the same floating-point arithmetic expression has minimal or
no effect on run-time performance.
5.7.3 Use Efficient Data Types
In cases where more than one data type can be used for a variable, consider selecting the data types based on the following hierarchy, listed from most to least efficient:
However, keep in mind that in an arithmetic expression, you should
avoid mixing integer and floating-point (REAL) data (see Section 5.7.2).
5.7.4 Avoid Using Slow Arithmetic Operators
Before you modify source code to avoid slow arithmetic operators, be aware that optimizations convert many slow arithmetic operators to faster arithmetic operators. For example, the compiler optimizes the expression H=J**2 to be H=J*J.
Consider also whether replacing a slow arithmetic operator with a faster arithmetic operator will change the accuracy of the results or impact the maintainability (readability) of the source code.
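For example, in a critical loop a repeated division by a loop-invariant value can often be replaced by multiplication with a precomputed reciprocal. This is only a sketch, and the rounding of the results can differ slightly:

REAL A(1000), B(1000), SCALE, RECIP
INTEGER I
   .
   .
   .
RECIP = 1.0/SCALE          ! one division, outside the loop
DO I = 1, 1000
   A(I) = B(I)*RECIP       ! instead of B(I)/SCALE on every iteration
END DO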
Replacing slow arithmetic operators with faster ones should be reserved for critical code areas. The following hierarchy lists the Compaq Fortran arithmetic operators, from fastest to slowest:
Avoid using EQUIVALENCE statements. EQUIVALENCE statements can:
Whenever the Compaq Fortran compiler has access to the use and definition of a subprogram during compilation, it may choose to inline the subprogram. Using statement functions and internal subprograms maximizes the number of subprogram references that will be inlined, especially when multiple source files are compiled together at optimization level -o4 or higher.
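For example, in the following sketch (hypothetical names) the function SCALE3 is an internal subprogram, so its definition is visible to the compiler and the reference inside the loop is a good candidate for inlining:

PROGRAM FILTER
  REAL X(1000), Y(1000)
  INTEGER I
  X = 1.0
  DO I = 1, 1000
     Y(I) = SCALE3(X(I))         ! reference that can be inlined
  END DO
  PRINT *, Y(1000)
CONTAINS
  REAL FUNCTION SCALE3 (V)       ! internal subprogram
    REAL, INTENT(IN) :: V
    SCALE3 = 3.0*V + 1.0
  END FUNCTION SCALE3
END PROGRAM FILTER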
For More Information:
Minimize the arithmetic operations and other operations in a DO loop whenever possible. Moving unnecessary operations outside the loop will improve performance (for example, when the intermediate nonvarying values within the loop are not needed).
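For example, a loop-invariant subexpression can be computed once before the loop rather than on every iteration (the optimizer often does this itself when it can prove the operands do not change; the sketch below shows the manual form):

! Recomputed on every iteration:
DO I = 1, N
   A(I) = B(I) * (P/Q + R)
END DO

! Computed once, outside the loop:
T = P/Q + R
DO I = 1, N
   A(I) = B(I) * T
END DO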
Compaq Fortran performs many optimizations by default. You do not have to recode your program to use them. However, understanding how optimizations work helps you remove any inhibitors to their successful function.
Generally, Compaq Fortran increases compile time in favor of decreasing run time. If an operation can be performed, eliminated, or simplified at compile time, Compaq Fortran does so, rather than have it done at run time. The time required to compile the program usually increases as more optimizations occur.
The program will likely execute faster when compiled at -o4 , but will require more compilation time than if you compile the program at a lower level of optimization.
The size of the object file varies with the optimizations requested; loop unrolling and procedure inlining in particular can increase object file size.
Table 5-4 lists the levels of Compaq Fortran optimization with different -o options. For example: -o0 specifies no selectable optimizations (some optimizations always occur); -o5 specifies all levels of optimizations, including loop transformation.
Optimization Type | -o0 | -o1 | -o2 | -o3 | -o4 | -o5 |
---|---|---|---|---|---|---|
Loop transformation | | | | | | X |
Software pipelining | | | | | X | X |
Automatic inlining | | | | | X | X |
Additional global optimizations | | | | X | X | X |
Global optimizations | | | X | X | X | X |
Local (minimal) optimizations | | X | X | X | X | X |
The default is -o4 (same as -o ). However, if -g2 , -g , or -gen_feedback is also specified, the default is -o0 (no optimizations).
In Table 5-4, the following terms are used to describe the levels of optimization:
The following optimizations occur at any optimization level ( -o0 through -o5 ):
SUM(A,B) = A+B
   .
   .
   .
Y = 3.14
X = SUM(Y,3.0)    ! With value propagation, becomes: X = 6.14
To enable local optimizations, use -o1 or a higher optimization level ( -o2 , -o3 , -o4 , or -o5 ).
To prevent local optimizations, specify the
-o0
option.
5.8.2.1 Common Subexpression Elimination
If the same subexpressions appear in more than one computation and the values do not change between computations, Compaq Fortran computes the result once and replaces the subexpressions with the result itself:
DIMENSION A(25,25), B(25,25)
A(I,J) = B(I,J)
Without optimization, these statements can be compiled as follows:
t1 = ((J-1)*25+(I-1))*4
t2 = ((J-1)*25+(I-1))*4
A(t1) = B(t2)
Variables t1 and t2 represent equivalent expressions. Compaq Fortran eliminates this redundancy by producing the following:
t = ((J-1)*25+(I-1))*4
A(t) = B(t)
Expansion of multiplication and division refers to bit shifts that allow faster multiplication and division while producing the same result. For example, the integer expression (I*17) can be calculated as I with a 4-bit shift plus the original value of I. This can be expressed using the Compaq Fortran ISHFT intrinsic function:
J1 = I*17
J2 = ISHFT(I,4) + I    ! equivalent expression for I*17
The optimizer uses machine code that, like the ISHFT intrinsic
function, shifts bits to expand multiplication and division by literals.
5.8.2.3 Compile-Time Operations
Compaq Fortran does as many operations as possible at compile time rather than at run time.
Compaq Fortran can perform many operations on constants (including PARAMETER constants):
PARAMETER (NN=27)
I = 2*NN+J             ! Becomes: I = 54 + J
REAL X, Y
X = 10 * Y             ! Becomes: X = 10.0 * Y
INTEGER I(10,10)
I(1,2) = I(4,5)        ! Compiled as a direct load and store
Algebraic Reassociation Optimizations
Compaq Fortran delays operations to see whether they have no effect or can be transformed to have no effect. If they have no effect, these operations are removed. A typical example involves unary minus and .NOT. operations:
X = -Y * -Z            ! Becomes: X = Y * Z
Compaq Fortran tracks the values assigned to variables and constants, including those from DATA statements, and traces them to every place they are used. Compaq Fortran uses the value itself when it is more efficient to do so.
When compiling subprograms, Compaq Fortran analyzes the program to ensure that propagation is safe if the subroutine is called more than once.
Value propagation frequently leads to more value propagation. Compaq Fortran can eliminate run-time operations, comparisons and branches, and whole statements.
In the following example, constants are propagated, eliminating multiple operations from run time:
Original Code:
PI = 3.14
   .
   .
   .
PIOVER2 = PI/2
   .
   .
   .
I = 100
   .
   .
   .
IF (I.GT.1) GOTO 10
10 A(I) = 3.0*Q

Optimized Code:
   .
   .
   .
PIOVER2 = 1.57
   .
   .
   .
I = 100
   .
   .
   .
10 A(100) = 3.0*Q
If a variable is assigned but never used, Compaq Fortran eliminates the entire assignment statement:
X = Y*Z          ! This assignment to X is eliminated
   .
   .
   .
X = A(I,J)*PI    ! This assignment to X remains
Some programs used for performance analysis often contain such
unnecessary operations. When you try to measure the performance of such
programs compiled with Compaq Fortran, these programs may show
unrealistically good performance results. Realistic results are
possible only with program units using their results in output
statements.
5.8.2.6 Register Usage
A large program usually has more data that would benefit from being held in registers than there are registers to hold the data. In such cases, Compaq Fortran typically tries to use the registers according to the following descending priority list:
Compaq Fortran uses heuristic algorithms and a modest amount of computation to attempt to determine an effective usage for the registers.
Holding Variables in Registers
Because operations using registers are much faster than using memory, Compaq Fortran generates code that uses the Alpha 64-bit integer and floating-point registers instead of memory locations. Knowing when Compaq Fortran uses registers may be helpful when doing certain forms of debugging.
Compaq Fortran uses registers to hold the values of variables whenever the Fortran language does not require them to be held in memory, such as holding the values of temporary results of subexpressions, even if -o0 (no optimization) was specified.
Compaq Fortran may hold the same variable in different registers at different points in the program:
V = 3.0*Q
   .
   .
   .
X = SIN(Y)*V
   .
   .
   .
V = PI*X
   .
   .
   .
Y = COS(Y)*V
Compaq Fortran may choose one register to hold the first use of V and another register to hold the second. Both registers can be used for other purposes at points in between. There may be times when the value of the variable does not exist anywhere in the registers. If the value of V is never needed in memory, it is never assigned.
Compaq Fortran uses registers to hold the values of I, J, and K (so long as there are no other optimization effects, such as loops involving the variables):
A(I) = B(J) + C(K) |
More typically, an expression uses the same index variable:
A(K) = B(K) + C(K) |
In this case, K is loaded into only one register and is used to index
all three arrays at the same time.
5.8.2.7 Mixed Real/Complex Operations
In mixed REAL/COMPLEX operations, Compaq Fortran avoids the conversion and performs a simplified operation on:
For example, if variable R is REAL and A and B are COMPLEX, no conversion occurs with the following:
COMPLEX A, B
   .
   .
   .
B = A + R
To enable global optimizations, use -o2 or a higher optimization level ( -o3 , -o4 , or -o5 ). Using -o2 or higher also enables local optimizations ( -o1 ).
Global optimizations include:
Data flow analysis and split lifetime analysis (global data analysis) traces the values of variables and whole arrays as they are created and used in different parts of a program unit. During this analysis, Compaq Fortran assumes that any pair of array references to a given array might access the same memory location, unless a constant subscript is used in both cases.
To eliminate unnecessary recomputations of invariant expressions in loops, Compaq Fortran hoists them out of the loops so they execute only once.
Global data analysis includes which data items are selected for analysis. Some data items are analyzed as a group and some are analyzed individually. Compaq Fortran limits or may disqualify data items that participate in the following constructs, generally because it cannot fully trace their values:
COMMON /X/ I
DO J=1,N
  I = J
  CALL FOO
  A(I) = I
ENDDO
To enable additional global optimizations, use -o3 or a higher optimization level ( -o4 or -o5 ). Using -o3 or higher also enables local optimizations ( -o1 ) and global optimizations ( -o2 ).
Additional global optimizations improve speed at the cost of longer
compile times and possibly extra code size.
5.8.4.1 Loop Unrolling
At optimization level -o3 or above, Compaq Fortran attempts to unroll certain innermost loops, minimizing the number of branches and grouping more instructions together to allow efficient overlapped instruction execution (instruction pipelining). The best candidates for loop unrolling are innermost loops with limited control flow.
As more loops are unrolled, the average size of basic blocks increases. Loop unrolling generates multiple copies of the code for the loop body (loop code iterations) in a manner that allows efficient instruction pipelining.
The loop body is replicated a certain number of times, substituting index expressions. An initialization loop might be created to align the first reference with the main series of loops. A remainder loop might be created for leftover work.
The loop unroller also inserts data prefetches for arrays with affine subscripts. Prefetches (that is, prefetch instructions) can be inserted even if the unroller chooses not to unroll. On some architectures (21264 and later), write-hint instructions are also generated.
The number of times a loop is unrolled can be determined either by the optimizer or by using the -unroll num option, which can specify the limit for loop unrolling. Unless the user specifies a value, the optimizer will choose an unroll amount that minimizes the overhead of prefetching while also limiting code size expansion.
Array operations are often represented as a nested series of loops when expanded into instructions. The innermost loop for the array operation is the best candidate for loop unrolling (like DO loops). For example, the following array operation (once optimized) is represented by nested loops, where the innermost loop is a candidate for loop unrolling:
A(1:100,2:30) = B(1:100,1:29) * 2.0 |
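Conceptually, unrolling an innermost loop by four produces code like the following hand-unrolled sketch; the compiler generates this form automatically at -o3 and above, and adds a remainder loop when the trip count is not a multiple of the unroll amount (assumed to be a multiple of 4 here):

DO I = 1, N, 4
   A(I)   = B(I)   * 2.0
   A(I+1) = B(I+1) * 2.0
   A(I+2) = B(I+2) * 2.0
   A(I+3) = B(I+3) * 2.0
END DO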
For More Information:
In addition to loop unrolling and other optimizations, the number of branches are reduced by replicating code that will eliminate branches. Code replication decreases the number of basic blocks and increases instruction-scheduling opportunities.
Code replication normally occurs when a branch is at the end of a flow of control, such as a routine with multiple, short exit sequences. The code at the exit sequence gets replicated at the various places where a branch to it might occur.
For example, consider the following unoptimized routine and its optimized equivalent that uses code replication (R0 is register 0):
[Instruction listing: Unoptimized Instructions compared with Optimized (Replicated) Instructions]
Similarly, code replication can also occur within a loop that contains
a small amount of shared code at the bottom of a loop and a case-type
dispatch within the loop. The loop-end test-and-branch code might be
replicated at the end of each case to create efficient instruction
pipelining within the code for each case.
5.8.5 Automatic Inlining
To enable optimizations that perform automatic inlining, use -o4 or a higher optimization level ( -o5 ). Using -o4 also enables local optimizations ( -o1 ), global optimizations ( -o2 ), and additional global optimizations ( -o3 ).
The default is -o4 (unless -g2, -g, or -gen_feedback is specified).
5.8.5.1 Interprocedure Analysis
Compiling multiple source files at optimization level -o4 or higher lets the compiler examine more code for possible optimizations, including multiple program units. This results in:
As more procedures are inlined, the size of the executable program and
compile times may increase, but execution time should decrease.
5.8.5.2 Inlining Procedures
Inlining refers to replacing a subprogram reference (such as a CALL statement or function invocation) with the replicated code of the subprogram. As more procedures are inlined, global optimizations often become more effective.
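The effect can be pictured with a hedged, schematic example (the routine name and loop body are hypothetical):

! Original code: each invocation pays call overhead, and the loop inside
! SCALE_ADD is hidden from optimizations in the caller.
CALL SCALE_ADD(A, B, N, S)

! After inlining, the subprogram body replaces the call, so the loop is
! exposed to global optimization within the calling program unit.
DO I = 1, N
   A(I) = A(I) + S*B(I)
END DO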
The optimizer inlines small procedures, limiting inlining candidates based on such criteria as:
You can specify:
Software pipelining and additional software dependence analysis are enabled by using the -pipeline option, the -o4 option, or the -o5 option. Software pipelining in certain cases improves run-time performance.
Software pipelining applies instruction scheduling to certain innermost loops, allowing instructions within a loop to "wrap around" and execute in a different iteration of the loop. This can reduce the impact of long-latency operations, resulting in faster loop execution.
Software pipelining also includes associated additional software dependence analysis and enables the prefetching of data to reduce the impact of cache misses.
Loop unrolling (enabled at -o3 or above) cannot schedule across iterations of a loop. Because software pipelining can schedule across loop iterations, it can perform more efficient scheduling to eliminate instruction stalls within loops.
For instance, if software dependence analysis of data flow reveals that certain calculations can be done before or after a given iteration of the loop, software pipelining reschedules those instructions ahead of or behind that iteration, at places where their execution can prevent instruction stalls or otherwise improve performance.
Software pipelining can be more effective when you combine -pipeline (or -o4 or -o5 ) with the appropriate -tune keyword for the target Alpha processor generation (see Section 5.9.4).
To specify software pipelining without loop transformation optimizations, do one of the following:
This optimization is not performed at optimization levels below -o2 .
Loops chosen for software pipelining:
By modifying the unrolled loop and inserting instructions as needed before and/or after it, software pipelining generally improves run-time performance. The exception is loops that contain a large number of instructions with many existing overlapped operations; in that case, software pipelining may not have enough registers available to improve execution performance, and run-time performance using -o4 or -o5 (or -pipeline ) may be no better than using -o3 .
This option might increase compilation time and/or program size. For programs that contain loops that exhaust available registers, longer execution times may occur. In this case, specify options -unroll 1 or -unroll 2 with the -pipeline option.
To determine whether using -pipeline benefits your particular program, you should time program execution for the same program (or subprogram) compiled with and without software pipelining (such as with -pipeline and -nopipeline ).
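For example, a comparison might look like the following (the file names are hypothetical; time is the shell's timing command):

% f90 -O4 -pipeline -o pipe.out prog.f90
% time pipe.out
% f90 -O4 -nopipeline -o nopipe.out prog.f90
% time nopipe.out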
For programs that contain loops that exhaust available registers, longer execution times may result with -o4 or -o5 , requiring use of -unroll n to limit loop unrolling (see Section 3.94).
The loop transformation optimizations are enabled by using the -transform_loops option or the -o5 option. Loop transformation attempts to improve performance by rewriting loops to make better use of the memory system. By rewriting loops, the loop transformation optimizations can increase the number of instructions executed, which can degrade the run-time performance of some programs.
To request loop transformation optimizations without software pipelining, do one of the following:
This optimization is not performed at optimization levels below -o2 .
If you specify -o5 and want this type of optimization disabled, you must also specify -notransform_loops .
The loop transformation optimizations apply to array references within loops. These optimizations can improve the performance of the memory system and usually apply to multiple nested loops.
The loops chosen for loop transformation optimizations are always counted loops. A counted loop uses a variable to count iterations, so the number of iterations is known before the loop is entered. For example, DO loops with an explicit iteration count are normally counted loops, but uncounted DO WHILE loops are not.
Conditions that typically prevent the loop transformation optimizations from occurring include subprogram references that are not inlined (such as an external function call), complicated exit conditions, and uncounted loops.
The types of optimizations associated with -transform_loops include the following:
To determine whether using -transform_loops benefits your particular program, you should time program execution for the same program (or subprogram) compiled with and without loop transformation optimizations (such as with -transform_loops and -notransform_loops ).
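As one hedged illustration of this kind of rewriting (the arrays and loop bounds here are hypothetical, and the compiler's actual output may differ), a nest whose inner loop strides across memory can be interchanged so the inner loop runs down the contiguous first dimension:

! Before: the inner loop varies the second subscript, so successive
! references to A and B are far apart in memory.
DO I = 1, N
   DO J = 1, M
      A(I,J) = A(I,J) + B(I,J)
   END DO
END DO

! After interchange: the inner loop varies the first subscript, which is
! contiguous in Fortran, making better use of cache lines.
DO J = 1, M
   DO I = 1, N
      A(I,J) = A(I,J) + B(I,J)
   END DO
END DO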
In addition to the -on options (discussed in Section 5.8), several other f90 command options can prevent or facilitate improved optimizations.
5.9.1 Setting Multiple Options with the -fast Option
Specifying the -fast option sets many performance options. For details, see Section 3.40, -fast --- Set Options to Improve Run-Time Performance.
5.9.2 Controlling the Number of Times a Loop Is Unrolled
You can specify the number of times a loop is unrolled by using the -unroll num option (see Section 3.94).
The -unroll num option can also influence the run-time results of software pipelining optimizations performed when you specify one of the following:
Although unrolling loops usually improves run-time performance, the size of the executable program may increase.
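For example (the file name is hypothetical, and the unroll factor of 4 is arbitrary; it overrides the amount the optimizer would otherwise choose):

% f90 -O4 -unroll 4 -o main.out main.f90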
To specify the types of procedures to be inlined, use the -inline keyword option. Also, compile multiple source files together and specify an adequate optimization level, such as -o4 .
If you omit both -noinline and the -inline keyword option, the optimization level ( -on option) in use determines the types of procedures that are inlined.
Maximizing the types of procedures that are inlined usually improves run-time performance, but compile-time memory usage and the size of the executable program may increase.
To determine whether using -inline all benefits your particular program, time program execution for the same program compiled with and without -inline all .
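For example (hypothetical file names; the second build relies on the default inlining behavior for the optimization level in use, and time is the shell's timing command):

% f90 -O4 -inline all -o inline_all.out main.f90 sub2.f90 sub3.f90
% time inline_all.out
% f90 -O4 -o default.out main.f90 sub2.f90 sub3.f90
% time default.out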
For More Information:
You can specify the types of optimized code to be generated by using the -tune keyword and -arch keyword options. Regardless of the specified keyword, the generated code will run correctly on all implementations of the Alpha architecture. Tuning for a specific implementation can improve run-time performance; it is also possible that code tuned for a specific target may run slower on another target.
Specifying the correct keyword for -tune keyword for the target processor generation type usually slightly improves run-time performance. Unless you request software pipelining, the run-time performance difference for using the wrong keyword for -tune keyword (such as using -tune ev4 for an ev5 processor) is usually less than 5%. When using software pipelining (using -o4 or -o5 ) with -tune keyword , the difference can be more than 5%.
The combination of the specified keyword for -tune keyword and the type of processor generation used has no effect on producing the expected correct program results.
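For example (the file name is hypothetical; ev5 is just one of the supported processor keywords), a build tuned for an EV5-generation processor might look like this:

% f90 -O4 -tune ev5 -o main.out main.f90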
For More Information:
(TU*X ONLY) Speculative execution reduces instruction latency stalls to improve run-time performance for certain programs or routines. Speculative execution evaluates conditional code (including exceptions) and moves instructions that would otherwise be executed conditionally to a position before the test, so they are executed unconditionally.
The default, -speculate none , means that the speculative execution code scheduling optimization is not used and exceptions are reported as expected. You can specify -speculate all or -speculate by_routine to request the speculative execution optimization.
Performance improvements may be reduced because the run-time system must dismiss exceptions caused by speculative instructions. For certain programs, longer execution times may result when using the speculative execution optimization. To determine whether using -speculate all or -speculate by_routine benefits your particular program, you should time the program execution with one of these options for the same program compiled with -speculate none (default).
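For example, a comparison might look like the following (hypothetical file names; the second command relies on the default -speculate none, and time is the shell's timing command):

% f90 -O4 -speculate all -o spec.out main.f90
% time spec.out
% f90 -O4 -o nospec.out main.f90
% time nospec.out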
Speculative execution does not support some run-time error checking, because exception and signal processing (including SIGSEGV, SIGBUS, and SIGFPE) becomes conditional. When you need to debug the program or test for errors, use only -speculate none .
When you specify -non_shared to request a nonshared object file, you can specify the -om option to request code optimizations after linking, including nop (No Operation) removal, .lita removal, and reallocation of common symbols. This option also positions the global pointer register so the maximum addresses fall in the global-pointer window.
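A hedged sketch of the usage (the file name is hypothetical):

% f90 -O4 -non_shared -om -o main.out main.f90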
If you use the -fp_reorder option (or -assume noaccuracy_sensitive , which are equivalent), Compaq Fortran may reorder code (based on algebraic identities) to improve performance.
For example, the following expressions are mathematically equivalent but may not compute the same value using finite precision arithmetic:
X = (A + B) + C
X = A + (B + C)
The results can be slightly different from those obtained with the default ( -no_fp_reorder ) because of the way intermediate results are rounded. However, the -fp_reorder results are not categorically less accurate than those produced by the default. In fact, dot product summations using -fp_reorder can produce more accurate results than those using -no_fp_reorder .
The effect of -fp_reorder is important when Compaq Fortran hoists divide operations out of a loop. If -fp_reorder is in effect, the unoptimized loop shown on the left below becomes the optimized loop on the right:
Unoptimized Code | Optimized Code
---|---
 | T = 1/V
DO I=1,N | DO I=1,N
. | .
. | .
. | .
B(I) = A(I)/V | B(I) = A(I)*T
END DO | END DO
The transformation in the optimized loop increases performance significantly, and loses little or no accuracy. However, it does have the potential for raising overflow or underflow arithmetic exceptions.
The compiler can also reorder code based on algebraic identities to improve performance if you specify -fast .
5.9.8 Dummy Aliasing Assumption
Some programs compiled with Compaq Fortran (or Compaq Fortran 77) may produce results that differ from the results of other Fortran compilers. Such programs may alias dummy arguments to each other, to a variable in a common block, or to a variable shared through use association, where at least one of the accesses is a store.
This program behavior is prohibited by the Fortran 95/90 standards, but Compaq Fortran does not diagnose it. Other versions of Fortran allow dummy aliases and check for them to ensure correct results. Compaq Fortran, however, assumes that no dummy aliasing will occur, and it can ignore potential data dependencies from this source in favor of faster execution.
The Compaq Fortran default is safe for programs conforming to the Fortran 95/90 standards. It will improve performance of these programs, because the standard prohibits such programs from passing overlapped variables or arrays as actual arguments if either is assigned in the execution of the program unit.
The -assume dummy_aliases option allows dummy aliasing. It ensures correct results by assuming the exact order of the references to dummy and common variables is required. Program units taking advantage of this behavior can produce inaccurate results if compiled with -assume nodummy_aliases .
Example 5-1 is taken from the DAXPY routine in the Fortran-77 version of the Basic Linear Algebra Subroutines (BLAS).
Example 5-1 Using the -assume dummy_aliases Option
      SUBROUTINE DAXPY(N,DA,DX,INCX,DY,INCY)
C     Constant times a vector plus a vector.
C     uses unrolled loops for increments equal to 1.
      DOUBLE PRECISION DX(1), DY(1), DA
      INTEGER I,INCX,INCY,IX,IY,M,MP1,N
C
      IF (N.LE.0) RETURN
      IF (DA.EQ.0.0) RETURN
      IF (INCX.EQ.1.AND.INCY.EQ.1) GOTO 20
C     Code for unequal increments or equal increments
C     not equal to 1.
        .
        .
        .
      RETURN
C     Code for both increments equal to 1.
C     Clean-up loop
 20   M = MOD(N,4)
      IF (M.EQ.0) GOTO 40
      DO I=1,M
         DY(I) = DY(I) + DA*DX(I)
      END DO
      IF (N.LT.4) RETURN
 40   MP1 = M + 1
      DO I = MP1, N, 4
         DY(I) = DY(I) + DA*DX(I)
         DY(I + 1) = DY(I + 1) + DA*DX(I + 1)
         DY(I + 2) = DY(I + 2) + DA*DX(I + 2)
         DY(I + 3) = DY(I + 3) + DA*DX(I + 3)
      END DO
      RETURN
      END SUBROUTINE
The second DO loop contains assignments to DY. If DY is overlapped with DA, any of the assignments to DY might give DA a new value, and this overlap would affect the results. If this overlap is desired, then DA must be fetched from memory each time it is referenced. The repetitious fetching of DA degrades performance.
Linking Routines with Opposite Settings
You can link routines compiled with the -assume dummy_aliases option to routines compiled with -assume nodummy_aliases . For example, if only one routine is called with dummy aliases, you can use -assume dummy_aliases when compiling that routine, and compile all the other routines with -assume nodummy_aliases to gain the performance value of that option.
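For example (hypothetical file names; -assume nodummy_aliases is the default, so listing it explicitly is optional), only the routine that relies on dummy aliasing needs the slower assumption:

% f90 -c -assume dummy_aliases daxpy.f
% f90 -c -assume nodummy_aliases main.f90
% f90 -o main.out main.o daxpy.o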
Programs calling DAXPY with DA overlapping DY do not conform to the FORTRAN-77 and Fortran 95/90 standards. However, they are supported if -assume dummy_aliases was used to compile the DAXPY routine.