This chapter contains the following topics:
To invoke the Compaq Fortran compiler, use:
This chapter uses f90 to indicate invoking Compaq Fortran on both systems; replace this command with fort if you are working on a Linux Alpha system. To invoke the Compaq C compiler, use:
This chapter uses cc to indicate invoking Compaq C on both systems; replace this command with ccc if you are working on a Linux Alpha system.
Before you attempt to analyze and improve program performance, you should:
To ensure that your software development environment can significantly improve the run-time performance of your applications, obtain and install the following optional software products:
http://www.compaq.com/fortran |
http://www.compaq.com/hpc/software/kap.html |
% kf90 -fkapargs='-lc=blas' for_cal.f90 -lcxml |
For More Information:
During the earlier stages of program development, you can use incremental compilation with minimal optimization. For example:
% f90 -c -O1 sub2.f90
% f90 -c -O1 sub3.f90
% f90 -o main.out -g -O0 main.f90 sub2.o sub3.o
During the later stages of program development, you should specify multiple source files together and use an optimization level of at least -o4 on the f90 command line to allow more interprocedure optimizations to occur. For instance, the following command compiles all three source files together using the default level of optimization ( -o4 ):
% f90 -o main.out main.f90 sub2.f90 sub3.f90 |
Compiling multiple source files together lets the compiler examine more code for possible optimizations, which allows more interprocedure optimizations to occur.
For very large programs, compiling all source files together may not be practical. In such instances, consider compiling source files containing related routines together using multiple f90 commands, rather than compiling source files individually.
Table 5-1 shows f90 options that can improve performance. Most of these options do not affect the accuracy of the results; a few improve run-time performance but can slightly change some numeric results.
Compaq Fortran performs certain optimizations unless you specify the appropriate f90 command options. Additional optimizations can be enabled or disabled using f90 command options.
Table 5-1 lists the f90 options that can directly improve run-time performance.
Option Names | Description | For More Information |
---|---|---|
-align keyword | Controls whether padding bytes are added between data items within common blocks, derived-type data, and Compaq Fortran record structures to make the data items naturally aligned. | Section 5.4 |
-architecture keyword | Determines the type of Alpha architecture code instructions to be generated for the program unit being compiled. All Alpha processors implement a core set of instructions; certain processor versions include additional instruction extensions. | Section 3.5 |
-cord and -feedback file | Uses a feedback file created during a previous compilation by specifying the -gen_feedback option. These options use the feedback file to improve run-time performance, optionally using cord to rearrange procedures. | Section 5.3.5 |
-fast | Sets several performance-related options, including -align dcommons . | See description of each option |
-fp_reorder | Allows the compiler to reorder code based on algebraic identities to improve performance, enabling certain optimizations. The numeric results can be slightly different from the default ( -no_fp_reorder ) because of the way intermediate results are rounded. This slight difference in numeric results is acceptable to most programs. | Section 5.9.7 |
-gen_feedback | Requests generated code that allows accurate feedback information for subsequent use of the -feedback file option (optionally with cord ). Using -gen_feedback changes the default optimization level from -o4 to -o0 . | Section 5.3.5 |
-hpf num and related options (TU*X ONLY) | Specifies that the code generated for this program will allow parallel execution on multiple processors | Section 3.50 |
-inline all | Inlines every call that can possibly be inlined while generating correct code. Certain recursive routines are not inlined to prevent infinite loops. | Section 5.9.3 |
-inline speed | Inlines procedures that will improve run-time performance with a likely significant increase in program size. | Section 5.9.3 |
-inline size | Inlines procedures that will improve run-time performance without a significant increase in program size. This type of inlining occurs at optimization level -o4 and -o5 . | Section 5.9.3 |
-math_library fast | Requests the use of certain math library routines (used by intrinsic functions) that provide faster speed. Using this option causes a slight loss of accuracy and provides less reliable arithmetic exception checking to get significant performance improvements in those functions. | Section 3.61 |
-mp (TU*X ONLY) | Enables parallel processing using directed decomposition (directives inserted in source code). This can improve the performance of certain programs running on shared memory multiprocessor systems | Section 3.64 |
-o n ( -o0 to -o5 ) | Controls the optimization level and thus the types of optimization performed. The default optimization level is -o4 , unless you specify -g2 , -g , or -gen_feedback , which changes the default to -o0 (no optimizations). Use -o5 to activate loop transformation optimizations. | Section 5.8 |
-om (TU*X ONLY) | Used with the -non_shared option to request certain code optimizations after linking, including nop (No Operation) removal, .lita removal, and reallocation of common symbols. This option also positions the global pointer register so the maximum addresses fall in the global-pointer window. | Section 3.73 |
-omp (TU*X ONLY) | Enables parallel processing using directed decomposition (directives inserted in source code). This can improve the performance of certain programs running on shared memory multiprocessor systems | Section 3.74 |
-p , -p1 | Requests profiling information, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance. | Section 5.3 |
-pg | Requests profiling information for the gprof tool, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance. | Section 5.3 |
-pipeline | Activates the software pipelining optimization (a subset of -o4 ). | Section 3.76 |
-speculate keyword (TU*X ONLY) | Enables the speculative execution optimization, a form of instruction scheduling for conditional expressions. | Section 3.84 |
-transform_loops | Activates a group of loop transformation optimizations (a subset of -o5 ). | Section 3.89 |
-tune keyword | Specifies the target processor generation (chip) architecture on which the program will be run, allowing the optimizer to make decisions about instruction tuning optimizations needed to create the most efficient code. Keywords allow specifying one particular Alpha processor generation type, multiple processor generation types, or the processor generation type currently in use during compilation. Regardless of the setting of -tune keyword , the generated code will run correctly on all implementations of the Alpha architecture. | Section 5.9.4 |
-unroll num | Specifies the number of times a loop is unrolled ( num) when specified with optimization level -o3 or higher. If you omit -unroll num , the optimizer determines how many times loops are unrolled. | Section 5.8.4.1 |
Table 5-2 lists options that can slow program performance. Some applications that require floating-point exception handling or rounding might need to use the -fpe n and -fprm dynamic options. Other applications might need to use the -assume dummy_aliases or -vms options for compatibility reasons. The remaining options listed in Table 5-2 are primarily for troubleshooting or debugging purposes.
Option Names | Description | For More Information |
---|---|---|
-assume dummy_aliases | Forces the compiler to assume that dummy (formal) arguments to procedures share memory locations with other dummy arguments or with variables shared through use association, host association, or common block use. These program semantics slow performance, so you should specify -assume dummy_aliases only for the called subprograms that depend on such aliases. The use of dummy aliases violates the FORTRAN-77 and Fortran 95/90 standards but occurs in some older programs. | Section 5.9.8 |
-c | If you use -c when compiling multiple source files, also specify -o output so that the source files are compiled together into one object file. Separate compilations (multiple f90 commands, or -c without the -o output option) prevent certain interprocedure optimizations. | Section 2.1.6 |
-check bounds | Generates extra code for array bounds checking at run time. | Section 3.23 |
-check omp_bindings (TU*X ONLY) | Provides run-time checking to enforce the binding rules for OpenMP Fortran API (parallel processing) compiler directives inserted in source code. | Section 3.26 |
-check overflow | Generates extra code to check integer calculations for arithmetic overflow at run time. Once the program is debugged, omit this option to reduce executable program size and slightly improve run-time performance. | Section 3.28 |
-fpe n values greater than -fpe0 | Using -fpe1 (TU*X ONLY) , -fpe2 (TU*X ONLY) , -fpe3 , or -fpe4 (TU*X ONLY) (or using the for_set_fpe routine to set equivalent exception handling) slows program execution. For programs that specify -fpe3 or -fpe4 (TU*X ONLY) , the impact on run-time performance can be significant. | Section 3.44 |
-fprm dynamic (TU*X ONLY) | Certain rounding modes and changing the rounding mode can slow program execution slightly. | Section 3.46 |
-g , -g2 , -g3 | Generates extra symbol table information in the object file. Specifying -g or -g2 also reduces the default level of optimization to -o0 . | Section 3.48 |
-inline none , -inline manual | Prevents the inlining of all procedures (except statement functions). | Section 5.9.3 |
-o0 , -o1 , -o2 , or -o3 | Minimizes the optimization level (and types of optimizations). Use during the early stages of program development or when you will use the debugger. | Section 3.72 and Section 5.8 |
-synchronous_exceptions | Generates extra code to associate an arithmetic exception with the instruction that causes it, slowing efficient instruction execution. Use this option only when troubleshooting, such as when identifying the source of an exception. | Section 3.86 |
-vms | Controls certain VMS-related run-time defaults, including alignment. If you specify the -vms option, you may need to also specify the -align records option to obtain optimal run-time performance. | Section 3.98 |
Certain shell commands and system tuning can improve run-time performance:
# myprog > results.lis
# more results.lis
Use the time command to provide information about program performance.
Run program timings when other users are not active. Your timing results can be affected by one or more CPU-intensive processes also running while doing your timings.
Try to run the program under the same conditions each time to provide the most accurate results, especially when comparing execution times of a previous version of the same program. Use the same CPU system (model, amount of memory, version of the operating system, and so on) if possible.
If you do need to change systems, you should measure the time using the same version of the program on both systems, so you know each system's effect on your timings.
For programs that run for less than a few seconds, run several timings to ensure that the results are not misleading. Overhead functions like loading shared libraries might influence short timings considerably.
Using the form of the time command that specifies the name of the executable program provides the following:
In the following example timings, the sample program being timed displays the following line:
Average of all the numbers is: 4368488960.000000 |
Using the Bourne shell, the following program timing reports that the program uses 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use) and 2.46 seconds of elapsed time:
$ time a.out
Average of all the numbers is:    4368488960.000000
real    0m2.46s
user    0m0.61s
sys     0m0.58s
Using the C shell, the following program timing reports 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use), about 4 seconds (0:04) of elapsed time, the use of 28% of available CPU time, and other information:
% time a.out
Average of all the numbers is:    4368488960.000000
0.61u 0.58s 0:04 28% 78+424k 9+5io 0pf+0w
Using the bash shell (L*X ONLY), the following program timing reports that the program uses 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use) and 2.46 seconds of elapsed time:
[user@system user]$ time ./a.out
Average of all the numbers is:    4368488960.000000
elapsed    0m2.46s
user       0m0.61s
sys        0m0.58s
Timings that show a large amount of system time may indicate a lot of time spent doing I/O, which might be worth investigating.
If your program displays a lot of text, you can redirect the output from the program on the time command line. (See Section 5.1.3.) Redirecting output from the program will change the times reported because of reduced screen I/O.
For more information, see time(1).
In addition to the time command, you might consider modifying the program to call routines within the program to measure execution time. For example:
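For instance, a minimal sketch that brackets the code being timed with the standard CPU_TIME and SYSTEM_CLOCK intrinsics (the loop shown is only a placeholder for your own computation):

PROGRAM TIMEIT
  REAL T1, T2, S
  INTEGER C1, C2, RATE, I
  CALL SYSTEM_CLOCK (C1, RATE)
  CALL CPU_TIME (T1)
  S = 0.0
  DO I = 1, 10000000            ! placeholder for the work being timed
     S = S + SQRT(REAL(I))
  END DO
  CALL CPU_TIME (T2)
  CALL SYSTEM_CLOCK (C2)
  PRINT *, 'Result:          ', S
  PRINT *, 'CPU seconds:     ', T2 - T1
  PRINT *, 'Elapsed seconds: ', REAL(C2 - C1) / REAL(RATE)
END PROGRAM TIMEIT

Printing the result prevents the optimizer from eliminating the timed computation entirely (as noted later in this chapter, computations whose results are never used may be removed).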
To generate profiling information, use the f90 compiler and the prof , gprof , and pixie (TU*X ONLY) tools.
Profiling identifies areas of code where significant program execution time is spent. Along with the f90 command, use the prof and pixie (TU*X ONLY) tools to generate the following profile information:
Once you have determined those sections of code where most of the program execution time is spent, examine these sections for coding efficiency. Suggested guidelines for improving source code efficiency are provided in Section 5.7.
Along with profiling, you can consider generating a listing file with annotations of optimizations by specifying the -V and -annotations options.
5.3.1 Program Counter Sampling (prof)
To obtain program counter sampling data, perform the following steps:
% f90 -p -O3 -o profsample profsample.f90 |
% f90 -c -O3 profsample.f90
% f90 -p -O3 -o profsample profsample.o
% profsample |
% prof profsample mon.out |
You can limit the report created by prof by using prof command options, such as -only , -exclude , or -quit .
For example, if you only want reports on procedures calc_max and calc_min , you could use the following command line to read the profile data file named mon.out :
% prof -only calc_max -only calc_min profsample |
The time spent in particular areas of code is reported by prof in the form of a percentage of the total CPU time spent by the program. To reduce the size of the report, you can either:
When you use the -only or -exclude options, the percentages are still based on all procedures of the application. To have prof calculate percentages based on only those procedures included in the report, use the -Only and -Exclude options instead (note the uppercase initial letter in the option name).
You can use the -quit option to reduce the amount of information reported. For example, the following command prints information on only the five most time-consuming procedures:
% prof -quit 5 profsample |
The following command limits information only to those procedures using 10% or more of the total execution time:
% prof -quit 10% profsample |
For More Information:
To obtain call graph information, use the gprof tool. Perform the following steps:
% f90 -pg -O3 -o profsample profsample.f90
% f90 -pg -c -O3 profsample.f90
% f90 -pg -O3 -o profsample profsample.o
% profsample |
% gprof profsample gmon.out |
The output produced by gprof includes:
For More Information:
To obtain basic block counting information, perform the following steps:
% f90 -O3 -o profsample profsample.f90 |
% atom -tool pixie profsample
% profsample.pixie |
% prof -pixie profsample |
To create multiple profile data files, run the program multiple times.
For More Information:
You use the same files created by the pixie command (see Section 5.3.3) for basic block counting to estimate the number of CPU cycles used to execute each source file line.
To view a report of the number of CPU cycles estimated for each source file line, use the following options with the prof command:
Depending on the level of optimization chosen, certain source lines might be optimized away.
The CPU cycle use estimates are based primarily on the instruction type and its operands and do not include memory effects such as cache misses or translation buffer fills.
For example, the following command sequence uses:
% f90 -o profsample profsample.f90
% atom -tool pixie profsample
% profsample.pixie
% prof -pixie -heavy -only calc_max profsample
You can create a feedback file by using a series of commands. Once created, you can specify a feedback file in a subsequent compilation with the f90 command option -feedback . You can also request that cord use the feedback file to rearrange procedures, by specifying the -cord option on the f90 command line.
To create the feedback file, complete these steps:
% f90 -o profsample -gen_feedback profsample.f90 |
% pixie profsample |
% profsample.pixie |
% prof -pixie -feedback profsample.feedback profsample |
You can use the feedback file as input to the f90 compiler:
% f90 -feedback profsample.feedback -o profsample profsample.f90 |
The feedback file provides the compiler with actual execution information, which the compiler can use to improve such optimizations as inlining function calls.
Specify the desired optimization level ( -o n option) for the f90 command along with the -feedback name option (in this example, the default -o4 is used).
You can use the feedback file as input to the f90 compiler and cord , as follows:
% f90 -cord -feedback profsample.feedback -o profsample profsample.f90 |
The -cord option invokes cord , which reorders the procedures in an executable program to improve program execution, using the information in the specified feedback file. Specify the desired optimization level ( -o n option) for the f90 command along with the -feedback name option (in this example, -o4 ).
5.3.6 Atom Toolkit
(TU*X ONLY) The Atom toolkit includes a programmable instrumentation tool and several prepackaged tools. The prepackaged tools include:
To invoke atom tools, use the following general command syntax:
% atom -tool tool-name ...
Atom does not work on programs built with the -om option.
For More Information:
For optimal performance on Alpha systems, make sure your data is aligned naturally.
A natural boundary is a memory address that is a multiple of the data item's size (data type sizes are described in Table 9-1). For example, a REAL (KIND=8) data item aligned on natural boundaries has an address that is a multiple of 8. An array is aligned on natural boundaries if all of its elements are.
All data items whose starting address is on a natural boundary are naturally aligned. Data not aligned on a natural boundary is called unaligned data.
Although the Compaq Fortran compiler naturally aligns individual data items when it can, certain Compaq Fortran statements (such as EQUIVALENCE) can cause data items to become unaligned (see Section 5.4.1).
Although you can use the f90 command -align keyword options to ensure naturally aligned data, you should check and consider reordering the data declarations of data items within common blocks and structures. Within each common block, derived type, or record structure, carefully specify the order and sizes of data declarations to ensure naturally aligned data. Start with the largest numeric items first, followed by smaller numeric items, and then nonnumeric (character) data.
5.4.1 Causes of Unaligned Data and Ensuring Natural Alignment
Common blocks (COMMON statement), derived-type data, and Compaq Fortran 77 record structures (RECORD statement) usually contain multiple items within the context of the larger structure.
The following declaration statements can force data to be unaligned:
To avoid unaligned data in a common block, derived-type data, or record structure (extension), use one or both of the following:
Other possible causes of unaligned data include unaligned actual arguments and arrays that contain a derived-type structure or Compaq Fortran record structure.
When actual arguments from outside the program unit are not naturally aligned, unaligned data access will occur. Compaq Fortran assumes all passed arguments are naturally aligned and has no information at compile time about data that will be introduced by actual arguments during program execution.
For arrays where each array element contains a derived-type structure or Compaq Fortran record structure, the size of the array elements may cause some elements (but not the first) to start on an unaligned boundary.
Even if the data items are naturally aligned within a derived-type structure (one without the SEQUENCE statement) or a record structure, the size of an array element might require the use of f90 -align options to supply the padding needed to keep some array elements from starting on unaligned boundaries.
If you specify -align norecords or specify -vms without -align records , no padding bytes are added between array elements. If array elements each contain a derived-type structure with the SEQUENCE statement, array elements are packed without padding bytes regardless of the f90 command options specified. In this case, some elements will be unaligned.
When the -align records option is in effect, the number of padding bytes added by the compiler for each array element depends on the size of the largest data item within the structure. The compiler makes the size of each array element an exact multiple of the size of the largest data item in the derived-type structure (one without the SEQUENCE statement) or record structure by adding the appropriate number of padding bytes.
For instance, if a structure contains an 8-byte floating-point number followed by a 3-byte character variable, each element contains five bytes of padding (the 11 bytes of data are padded to 16, an exact multiple of 8). However, if the structure contains one 4-byte floating-point number, one 4-byte integer, followed by a 3-byte character variable, each element would contain one byte of padding (the 11 bytes of data are padded to 12, an exact multiple of 4).
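As a sketch of the first case (hypothetical names; assumes the -align records behavior described above is in effect):

TYPE PADDED_PART
  REAL (KIND=8)    VALUE      ! 8 bytes, the largest item in the structure
  CHARACTER(LEN=3) TAG        ! 3 bytes
END TYPE PADDED_PART          ! 11 bytes of data, padded to 16 per element
TYPE (PADDED_PART) LIST(100)  ! five padding bytes in each array element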
During compilation, the Compaq Fortran compiler naturally aligns as much data as possible. Exceptions that can result in unaligned data are described in Section 5.4.1.
Because unaligned data can slow run-time performance, it is worthwhile to:
There are two ways unaligned data might be reported:
Unaligned access pid=24821 <a.out> va=140000154, pc=3ff80805d60, ra=1200017bc |
For new programs or when the source declarations of an existing program can be easily modified, plan the order of your data declarations carefully to ensure the data items in a common block, derived-type data, record structure, or data items made equivalent by an EQUIVALENCE statement will be naturally aligned.
Use the following rules to prevent unaligned data:
When declaring data, consider using explicit length declarations, such as specifying a KIND parameter. For example, specify INTEGER(KIND=4) (or INTEGER(4)) rather than INTEGER. If you do use a default length (such as INTEGER, LOGICAL, COMPLEX, and REAL), be aware that the compiler options -integer_size and -real_size can change the size of an individual field's data declaration size and thus can alter the data alignment of a carefully planned order of data declarations.
Using the suggested data declaration guidelines minimizes the need to
use the
-align keyword
options to add padding bytes to ensure naturally aligned data. In cases
where the
-align keyword
options are still needed, using the suggested data declaration
guidelines can minimize the number of padding bytes added by the
compiler.
5.4.3.1 Arranging Data Items in Common Blocks
The order of data items in a COMMON statement determines the order in which the data items are stored. Consider the following declaration of a common block named X:
LOGICAL (KIND=2) FLAG
INTEGER IARRY_I(3)
CHARACTER(LEN=5) NAME_CH
COMMON /X/ FLAG, IARRY_I, NAME_CH
As shown in Figure 5-1, if you omit the appropriate f90 command options, the common block will contain unaligned data items beginning at the first array element of IARRY_I.
Figure 5-1 Common Block with Unaligned Data
As shown in Figure 5-2, if you compile the program units that use the common block with the -align commons options, data items will be naturally aligned.
Figure 5-2 Common Block with Naturally Aligned Data
Because the common block X contains data items whose size is 32 bits or smaller, specify -align commons . If the common block contains data items whose size might be larger than 32 bits (such as REAL (KIND=8) data), use -align dcommons .
If you can easily modify the source files that use the common block data, define the numeric variables in the COMMON statement in descending order of size and place the character variable last. This provides more portability, ensures natural alignment without padding, and does not require the f90 command options -align commons or -align dcommons :
LOGICAL (KIND=2) FLAG
INTEGER IARRY_I(3)
CHARACTER(LEN=5) NAME_CH
COMMON /X/ IARRY_I, FLAG, NAME_CH
As shown in Figure 5-3, if you arrange the order of variables from largest to smallest size and place character data last, the data items will be naturally aligned.
Figure 5-3 Common Block with Naturally Aligned Reordered Data
When modifying or creating all source files that use common block data,
consider placing the common block data declarations in a module so the
declarations are consistent. If the common block is not needed for
compatibility (such as file storage or Compaq Fortran 77 use), you can
place the data declarations in a module without using a common block.
5.4.3.2 Arranging Data Items in Derived-Type Data
Like common blocks, derived-type data may contain multiple data items (members).
Data item components within derived-type data will be naturally aligned on up to 64-bit boundaries, with certain exceptions related to the use of the SEQUENCE statement and f90 options. See Section 5.4.4 for information about these exceptions.
Compaq Fortran stores a derived data type as a linear sequence of values, as follows:
Consider the following declaration of array CATALOG_SPRING of derived-type PART_DT:
MODULE DATA_DEFS
  TYPE PART_DT
    INTEGER IDENTIFIER
    REAL WEIGHT
    CHARACTER(LEN=15) DESCRIPTION
  END TYPE PART_DT
  TYPE (PART_DT) CATALOG_SPRING(30)
   .
   .
   .
END MODULE DATA_DEFS
As shown in Figure 5-4, the largest numeric data items are defined first and the character data type is defined last. There are no padding characters between data items and all items are naturally aligned. The trailing padding byte is needed because CATALOG_SPRING is an array; it is inserted by the compiler when the -align records option is in effect.
Figure 5-4 Derived-Type Naturally Aligned Data (in CATALOG_SPRING : ( ,))
Compaq Fortran supports Compaq Fortran 77 record structures, which use the RECORD statement and optionally the STRUCTURE statement; these statements are extensions to the FORTRAN-77 and Fortran 95/90 standards. The order of data items in a STRUCTURE statement determines the order in which the data items are stored.
Compaq Fortran stores a record in memory as a linear sequence of values, with the record's first element in the first storage location and its last element in the last storage location. Unless you specify -align norecords , padding bytes are added if needed to ensure data fields are naturally aligned.
The following example contains a structure declaration, a RECORD statement, and diagrams of the resulting records as they are stored in memory:
STRUCTURE /STRA/
  CHARACTER*1 CHR
  INTEGER*4 INT
END STRUCTURE
   .
   .
   .
RECORD /STRA/ REC
Figure 5-5 shows the memory diagram of record REC for naturally aligned records.
Figure 5-5 Memory Diagram of REC for Naturally Aligned Records
The following options control whether the Compaq Fortran compiler adds padding (when needed) to naturally align multiple data items in common blocks, derived-type data, and Compaq Fortran record structures:
The default behavior is that multiple data items in derived-type data and record structures will be naturally aligned; data items in common blocks will not ( -align records with -align nocommons ). In derived-type data, using the SEQUENCE statement prevents -align records from adding needed padding bytes to naturally align data items.
If your command line includes the
-std
,
-std90
, or
-std95
options, then the compiler ignores
-align dcommons
and
-align sequence
. See Section 3.85.
5.5 Using Arrays Efficiently
The following sections discuss:
On Alpha systems, many of the array access efficiency techniques described in this section are applied automatically by the Compaq Fortran loop transformation optimizations (see Section 5.8.7) or by the Compaq KAP Fortran/OpenMP for Tru64 UNIX Systems performance preprocessor (described in Section 5.1.1).
Several aspects of array use can improve run-time performance:
A = A + 1. |
REAL :: A(100,100)
A = 0.0
A = A + 1.       ! Increment all elements of A by 1
   .
   .
   .
WRITE (8) A      ! Fast whole array use
TYPE X
  INTEGER A(5)
END TYPE X
   .
   .
   .
TYPE (X) Z
WRITE (8) Z%A    ! Fast array structure component use
INTEGER X(3,5), Y(3,5), I, J
Y = 0
DO I=1,3                   ! I outer loop varies slowest
  DO J=1,5                 ! J inner loop varies fastest
    X (I,J) = Y(I,J) + 1   ! Inefficient row-major storage order
  END DO                   ! (rightmost subscript varies fastest)
END DO
   .
   .
   .
END PROGRAM
INTEGER X(3,5), Y(3,5), I, J
Y = 0
DO J=1,5                   ! J outer loop varies slowest
  DO I=1,3                 ! I inner loop varies fastest
    X (I,J) = Y(I,J) + 1   ! Efficient column-major storage order
  END DO                   ! (leftmost subscript varies fastest)
END DO
   .
   .
   .
END PROGRAM
INTEGER X(5,3), Y(5,3), I, J
Y = 0
DO I=1,3                   ! I outer loop varies slowest
  DO J=1,5                 ! J inner loop varies fastest
    X (J,I) = Y(J,I) + 1   ! Efficient column-major storage order
  END DO                   ! (leftmost subscript varies fastest)
END DO
   .
   .
   .
END PROGRAM
REAL A (512,100)
DO I = 2,511
  DO J = 2,99
    A(I,J) = (A(I+1,J-1) + A(I-1,J+1)) * 0.5
  END DO
END DO
In Fortran 95/90, there are two general types of array arguments:
When passing arrays as arguments, either the starting (base) address of the array or the address of an array descriptor is passed:
Passing an assumed-shape array or array pointer to an explicit-shape array can slow run-time performance. This is because the compiler needs to create an array temporary for the entire array. The array temporary is created because the passed array may not be contiguous and the receiving (explicit-shape) array requires a contiguous array. When an array temporary is created, the size of the passed array determines whether the impact on slowing run-time performance is slight or severe.
Table 5-3 summarizes what happens with the various combinations of array types. The amount of run-time performance inefficiency depends on the size of the array.
Input (Actual) Argument Array Types | Dummy Argument: Explicit-Shape Arrays | Dummy Argument: Deferred-Shape and Assumed-Shape Arrays |
---|---|---|
Explicit-shape arrays | Very efficient. Does not use an array temporary. Does not pass an array descriptor. Interface block optional. | Efficient. Only allowed for assumed-shape arrays (not deferred-shape arrays). Does not use an array temporary. Passes an array descriptor. Requires an interface block. |
Deferred-shape and assumed-shape arrays | When passing an allocatable array, very efficient. Does not use an array temporary. Does not pass an array descriptor. Interface block optional. When not passing an allocatable array, not efficient; use allocatable arrays whenever possible. Uses an array temporary. Does not pass an array descriptor. Interface block optional. | Efficient. Requires an assumed-shape or array pointer as dummy argument. Does not use an array temporary. Passes an array descriptor. Requires an interface block. |
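As an illustration of the efficient combinations in Table 5-3, the following sketch (hypothetical names) passes an allocatable array to an assumed-shape dummy argument through an explicit interface; only an array descriptor is passed and no array temporary is created:

MODULE SCALE_MOD
CONTAINS
  SUBROUTINE SCALE_BY_TWO (X)
    REAL, INTENT(INOUT) :: X(:,:)    ! assumed-shape dummy argument
    X = X * 2.0
  END SUBROUTINE SCALE_BY_TWO
END MODULE SCALE_MOD

PROGRAM DEMO
  USE SCALE_MOD
  REAL, ALLOCATABLE :: A(:,:)
  ALLOCATE (A(100,100))
  A = 1.0
  CALL SCALE_BY_TWO (A)              ! descriptor passed; no copy made
END PROGRAM DEMO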
Improving overall I/O performance can minimize both device I/O and actual CPU time. The techniques listed in this section can greatly improve performance in many applications.
A bottleneck limits the maximum speed of execution by being the slowest process in an executing program. In some programs, I/O is the bottleneck that prevents an improvement in run-time performance. The key to relieving I/O bottlenecks is to reduce the actual amount of CPU and I/O device time involved in I/O.
Bottlenecks can be caused by one or more of the following:
Improved coding practices can minimize actual device I/O, as well as the actual CPU time.
Compaq offers software solutions to system-wide problems like
minimizing device I/O delays (see Section 5.1.1).
5.6.1 Use Unformatted Files Instead of Formatted Files
Use unformatted files whenever possible. Unformatted I/O of numeric data is more efficient and more precise than formatted I/O. Native unformatted data does not need to be modified when transferred and will take up less space on an external file.
Conversely, when writing data to formatted files, the data must be converted to character strings for output, less data can be transferred in a single operation, and precision may be lost if the formatted data is later read back into binary form.
To write the array A(25,25) in the following statements, S1 is more efficient than S2:
S1      WRITE (7) A

S2      WRITE (7,100) A
100     FORMAT (25(' ',25F5.2))
Although formatted data files are more easily ported to other systems,
Compaq Fortran can convert unformatted data in several formats (see
Chapter 10).
5.6.2 Write Whole Arrays or Strings
The general guidelines about array use discussed in Section 5.5 also apply to reading or writing an array with an I/O statement.
To eliminate unnecessary overhead, write whole arrays or strings at one
time rather than individual elements at multiple times. Each item in an
I/O list generates its own calling sequence. This processing overhead
becomes most significant in implied-DO loops. When accessing whole
arrays, use the array name (Fortran 95/90 array syntax) instead of
using implied-DO loops.
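For example, the following sketch contrasts element-at-a-time output with a single whole-array transfer (the two forms produce different record layouts; the point is the per-item call overhead):

REAL A(100,100)
INTEGER I, J
   .
   .
   .
! Slower: one I/O statement (and one record) per element
DO J = 1, 100
   DO I = 1, 100
      WRITE (8) A(I,J)
   END DO
END DO
! Faster: one statement transfers the whole array
WRITE (8) A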
5.6.3 Write Array Data in the Natural Storage Order
Use the natural ascending storage order whenever possible. This is column-major order, with the leftmost subscript varying fastest and striding by 1. (See Section 5.5.1, Accessing Arrays Efficiently.) If a program must read or write data in any other order, efficient block moves are inhibited.
If the whole array is not being written, natural storage order is the best order possible.
If you must use an unnatural storage order, in certain
cases it might be more efficient to transfer the data to memory and
reorder the data before performing the I/O operation.
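For example, if each output record must hold one row of a column-major array, gathering the row into a contiguous buffer and writing it with one statement is usually faster than writing the strided elements individually (a sketch with hypothetical names and sizes):

REAL A(1000,100), ROWBUF(100)
INTEGER I
   .
   .
   .
DO I = 1, 1000
   ROWBUF = A(I,:)       ! reorder in memory: one row becomes contiguous
   WRITE (9) ROWBUF      ! one record per row
END DO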
5.6.4 Use Memory for Intermediate Results
Performance can improve by storing intermediate results in memory rather than storing them in a file on a peripheral device. One situation that may not benefit from using intermediate storage is when there is a disproportionately large amount of data in relation to physical memory on your system. Excessive page faults can dramatically impede virtual memory performance.
If you are primarily concerned with the CPU performance of the system,
consider using a memory file system (mfs) virtual disk to hold any
files your code reads or writes (see mfs(1)).
5.6.5 Enable Implied-DO Loop Collapsing
DO loop collapsing reduces a major overhead in I/O processing. Normally, each element in an I/O list generates a separate call to the Compaq Fortran RTL. The processing overhead of these calls can be most significant in implied-DO loops.
Compaq Fortran reduces the number of calls in implied-DO loops by replacing up to seven nested implied-DO loops with a single call to an optimized run-time library I/O routine. The routine can transmit many I/O elements at once.
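For example, the nested implied-DO loops in the following sketch are candidates for collapsing into a single call to the run-time library; when the entire array is wanted, the whole-array form is simpler still:

INTEGER A(10,20), I, J
   .
   .
   .
WRITE (8) ((A(I,J), I=1,10), J=1,20)    ! nested implied-DO loops
WRITE (8) A                             ! whole-array equivalent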
Loop collapsing can occur in formatted and unformatted I/O, but only if certain conditions are met:
Variable format expressions (a Compaq Fortran extension) are almost as flexible as run-time formatting, but they are more efficient because the compiler can eliminate run-time parsing of the I/O format. Only a small amount of processing and the actual data transfer are required during run time.
On the other hand, run-time formatting can impair performance significantly. For example, in the following statements, S1 is more efficient than S2 because the formatting is done once at compile time, not at run time:
S1 WRITE (6,400) (A(I), I=1,N) 400 FORMAT (1X, <N> F5.2) . . . S2 WRITE (CHFMT,500) '(1X,',N,'F5.2)' 500 FORMAT (A,I3,A) WRITE (6,FMT=CHFMT) (A(I), I=1,N) |
Records being read or written are transferred between the user's program buffers and one or more disk block I/O buffers, which are established when the file is opened by the Compaq Fortran RTL. Unless very large records are being read or written, multiple logical records can reside in the disk block I/O buffer when it is written to disk or read from disk, minimizing physical disk I/O.
You can specify the size of the disk block physical I/O buffer by using the OPEN statement BLOCKSIZE specifier; the default size can be obtained from fstat(2). If you omit the BLOCKSIZE specifier in the OPEN statement, it is set for optimal I/O use with the type of device the file resides on (with the exception of network access).
The OPEN statement BUFFERCOUNT specifier specifies the number of I/O buffers. The default for BUFFERCOUNT is 1. Any experiments to improve I/O performance should increase the BUFFERCOUNT value and not the BLOCKSIZE value, to increase the amount of data read by each disk I/O.
If the OPEN statement has BLOCKSIZE and BUFFERCOUNT specifiers, then the internal buffer size in bytes is the product of these specifiers. If the OPEN statement does not have these specifiers, then the default internal buffer size is 8192 bytes. This internal buffer will grow to hold the largest single record, but will never shrink.
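For example, a sketch of an OPEN statement that sets both specifiers (the unit number, file name, and values shown are examples only):

OPEN (UNIT=11, FILE='bigdata.dat', FORM='UNFORMATTED', STATUS='OLD', &
      BLOCKSIZE=8192, BUFFERCOUNT=4)   ! internal buffer: 8192*4 = 32768 bytes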
The default for the Fortran run-time system is to use unbuffered disk writes. That is, by default, records are written to disk immediately as each record is written instead of accumulating in the buffer to be written to disk later.
To enable buffered writes (that is, to allow the disk device to fill the internal buffer before the buffer is written to disk), use one of the following:
The OPEN statement BUFFERED specifier takes precedence over the -assume buffered_io option. If neither one is set (which is the default), the FORT_BUFFERED environment variable is tested at run time.
The OPEN statement BUFFERED specifier applies to a specific logical unit. In contrast, the -assume [no]buffered_io option and the FORT_BUFFERED environment variable apply to all Fortran units.
Using buffered writes usually makes disk I/O more efficient by writing larger blocks of data to the disk less often. However, a system failure when using buffered writes can cause records to be lost, since they might not yet have been written to disk. (Such records would have been written to disk with the default unbuffered writes.)
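For example, a sketch of enabling buffered writes on a single unit with the OPEN statement BUFFERED specifier (the value 'YES' and the file name are assumptions for illustration):

OPEN (UNIT=12, FILE='results.dat', FORM='UNFORMATTED', STATUS='NEW', &
      BUFFERED='YES')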
When performing I/O across a network, be aware that the size of the block of network data sent across the network can impact application efficiency. When reading network data, follow the same advice for efficient disk reads, by increasing the BUFFERCOUNT. When writing data through the network, several items should be considered:
When writing records, be aware that I/O records are written to unified buffer cache (UBC) system buffers. To request that I/O records be written from program buffers to the UBC system buffers, use the flush library routine (see flush(3f) and Chapter 12). Be aware that calling flush also discards read-ahead data in the user buffer.
To request that UBC system buffers be written to disk, use the fsync library routine (see fsync(3f) and Chapter 12).
When UBC buffers are written to disk depends on UBC characteristics on the system, such as the vm-ubcbuffers attribute (see the Compaq Tru64 UNIX System Tuning and Performance guide).
For More Information:
The sum of the record length (RECL specifier in an OPEN statement) and its overhead should be a multiple or divisor of the blocksize, which is device specific. For example, if the blocksize is 8192, then RECL might be 24576 (three times the blocksize) or 1024 (one-eighth of the blocksize).
The RECL value should fill blocks as close to capacity as possible (but not over capacity). Such values allow efficient moves, with each operation moving as much data as possible; the least amount of space in the block is wasted. Avoid using values larger than the block capacity, because they create very inefficient moves for the excess data only slightly filling a block (allocating extra memory for the buffer and writing partial blocks are inefficient).
The RECL value for formatted files is always in 1-byte units. For unformatted files, the RECL value is in 4-byte units, unless you specify the -assume byterecl option to request 1-byte units (see Section 3.7).
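For example, for a direct-access unformatted file whose records each hold 256 REAL (KIND=8) values (2048 bytes), the default 4-byte units give RECL=512; with -assume byterecl the same record would be RECL=2048. A sketch, with an example file name:

REAL (KIND=8) BUF(256)
OPEN (UNIT=13, FILE='direct.dat', FORM='UNFORMATTED', ACCESS='DIRECT', &
      RECL=512)                  ! 2048 bytes expressed in 4-byte units
WRITE (13, REC=1) BUF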
When porting unformatted data files from non-Compaq systems, see
Section 10.6.
5.6.9 Use the Optimal Record Type
Unless a certain record type is needed for portability reasons (see Section 7.4.3), choose the most efficient type, as follows:
Due to certain precautions that the Fortran run-time system takes to ensure the integrity of standard input, reads can be very slow when standard input is redirected from a file. For example, when you use a command such as myprogram.exe < myinput.dat , the data is read using the READ(*) or READ(5) statement, and performance is degraded. To avoid this problem, do one of the following:
OPEN(5, STATUS='OLD', FILE='myinput.dat') |
setenv FORT5 myinput.dat
setenv FOR_READ myinput.dat
To take advantage of these methods, be sure your program does not rely on sharing the standard input file.
Other source coding guidelines can be implemented to improve run-time performance.
The amount of improvement in run-time performance is related to the
number of times a statement is executed. For example, improving an
arithmetic expression executed within a loop many times has the
potential to improve performance, more than improving a similar
expression executed once outside a loop.
5.7.1 Avoid Small Integer and Small Logical Data Items
Avoid using integer or logical data less than 32 bits, because the smallest unit of efficient access on Alpha systems is 32 bits.
Accessing a 16-bit (or 8-bit) data type can result in a sequence of machine instructions to access the data, rather than a single, efficient machine instruction for a 32-bit data item.
To minimize data storage and memory cache misses with arrays, use
32-bit data rather than 64-bit data, unless you require the greater
numeric range of 8-byte integers or the greater range and precision of
double precision floating-point numbers.
5.7.2 Avoid Mixed Data Type Arithmetic Expressions
Avoid mixing integer and floating-point (REAL) data in the same computation. Expressing all numbers in a floating-point arithmetic expression (assignment statement) as floating-point values eliminates the need to convert data between fixed and floating-point formats. Expressing all numbers in an integer arithmetic expression as integer values also achieves this. This improves run-time performance.
For example, assuming that I and J are both INTEGER variables, expressing a constant number (2.) as an integer value (2) eliminates the need to convert the data:
Original Code: |
INTEGER I, J
I = J / 2. |
Efficient Code: |
INTEGER I, J
I = J / 2 |
For applications with numerous floating-point operations, consider using the -fp_reorder option (see Section 5.9.7) if a small difference in the result is acceptable.
You can use different sizes of the same general data type in an
expression with minimal or no effect on run-time performance. For
example, using REAL, DOUBLE PRECISION, and COMPLEX floating-point
numbers in the same floating-point arithmetic expression has minimal or
no effect on run-time performance.
5.7.3 Use Efficient Data Types
In cases where more than one data type can be used for a variable, consider selecting the data types based on the following hierarchy, listed from most to least efficient:
However, keep in mind that in an arithmetic expression, you should
avoid mixing integer and floating-point (REAL) data (see Section 5.7.2).
5.7.4 Avoid Using Slow Arithmetic Operators
Before you modify source code to avoid slow arithmetic operators, be aware that optimizations convert many slow arithmetic operators to faster arithmetic operators. For example, the compiler optimizes the expression H=J**2 to be H=J*J.
Consider also whether replacing a slow arithmetic operator with a faster arithmetic operator will change the accuracy of the results or impact the maintainability (readability) of the source code.
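For example, in a critical loop a repeated division by a loop-invariant value can often be replaced by multiplication with a precomputed reciprocal. This is only a sketch, and the rounding of the results can differ slightly:

REAL A(1000), B(1000), SCALE, RECIP
INTEGER I
   .
   .
   .
RECIP = 1.0/SCALE          ! one division, outside the loop
DO I = 1, 1000
   A(I) = B(I)*RECIP       ! instead of B(I)/SCALE on every iteration
END DO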
Replacing slow arithmetic operators with faster ones should be reserved for critical code areas. The following hierarchy lists the Compaq Fortran arithmetic operators, from fastest to slowest:
Avoid using EQUIVALENCE statements. EQUIVALENCE statements can:
Whenever the Compaq Fortran compiler has access to the use and definition of a subprogram during compilation, it may choose to inline the subprogram. Using statement functions and internal subprograms maximizes the number of subprogram references that will be inlined, especially when multiple source files are compiled together at optimization level -o4 or higher.
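For example, in the following sketch (hypothetical names) the function SCALE3 is an internal subprogram, so its definition is visible to the compiler and the reference inside the loop is a good candidate for inlining:

PROGRAM FILTER
  REAL X(1000), Y(1000)
  INTEGER I
  X = 1.0
  DO I = 1, 1000
     Y(I) = SCALE3(X(I))         ! reference that can be inlined
  END DO
  PRINT *, Y(1000)
CONTAINS
  REAL FUNCTION SCALE3 (V)       ! internal subprogram
    REAL, INTENT(IN) :: V
    SCALE3 = 3.0*V + 1.0
  END FUNCTION SCALE3
END PROGRAM FILTER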
For More Information:
Minimize the arithmetic operations and other operations in a DO loop whenever possible. Moving unnecessary operations outside the loop will improve performance (for example, when the intermediate nonvarying values within the loop are not needed).
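For example, a loop-invariant subexpression can be computed once before the loop rather than on every iteration (the optimizer often does this itself when it can prove the operands do not change; the sketch below shows the manual form):

! Recomputed on every iteration:
DO I = 1, N
   A(I) = B(I) * (P/Q + R)
END DO

! Computed once, outside the loop:
T = P/Q + R
DO I = 1, N
   A(I) = B(I) * T
END DO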
Compaq Fortran performs many optimizations by default. You do not have to recode your program to use them. However, understanding how optimizations work helps you remove any inhibitors to their successful function.
Generally, Compaq Fortran increases compile time in favor of decreasing run time. If an operation can be performed, eliminated, or simplified at compile time, Compaq Fortran does so, rather than have it done at run time. The time required to compile the program usually increases as more optimizations occur.
The program will likely execute faster when compiled at -o4 , but will require more compilation time than if you compile the program at a lower level of optimization.
The size of the object file varies with the optimizations requested; loop unrolling and procedure inlining in particular can increase object file size.
Table 5-4 lists the levels of Compaq Fortran optimization with different -o options. For example: -o0 specifies no selectable optimizations (some optimizations always occur); -o5 specifies all levels of optimizations, including loop transformation.
Optimization Type | -o0 | -o1 | -o2 | -o3 | -o4 | -o5 |
---|---|---|---|---|---|---|
Loop transformation | | | | | | X |
Software pipelining | | | | | X | X |
Automatic inlining | | | | | X | X |
Additional global optimizations | | | | X | X | X |
Global optimizations | | | X | X | X | X |
Local (minimal) optimizations | | X | X | X | X | X |
The default is -o4 (same as -o ). However, if -g2 , -g , or -gen_feedback is also specified, the default is -o0 (no optimizations).
In Table 5-4, the following terms are used to describe the levels of optimization:
The following optimizations occur at any optimization level ( -o0 through -o5 ):
SUM(A,B) = A+B
   .
   .
   .
Y = 3.14
X = SUM(Y,3.0)    ! With value propagation, becomes: X = 6.14
To enable local optimizations, use -o1 or a higher optimization level ( -o2 , -o3 , -o4 , or -o5 ).
To prevent local optimizations, specify the
-o0
option.
5.8.2.1 Common Subexpression Elimination
If the same subexpressions appear in more than one computation and the values do not change between computations, Compaq Fortran computes the result once and replaces the subexpressions with the result itself:
DIMENSION A(25,25), B(25,25)
A(I,J) = B(I,J)
Without optimization, these statements can be compiled as follows:
t1 = ((J-1)*25+(I-1))*4
t2 = ((J-1)*25+(I-1))*4
A(t1) = B(t2)
Variables t1 and t2 represent equivalent expressions. Compaq Fortran eliminates this redundancy by producing the following:
t = ((J-1)*25+(I-1))*4
A(t) = B(t)
Expansion of multiplication and division refers to bit shifts that allow faster multiplication and division while producing the same result. For example, the integer expression (I*17) can be calculated as I with a 4-bit shift plus the original value of I. This can be expressed using the Compaq Fortran ISHFT intrinsic function:
J1 = I*17
J2 = ISHFT(I,4) + I    ! equivalent expression for I*17
The optimizer uses machine code that, like the ISHFT intrinsic
function, shifts bits to expand multiplication and division by literals.
5.8.2.3 Compile-Time Operations
Compaq Fortran does as many operations as possible at compile time rather than at run time.
Compaq Fortran can perform many operations on constants (including PARAMETER constants):
PARAMETER (NN=27)
I = 2*NN+J             ! Becomes: I = 54 + J
REAL X, Y
X = 10 * Y             ! Becomes: X = 10.0 * Y
INTEGER I(10,10)
I(1,2) = I(4,5)        ! Compiled as a direct load and store
Algebraic Reassociation Optimizations
Compaq Fortran delays operations to see whether they have no effect or can be transformed to have no effect. If they have no effect, these operations are removed. A typical example involves unary minus and .NOT. operations:
X = -Y * -Z            ! Becomes: X = Y * Z
Compaq Fortran tracks the values assigned to variables and constants, including those from DATA statements, and traces them to every place they are used. Compaq Fortran uses the value itself when it is more efficient to do so.
When compiling subprograms, Compaq Fortran analyzes the program to ensure that propagation is safe if the subroutine is called more than once.
Value propagation frequently leads to more value propagation. Compaq Fortran can eliminate run-time operations, comparisons and branches, and whole statements.
In the following example, constants are propagated, eliminating multiple operations from run time:
Original Code:
PI = 3.14
   .
   .
   .
PIOVER2 = PI/2
   .
   .
   .
I = 100
   .
   .
   .
IF (I.GT.1) GOTO 10
10 A(I) = 3.0*Q

Optimized Code:
   .
   .
   .
PIOVER2 = 1.57
   .
   .
   .
I = 100
   .
   .
   .
10 A(100) = 3.0*Q
If a variable is assigned but never used, Compaq Fortran eliminates the entire assignment statement:
X = Y*Z          ! This assignment to X is eliminated
   .
   .
   .
X = A(I,J)*PI    ! This assignment to X remains
Some programs used for performance analysis often contain such
unnecessary operations. When you try to measure the performance of such
programs compiled with Compaq Fortran, these programs may show
unrealistically good performance results. Realistic results are
possible only with program units using their results in output
statements.
5.8.2.6 Register Usage
A large program usually has more data that would benefit from being held in registers than there are registers to hold the data. In such cases, Compaq Fortran typically tries to use the registers according to the following descending priority list:
Compaq Fortran uses heuristic algorithms and a modest amount of computation to attempt to determine an effective usage for the registers.
Holding Variables in Registers
Because operations using registers are much faster than using memory, Compaq Fortran generates code that uses the Alpha 64-bit integer and floating-point registers instead of memory locations. Knowing when Compaq Fortran uses registers may be helpful when doing certain forms of debugging.
Compaq Fortran uses registers to hold the values of variables whenever the Fortran language does not require them to be held in memory, such as holding the values of temporary results of subexpressions, even if -o0 (no optimization) was specified.
Compaq Fortran may hold the same variable in different registers at different points in the program:
V = 3.0*Q
   .
   .
   .
X = SIN(Y)*V
   .
   .
   .
V = PI*X
   .
   .
   .
Y = COS(Y)*V
Compaq Fortran may choose one register to hold the first use of V and another register to hold the second. Both registers can be used for other purposes at points in between. There may be times when the value of the variable does not exist anywhere in the registers. If the value of V is never needed in memory, it is never assigned.
Compaq Fortran uses registers to hold the values of I, J, and K (so long as there are no other optimization effects, such as loops involving the variables):
A(I) = B(J) + C(K) |
More typically, an expression uses the same index variable:
A(K) = B(K) + C(K) |
In this case, K is loaded into only one register and is used to index
all three arrays at the same time.
5.8.2.7 Mixed Real/Complex Operations
In mixed REAL/COMPLEX operations, Compaq Fortran avoids the conversion and performs a simplified operation on:
For example, if variable R is REAL and A and B are COMPLEX, no conversion occurs with the following:
COMPLEX A, B
   .
   .
   .
B = A + R
To enable global optimizations, use -o2 or a higher optimization level ( -o3 , -o4 , or -o5 ). Using -o2 or higher also enables local optimizations ( -o1 ).
Global optimizations include:
Data flow analysis and split lifetime analysis (global data analysis) traces the values of variables and whole arrays as they are created and used in different parts of a program unit. During this analysis, Compaq Fortran assumes that any pair of array references to a given array might access the same memory location, unless a constant subscript is used in both cases.
To eliminate unnecessary recomputations of invariant expressions in loops, Compaq Fortran hoists them out of the loops so they execute only once.
Global data analysis includes which data items are selected for analysis. Some data items are analyzed as a group and some are analyzed individually. Compaq Fortran limits or may disqualify data items that participate in the following constructs, generally because it cannot fully trace their values:
COMMON /X/ I
DO J=1,N
  I = J
  CALL FOO
  A(I) = I
ENDDO
To enable additional global optimizations, use -o3 or a higher optimization level ( -o4 or -o5 ). Using -o3 or higher also enables local optimizations ( -o1 ) and global optimizations ( -o2 ).
Additional global optimizations improve speed at the cost of longer
compile times and possibly extra code size.
5.8.4.1 Loop Unrolling
At optimization level -o3 or above, Compaq Fortran attempts to unroll certain innermost loops, minimizing the number of branches and grouping more instructions together to allow efficient overlapped instruction execution (instruction pipelining). The best candidates for loop unrolling are innermost loops with limited control flow.
As more loops are unrolled, the average size of basic blocks increases. Loop unrolling generates multiple copies of the code for the loop body (loop code iterations) in a manner that allows efficient instruction pipelining.
The loop body is replicated a certain number of times, substituting index expressions. An initialization loop might be created to align the first reference with the main series of loops. A remainder loop might be created for leftover work.
The loop unroller also inserts data prefetches for arrays with affine subscripts. Prefetches (that is, prefetch instructions) can be inserted even if the unroller chooses not to unroll. On some architectures (21264 and later), write-hint instructions are also generated.
The number of times a loop is unrolled can be determined either by the optimizer or by using the -unroll num option, which can specify the limit for loop unrolling. Unless the user specifies a value, the optimizer will choose an unroll amount that minimizes the overhead of prefetching while also limiting code size expansion.
Array operations are often represented as a nested series of loops when expanded into instructions. The innermost loop for the array operation is the best candidate for loop unrolling (like DO loops). For example, the following array operation (once optimized) is represented by nested loops, where the innermost loop is a candidate for loop unrolling:
A(1:100,2:30) = B(1:100,1:29) * 2.0 |
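Conceptually, unrolling an innermost loop by four produces code like the following hand-unrolled sketch; the compiler generates this form automatically at -o3 and above, and adds a remainder loop when the trip count is not a multiple of the unroll amount (assumed to be a multiple of 4 here):

DO I = 1, N, 4
   A(I)   = B(I)   * 2.0
   A(I+1) = B(I+1) * 2.0
   A(I+2) = B(I+2) * 2.0
   A(I+3) = B(I+3) * 2.0
END DO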
For More Information:
In addition to loop unrolling and other optimizations, the number of branches are reduced by replicating code that will eliminate branches. Code replication decreases the number of basic blocks and increases instruction-scheduling opportunities.
Code replication normally occurs when a branch is at the end of a flow of control, such as a routine with multiple, short exit sequences. The code at the exit sequence gets replicated at the various places where a branch to it might occur.
For example, consider the following unoptimized routine and its optimized equivalent that uses code replication (R0 is register 0):
[Instruction listing: Unoptimized Instructions compared with Optimized (Replicated) Instructions]
Similarly, code replication can also occur within a loop that contains
a small amount of shared code at the bottom of a loop and a case-type
dispatch within the loop. The loop-end test-and-branch code might be
replicated at the end of each case to create efficient instruction
pipelining within the code for each case.
5.8.5 Automatic Inlining
To enable optimizations that perform automatic inlining, use -o4 or a higher optimization level ( -o5 ). Using -o4 also enables local optimizations ( -o1 ), global optimizations ( -o2 ), and additional global optimizations ( -o3 ).
The default is -o4 (unless -g2, -g, or -gen_feedback is specified).
5.8.5.1 Interprocedure Analysis
Compiling multiple source files at optimization level -o4 or higher lets the compiler examine more code for possible optimizations, including multiple program units. This results in:
As more procedures are inlined, the size of the executable program and
compile times may increase, but execution time should decrease.
5.8.5.2 Inlining Procedures
Inlining refers to replacing a subprogram reference (such as a CALL statement or function invocation) with the replicated code of the subprogram. As more procedures are inlined, global optimizations often become more effective.
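The effect can be pictured with a hedged, schematic example (the routine name and loop body are hypothetical):

! Original code: each invocation pays call overhead, and the loop inside
! SCALE_ADD is hidden from optimizations in the caller.
CALL SCALE_ADD(A, B, N, S)

! After inlining, the subprogram body replaces the call, so the loop is
! exposed to global optimization within the calling program unit.
DO I = 1, N
   A(I) = A(I) + S*B(I)
END DO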
The optimizer inlines small procedures, limiting inlining candidates based on such criteria as:
You can specify:
Software pipelining and additional software dependence analysis are enabled by using the -pipeline option, the -o4 option, or the -o5 option. Software pipelining in certain cases improves run-time performance.
Software pipelining applies instruction scheduling to certain innermost loops, allowing instructions within a loop to "wrap around" and execute in a different iteration of the loop. This can reduce the impact of long-latency operations, resulting in faster loop execution.
Software pipelining also includes associated additional software dependence analysis and enables the prefetching of data to reduce the impact of cache misses.
Loop unrolling (enabled at -o3 or above) cannot schedule across iterations of a loop. Because software pipelining can schedule across loop iterations, it can perform more efficient scheduling to eliminate instruction stalls within loops.
For instance, if software dependence analysis of data flow reveals that certain calculations can be done before or after a given iteration of the loop, software pipelining reschedules those instructions ahead of or behind that iteration, at places where their execution can prevent instruction stalls or otherwise improve performance.
Software pipelining can be more effective when you combine -pipeline (or -o4 or -o5 ) with the appropriate -tune keyword for the target Alpha processor generation (see Section 5.9.4).
To specify software pipelining without loop transformation optimizations, do one of the following:
This optimization is not performed at optimization levels below -o2 .
Loops chosen for software pipelining:
By modifying the unrolled loop and inserting instructions as needed before and/or after it, software pipelining generally improves run-time performance. The exception is loops that contain a large number of instructions with many existing overlapped operations; in that case, software pipelining may not have enough registers available to improve execution performance, and run-time performance using -o4 or -o5 (or -pipeline ) may be no better than using -o3 .
This option might increase compilation time and/or program size. For programs that contain loops that exhaust available registers, longer execution times may occur. In this case, specify options -unroll 1 or -unroll 2 with the -pipeline option.
To determine whether using -pipeline benefits your particular program, you should time program execution for the same program (or subprogram) compiled with and without software pipelining (such as with -pipeline and -nopipeline ).
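For example, a comparison might look like the following (the file names are hypothetical; time is the shell's timing command):

% f90 -O4 -pipeline -o pipe.out prog.f90
% time pipe.out
% f90 -O4 -nopipeline -o nopipe.out prog.f90
% time nopipe.out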
For programs that contain loops that exhaust available registers, longer execution times may result with -o4 or -o5 , requiring use of -unroll n to limit loop unrolling (see Section 3.94).
The loop transformation optimizations are enabled by using the -transform_loops option or the -o5 option. Loop transformation attempts to improve performance by rewriting loops to make better use of the memory system. By rewriting loops, the loop transformation optimizations can increase the number of instructions executed, which can degrade the run-time performance of some programs.
To request loop transformation optimizations without software pipelining, do one of the following:
This optimization is not performed at optimization levels below -o2 .
If you specify -o5 and want this type of optimization disabled, you must also specify -notransform_loops .
The loop transformation optimizations apply to array references within loops. These optimizations can improve the performance of the memory system and usually apply to multiple nested loops.
The loops chosen for loop transformation optimizations are always counted loops. A counted loop uses a variable to count iterations, so the number of iterations is known before the loop is entered. For example, DO loops with an explicit iteration count are normally counted loops, but uncounted DO WHILE loops are not.
Conditions that typically prevent the loop transformation optimizations from occurring include subprogram references that are not inlined (such as an external function call), complicated exit conditions, and uncounted loops.
The types of optimizations associated with -transform_loops include the following:
To determine whether using -transform_loops benefits your particular program, you should time program execution for the same program (or subprogram) compiled with and without loop transformation optimizations (such as with -transform_loops and -notransform_loops ).
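As one hedged illustration of this kind of rewriting (the arrays and loop bounds here are hypothetical, and the compiler's actual output may differ), a nest whose inner loop strides across memory can be interchanged so the inner loop runs down the contiguous first dimension:

! Before: the inner loop varies the second subscript, so successive
! references to A and B are far apart in memory.
DO I = 1, N
   DO J = 1, M
      A(I,J) = A(I,J) + B(I,J)
   END DO
END DO

! After interchange: the inner loop varies the first subscript, which is
! contiguous in Fortran, making better use of cache lines.
DO J = 1, M
   DO I = 1, N
      A(I,J) = A(I,J) + B(I,J)
   END DO
END DO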
In addition to the -on options (discussed in Section 5.8), several other f90 command options can prevent or facilitate improved optimizations.
5.9.1 Setting Multiple Options with the -fast Option
Specifying the -fast option sets many performance options. For details, see Section 3.40, -fast --- Set Options to Improve Run-Time Performance.
5.9.2 Controlling the Number of Times a Loop Is Unrolled
You can specify the number of times a loop is unrolled by using the -unroll num option (see Section 3.94).
The -unroll num option can also influence the run-time results of software pipelining optimizations performed when you specify one of the following:
Although unrolling loops usually improves run-time performance, the size of the executable program may increase.
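For example (the file name is hypothetical, and the unroll factor of 4 is arbitrary; it overrides the amount the optimizer would otherwise choose):

% f90 -O4 -unroll 4 -o main.out main.f90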
To specify the types of procedures to be inlined, use the -inline keyword option. Also, compile multiple source files together and specify an adequate optimization level, such as -o4 .
If you omit both -noinline and the -inline keyword option, the optimization level ( -on option) in use determines the types of procedures that are inlined.
Maximizing the types of procedures that are inlined usually improves run-time performance, but compile-time memory usage and the size of the executable program may increase.
To determine whether using -inline all benefits your particular program, time program execution for the same program compiled with and without -inline all .
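For example (hypothetical file names; the second build relies on the default inlining behavior for the optimization level in use, and time is the shell's timing command):

% f90 -O4 -inline all -o inline_all.out main.f90 sub2.f90 sub3.f90
% time inline_all.out
% f90 -O4 -o default.out main.f90 sub2.f90 sub3.f90
% time default.out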
For More Information:
You can specify the types of optimized code to be generated by using the -tune keyword and -arch keyword options. Regardless of the specified keyword, the generated code will run correctly on all implementations of the Alpha architecture. Tuning for a specific implementation can improve run-time performance; it is also possible that code tuned for a specific target may run slower on another target.
Specifying the correct keyword for -tune keyword for the target processor generation type usually slightly improves run-time performance. Unless you request software pipelining, the run-time performance difference for using the wrong keyword for -tune keyword (such as using -tune ev4 for an ev5 processor) is usually less than 5%. When using software pipelining (using -o4 or -o5 ) with -tune keyword , the difference can be more than 5%.
The combination of the specified keyword for -tune keyword and the type of processor generation used has no effect on producing the expected correct program results.
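For example (the file name is hypothetical; ev5 is just one of the supported processor keywords), a build tuned for an EV5-generation processor might look like this:

% f90 -O4 -tune ev5 -o main.out main.f90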
For More Information:
(TU*X ONLY) Speculative execution reduces instruction latency stalls to improve run-time performance for certain programs or routines. Speculative execution evaluates conditional code (including exceptions) and moves instructions that would otherwise be executed conditionally to a position before the test, so they are executed unconditionally.
The default, -speculate none , means that the speculative execution code scheduling optimization is not used and exceptions are reported as expected. You can specify -speculate all or -speculate by_routine to request the speculative execution optimization.
Performance improvements may be reduced because the run-time system must dismiss exceptions caused by speculative instructions. For certain programs, longer execution times may result when using the speculative execution optimization. To determine whether using -speculate all or -speculate by_routine benefits your particular program, you should time the program execution with one of these options for the same program compiled with -speculate none (default).
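For example, a comparison might look like the following (hypothetical file names; the second command relies on the default -speculate none, and time is the shell's timing command):

% f90 -O4 -speculate all -o spec.out main.f90
% time spec.out
% f90 -O4 -o nospec.out main.f90
% time nospec.out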
Speculative execution does not support some run-time error checking, because exception and signal processing (including SIGSEGV, SIGBUS, and SIGFPE) becomes conditional. When you need to debug the program or test for errors, use only -speculate none .
When you specify -non_shared to request a nonshared object file, you can specify the -om option to request code optimizations after linking, including nop (No Operation) removal, .lita removal, and reallocation of common symbols. This option also positions the global pointer register so the maximum addresses fall in the global-pointer window.
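A hedged sketch of the usage (the file name is hypothetical):

% f90 -O4 -non_shared -om -o main.out main.f90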
If you use the -fp_reorder option (or -assume noaccuracy_sensitive , which are equivalent), Compaq Fortran may reorder code (based on algebraic identities) to improve performance.
For example, the following expressions are mathematically equivalent but may not compute the same value using finite precision arithmetic:
X = (A + B) + C
X = A + (B + C)
The results can be slightly different from those obtained with the default ( -no_fp_reorder ) because of the way intermediate results are rounded. However, the -fp_reorder results are not categorically less accurate than those produced by the default. In fact, dot product summations using -fp_reorder can produce more accurate results than those using -no_fp_reorder .
The effect of -fp_reorder is important when Compaq Fortran hoists divide operations out of a loop. If -fp_reorder is in effect, the unoptimized loop shown on the left below becomes the optimized loop on the right:
Unoptimized Code | Optimized Code
---|---
 | T = 1/V
DO I=1,N | DO I=1,N
. | .
. | .
. | .
B(I) = A(I)/V | B(I) = A(I)*T
END DO | END DO
The transformation in the optimized loop increases performance significantly, and loses little or no accuracy. However, it does have the potential for raising overflow or underflow arithmetic exceptions.
The compiler can also reorder code based on algebraic identities to improve performance if you specify -fast .
5.9.8 Dummy Aliasing Assumption
Some programs compiled with Compaq Fortran (or Compaq Fortran 77) may produce results that differ from the results of other Fortran compilers. Such programs may alias dummy arguments to each other, to a variable in a common block, or to a variable shared through use association, where at least one of the accesses is a store.
This program behavior is prohibited by the Fortran 95/90 standards, but Compaq Fortran does not diagnose it. Other versions of Fortran allow dummy aliases and check for them to ensure correct results. Compaq Fortran, however, assumes that no dummy aliasing will occur, and it can ignore potential data dependencies from this source in favor of faster execution.
The Compaq Fortran default is safe for programs conforming to the Fortran 95/90 standards. It will improve performance of these programs, because the standard prohibits such programs from passing overlapped variables or arrays as actual arguments if either is assigned in the execution of the program unit.
The -assume dummy_aliases option allows dummy aliasing. It ensures correct results by assuming the exact order of the references to dummy and common variables is required. Program units taking advantage of this behavior can produce inaccurate results if compiled with -assume nodummy_aliases .
Example 5-1 is taken from the DAXPY routine in the Fortran-77 version of the Basic Linear Algebra Subroutines (BLAS).
Example 5-1 Using the -assume dummy_aliases Option
      SUBROUTINE DAXPY(N,DA,DX,INCX,DY,INCY)
C     Constant times a vector plus a vector.
C     uses unrolled loops for increments equal to 1.
      DOUBLE PRECISION DX(1), DY(1), DA
      INTEGER I,INCX,INCY,IX,IY,M,MP1,N
C
      IF (N.LE.0) RETURN
      IF (DA.EQ.0.0) RETURN
      IF (INCX.EQ.1.AND.INCY.EQ.1) GOTO 20
C     Code for unequal increments or equal increments
C     not equal to 1.
        .
        .
        .
      RETURN
C     Code for both increments equal to 1.
C     Clean-up loop
 20   M = MOD(N,4)
      IF (M.EQ.0) GOTO 40
      DO I=1,M
         DY(I) = DY(I) + DA*DX(I)
      END DO
      IF (N.LT.4) RETURN
 40   MP1 = M + 1
      DO I = MP1, N, 4
         DY(I) = DY(I) + DA*DX(I)
         DY(I + 1) = DY(I + 1) + DA*DX(I + 1)
         DY(I + 2) = DY(I + 2) + DA*DX(I + 2)
         DY(I + 3) = DY(I + 3) + DA*DX(I + 3)
      END DO
      RETURN
      END SUBROUTINE
The second DO loop contains assignments to DY. If DY is overlapped with DA, any of the assignments to DY might give DA a new value, and this overlap would affect the results. If this overlap is desired, then DA must be fetched from memory each time it is referenced. The repetitious fetching of DA degrades performance.
Linking Routines with Opposite Settings
You can link routines compiled with the -assume dummy_aliases option to routines compiled with -assume nodummy_aliases . For example, if only one routine is called with dummy aliases, you can use -assume dummy_aliases when compiling that routine, and compile all the other routines with -assume nodummy_aliases to gain the performance value of that option.
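For example (hypothetical file names; -assume nodummy_aliases is the default, so listing it explicitly is optional), only the routine that relies on dummy aliasing needs the slower assumption:

% f90 -c -assume dummy_aliases daxpy.f
% f90 -c -assume nodummy_aliases main.f90
% f90 -o main.out main.o daxpy.o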
Programs calling DAXPY with DA overlapping DY do not conform to the FORTRAN-77 and Fortran 95/90 standards. However, they are supported if -assume dummy_aliases was used to compile the DAXPY routine.