A large program usually has more data that would benefit from being held in registers than there are registers to hold the data. In such cases, HP Fortran typically tries to use the registers according to the following descending priority list:
HP Fortran uses heuristic algorithms and a modest amount of computation to attempt to determine an effective usage for the registers.
Holding Variables in Registers
Because operations using registers are much faster than operations using memory, HP Fortran generates code that uses 64-bit integer and floating-point registers instead of memory locations. Knowing when HP Fortran uses registers may be helpful when doing certain forms of debugging.
HP Fortran uses registers to hold the values of variables whenever the Fortran language does not require them to be held in memory (for example, to hold the temporary results of subexpressions), even if /NOOPTIMIZE (same as /OPTIMIZE=LEVEL=0 or no optimization) was specified.
HP Fortran may hold the same variable in different registers at different points in the program:
    V = 3.0*Q
      . . .
    X = SIN(Y)*V
      . . .
    V = PI*X
      . . .
    Y = COS(Y)*V
HP Fortran might choose one register to hold the first use of V and another register to hold the second. Both registers can be used for other purposes at points in between. There may be times when the value of the variable does not exist anywhere in the registers. If the value of V is never needed in memory, it is never stored.
HP Fortran uses registers to hold the values of I, J, and K (so long as there are no other optimization effects, such as loops involving the variables):
    A(I) = B(J) + C(K)
More typically, an expression uses the same index variable:
    A(K) = B(K) + C(K)
In this case, K is loaded into only one register and is used to index
all three arrays at the same time.
5.7.2.7 Mixed Real/Complex Operations
In mixed REAL/COMPLEX operations, HP Fortran avoids the conversion and performs a simplified operation on:
For example, if variable R is REAL and A and B are COMPLEX, no conversion occurs with the following:
    COMPLEX A, B
      . . .
    B = A + R
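The difference between the converted and the simplified form of B = A + R can be sketched at the source level. This is only an illustration under stated assumptions, not the code HP Fortran generates: the module and function names (MIXED_SKETCH, ADD_CONVERTED, ADD_SIMPLIFIED) are hypothetical. ADD_CONVERTED first promotes R to COMPLEX, while ADD_SIMPLIFIED adds R to the real part only and leaves the imaginary part untouched.

```fortran
MODULE MIXED_SKETCH
  IMPLICIT NONE
CONTAINS
  ! Full conversion: R becomes (R, 0.0) and a complete complex add is done.
  FUNCTION ADD_CONVERTED(A, R) RESULT(B)
    COMPLEX, INTENT(IN) :: A
    REAL, INTENT(IN) :: R
    COMPLEX :: B
    B = A + CMPLX(R, 0.0)
  END FUNCTION ADD_CONVERTED

  ! Simplified operation: only the real part takes part in the add.
  FUNCTION ADD_SIMPLIFIED(A, R) RESULT(B)
    COMPLEX, INTENT(IN) :: A
    REAL, INTENT(IN) :: R
    COMPLEX :: B
    B = CMPLX(REAL(A) + R, AIMAG(A))
  END FUNCTION ADD_SIMPLIFIED
END MODULE MIXED_SKETCH
```

For ordinary finite values both functions return the same result; the simplified form just performs one floating-point add instead of two.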
To enable global optimizations, use /OPTIMIZE=LEVEL=2 or a higher optimization level (LEVEL=3, LEVEL=4, or LEVEL=5). Using /OPTIMIZE=LEVEL=2 or higher also enables local optimizations (LEVEL=1).
Global optimizations include:
Data-flow and split lifetime analysis (global data analysis) traces the values of variables and whole arrays as they are created and used in different parts of a program unit. During this analysis, HP Fortran assumes that any pair of array references to a given array might access the same memory location, unless a constant subscript is used in both cases.
To eliminate unnecessary recomputations of invariant expressions in loops, HP Fortran hoists them out of the loops so they execute only once.
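Hoisting an invariant expression can be pictured as the following hand-written rewrite. It is a minimal sketch, assuming hypothetical names (HOIST_SKETCH, SCALE_AS_WRITTEN, SCALE_HOISTED) and an invented expression SQRT(P)*Q; the compiler performs the equivalent change on its internal representation, not on the source.

```fortran
MODULE HOIST_SKETCH
  IMPLICIT NONE
CONTAINS
  ! As written: SQRT(P)*Q does not change inside the loop, yet the
  ! source asks for it on every iteration.
  SUBROUTINE SCALE_AS_WRITTEN(X, P, Q)
    REAL, INTENT(INOUT) :: X(:)
    REAL, INTENT(IN) :: P, Q
    INTEGER :: I
    DO I = 1, SIZE(X)
      X(I) = X(I) * (SQRT(P) * Q)
    END DO
  END SUBROUTINE SCALE_AS_WRITTEN

  ! Hand-hoisted: the invariant expression is computed once, before
  ! the loop, and held in the temporary T.
  SUBROUTINE SCALE_HOISTED(X, P, Q)
    REAL, INTENT(INOUT) :: X(:)
    REAL, INTENT(IN) :: P, Q
    REAL :: T
    INTEGER :: I
    T = SQRT(P) * Q
    DO I = 1, SIZE(X)
      X(I) = X(I) * T
    END DO
  END SUBROUTINE SCALE_HOISTED
END MODULE HOIST_SKETCH
```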
Global data analysis includes determining which data items are selected for analysis. Some data items are analyzed as a group and some are analyzed individually. HP Fortran limits the analysis of, or may disqualify, data items that participate in the following constructs, generally because it cannot fully trace their values.
Data items in the following constructs can make global optimizations less effective:
    COMMON /X/ I
    DO J=1,N
       I = J
       CALL FOO
       A(I) = I
    ENDDO
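The reason a loop like this is hard to optimize is that FOO can reach I through the common block, so the value assigned before the call cannot be assumed to survive it. The following sketch makes that visible. The body of FOO is purely hypothetical (the text does not say what FOO does), as are the names COMMON_SKETCH and FILL.

```fortran
MODULE COMMON_SKETCH
  IMPLICIT NONE
CONTAINS
  ! Hypothetical FOO: it shares COMMON /X/ with its caller and changes
  ! I by replacing each even value with the preceding odd value.
  SUBROUTINE FOO
    INTEGER :: I
    COMMON /X/ I
    IF (MOD(I, 2) == 0) I = I - 1
  END SUBROUTINE FOO

  ! The loop from the text, wrapped in a subroutine.
  SUBROUTINE FILL(A, N)
    INTEGER, INTENT(IN) :: N
    INTEGER, INTENT(OUT) :: A(N)
    INTEGER :: I, J
    COMMON /X/ I
    A = 0
    DO J = 1, N
      I = J
      CALL FOO
      A(I) = I     ! I must be read again here; FOO may have changed it
    END DO
  END SUBROUTINE FILL
END MODULE COMMON_SKETCH
```

With this FOO, the even-numbered elements of A are never assigned, which shows that substituting J for I after the call would change the program's results.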
To enable additional global optimizations, use /OPTIMIZE=LEVEL=3 or a higher optimization level (LEVEL=4 or LEVEL=5). Using /OPTIMIZE=LEVEL=3 or higher also enables local optimizations (LEVEL=1) and global optimizations (LEVEL=2).
Additional global optimizations improve speed at the cost of longer
compile times and possibly extra code size.
5.7.4.1 Loop Unrolling
At optimization level /OPTIMIZE=LEVEL=3 or above, HP Fortran attempts to unroll certain innermost loops, minimizing the number of branches and grouping more instructions together to allow efficient overlapped instruction execution (instruction pipelining). The best candidates for loop unrolling are innermost loops with limited control flow.
As more loops are unrolled, the average size of basic blocks increases. Loop unrolling generates multiple copies of the code for the loop body (loop code iterations) in a manner that allows efficient instruction pipelining.
The loop body is replicated a certain number of times, substituting index expressions. An initialization loop might be created to align the first reference with the main series of loops. A remainder loop might be created for leftover work.
The number of times a loop is unrolled can be determined either by the optimizer or by using the /OPTIMIZE=UNROLL=n qualifier, which can specify the limit for loop unrolling. Unless the user specifies a value, the optimizer unrolls a loop four times for most loops or two times for certain loops (large estimated code size or branches out of the loop).
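A loop unrolled four times, with a remainder loop for leftover work, can be written by hand as shown below. This is a source-level sketch only; the names (UNROLL_SKETCH, DOUBLE_ROLLED, DOUBLE_UNROLLED) and the loop body are hypothetical, and the code the optimizer actually produces may differ (for example, it may also add an initialization loop). Y is assumed to be at least as large as X.

```fortran
MODULE UNROLL_SKETCH
  IMPLICIT NONE
CONTAINS
  ! The loop as the programmer writes it: one branch per element.
  SUBROUTINE DOUBLE_ROLLED(X, Y)
    REAL, INTENT(IN) :: X(:)
    REAL, INTENT(OUT) :: Y(:)
    INTEGER :: I
    DO I = 1, SIZE(X)
      Y(I) = 2.0 * X(I)
    END DO
  END SUBROUTINE DOUBLE_ROLLED

  ! The same work unrolled by four: one branch per four elements.
  SUBROUTINE DOUBLE_UNROLLED(X, Y)
    REAL, INTENT(IN) :: X(:)
    REAL, INTENT(OUT) :: Y(:)
    INTEGER :: I, N, NMAIN
    N = SIZE(X)
    NMAIN = N - MOD(N, 4)        ! largest multiple of 4 not exceeding N
    DO I = 1, NMAIN, 4           ! main loop: body replicated four times
      Y(I)   = 2.0 * X(I)
      Y(I+1) = 2.0 * X(I+1)
      Y(I+2) = 2.0 * X(I+2)
      Y(I+3) = 2.0 * X(I+3)
    END DO
    DO I = NMAIN + 1, N          ! remainder loop: at most 3 iterations
      Y(I) = 2.0 * X(I)
    END DO
  END SUBROUTINE DOUBLE_UNROLLED
END MODULE UNROLL_SKETCH
```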
Array operations are often represented as a nested series of loops when expanded into instructions. The innermost loop for the array operation is the best candidate for loop unrolling (like DO loops). For example, the following array operation (once optimized) is represented by nested loops, where the innermost loop is a candidate for loop unrolling:
    A(1:100,2:30) = B(1:100,1:29) * 2.0
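Written out as explicit DO loops, the array operation corresponds to the nest in SCALE_NESTED_LOOPS below. The sketch assumes both arrays are declared with shape (100,30), which the text does not state, and the module and subroutine names are hypothetical. The inner loop over I is the one that would be considered for unrolling.

```fortran
MODULE ARRAY_LOOP_SKETCH
  IMPLICIT NONE
CONTAINS
  ! The array operation as written in the text.
  SUBROUTINE SCALE_ARRAY_SYNTAX(A, B)
    REAL, INTENT(INOUT) :: A(100,30)
    REAL, INTENT(IN) :: B(100,30)
    A(1:100,2:30) = B(1:100,1:29) * 2.0
  END SUBROUTINE SCALE_ARRAY_SYNTAX

  ! The equivalent nested loops. The first subscript varies fastest,
  ! so the innermost loop walks down one column at a time.
  SUBROUTINE SCALE_NESTED_LOOPS(A, B)
    REAL, INTENT(INOUT) :: A(100,30)
    REAL, INTENT(IN) :: B(100,30)
    INTEGER :: I, J
    DO J = 2, 30
      DO I = 1, 100
        A(I,J) = B(I,J-1) * 2.0
      END DO
    END DO
  END SUBROUTINE SCALE_NESTED_LOOPS
END MODULE ARRAY_LOOP_SKETCH
```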
In addition to loop unrolling and other optimizations, the number of branches is reduced by replicating code so that branches can be eliminated. Code replication decreases the number of basic blocks and increases instruction-scheduling opportunities.
Code replication normally occurs when a branch is at the end of a flow of control, such as a routine with multiple, short exit sequences. The code at the exit sequence gets replicated at the various places where a branch to it might occur.
For example, consider the following unoptimized routine and its optimized equivalent that uses code replication (R4 is register 4):
Unoptimized Instructions | Optimized (Replicated) Instructions |
---|---|
. |
. |
Similarly, code replication can also occur within a loop that contains
a small amount of shared code at the bottom of a loop and a case-type
dispatch within the loop. The loop-end test-and-branch code might be
replicated at the end of each case to create efficient instruction
pipelining within the code for each case.
5.7.5 Automatic Inlining and Software Pipelining
To enable optimizations that perform automatic inlining and software pipelining, use /OPTIMIZE=LEVEL=4 or a higher optimization level (LEVEL=5). Using /OPTIMIZE=LEVEL=4 also enables local optimizations (LEVEL=1), global optimizations (LEVEL=2), and additional global optimizations (LEVEL=3).
The default is /OPTIMIZE=LEVEL=4 (same as /OPTIMIZE).
5.7.5.1 Interprocedure Analysis
Compiling multiple source files at optimization level /OPTIMIZE=LEVEL=4 or higher lets the compiler examine more code for possible optimizations, including multiple program units. This results in:
As more procedures are inlined, the size of the executable program and
compile times may increase, but execution time should decrease.
5.7.5.2 Inlining Procedures
Inlining refers to replacing a subprogram reference (such as a CALL statement or function invocation) with the replicated code of the subprogram. As more procedures are inlined, global optimizations often become more effective.
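The effect of inlining can be shown by doing it by hand. In this sketch, which uses hypothetical names throughout (INLINE_SKETCH, SQUARE, SUMSQ_WITH_CALL, SUMSQ_INLINED), the second summation function is what the first one looks like after the reference to SQUARE has been replaced with the body of SQUARE.

```fortran
MODULE INLINE_SKETCH
  IMPLICIT NONE
CONTAINS
  ! A small function of the kind that is a natural inlining candidate.
  FUNCTION SQUARE(V) RESULT(S)
    REAL, INTENT(IN) :: V
    REAL :: S
    S = V * V
  END FUNCTION SQUARE

  ! Before inlining: one function reference per loop iteration.
  FUNCTION SUMSQ_WITH_CALL(X) RESULT(T)
    REAL, INTENT(IN) :: X(:)
    REAL :: T
    INTEGER :: I
    T = 0.0
    DO I = 1, SIZE(X)
      T = T + SQUARE(X(I))
    END DO
  END FUNCTION SUMSQ_WITH_CALL

  ! After inlining by hand: the loop body no longer contains a
  ! subprogram reference, so the whole loop is visible to the optimizer.
  FUNCTION SUMSQ_INLINED(X) RESULT(T)
    REAL, INTENT(IN) :: X(:)
    REAL :: T
    INTEGER :: I
    T = 0.0
    DO I = 1, SIZE(X)
      T = T + X(I) * X(I)
    END DO
  END FUNCTION SUMSQ_INLINED
END MODULE INLINE_SKETCH
```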
The optimizer inlines small procedures, limiting inlining candidates based on such criteria as:
You can specify:
Software pipelining applies instruction scheduling to certain innermost loops, allowing instructions within a loop to "wrap around" and execute in a different iteration of the loop. This can reduce the impact of long-latency operations, resulting in faster loop execution.
Software pipelining also enables the prefetching of data to reduce the impact of cache misses.
A group of optimizations known as loop transformation optimizations, together with the associated additional software dependence analysis, is enabled by using the /OPTIMIZE=LEVEL=5 qualifier. In certain cases, this improves run-time performance.
The loop transformation optimizations apply to array references within loops and can apply to multiple nested loops. These optimizations can improve the performance of the memory system.
In addition to the /OPTIMIZE=LEVEL qualifiers (discussed in
Section 5.7), several other FORTRAN command qualifiers and /OPTIMIZE
keywords can prevent or facilitate improved optimizations.
5.8.1 Loop Transformation
The loop transformation optimizations are enabled by using the /OPTIMIZE=LOOPS qualifier or the /OPTIMIZE=LEVEL=5 qualifier. Loop transformation attempts to improve performance by rewriting loops to make better use of the memory system. By rewriting loops, the loop transformation optimizations can increase the number of instructions executed, which can degrade the run-time performance of some programs.
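As one illustration of rewriting a loop nest to suit the memory system, consider interchanging two loops so that the innermost loop runs over the first subscript. Fortran stores arrays in column-major order, so this makes consecutive iterations touch adjacent memory locations. The sketch below is an assumption-laden example: the names are hypothetical, and the text above does not confirm that loop interchange is among the transformations the compiler applies to any particular loop.

```fortran
MODULE INTERCHANGE_SKETCH
  IMPLICIT NONE
CONTAINS
  ! Row-by-row traversal: the inner loop varies the second subscript,
  ! so successive iterations are a whole column length apart in memory.
  SUBROUTINE ADD_ROW_ORDER(A, B)
    REAL, INTENT(INOUT) :: A(:,:)
    REAL, INTENT(IN) :: B(:,:)
    INTEGER :: I, J
    DO I = 1, SIZE(A, 1)
      DO J = 1, SIZE(A, 2)
        A(I,J) = A(I,J) + B(I,J)
      END DO
    END DO
  END SUBROUTINE ADD_ROW_ORDER

  ! Interchanged: the inner loop varies the first subscript, so
  ! successive iterations touch adjacent elements.
  SUBROUTINE ADD_COLUMN_ORDER(A, B)
    REAL, INTENT(INOUT) :: A(:,:)
    REAL, INTENT(IN) :: B(:,:)
    INTEGER :: I, J
    DO J = 1, SIZE(A, 2)
      DO I = 1, SIZE(A, 1)
        A(I,J) = A(I,J) + B(I,J)
      END DO
    END DO
  END SUBROUTINE ADD_COLUMN_ORDER
END MODULE INTERCHANGE_SKETCH
```

Both subroutines compute the same result; only the order in which memory is visited differs.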
To request loop transformation optimizations without software pipelining, do one of the following:
The loop transformation optimizations apply to array references within loops. These optimizations can improve the performance of the memory system and usually apply to multiple nested loops. The loops chosen for loop transformation optimizations are always counted loops. Counted loops use a variable to count iterations, so the number of iterations is determined before entering the loop. For example, most DO loops are counted loops.
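The distinction between counted and uncounted loops can be seen in the following sketch, which uses hypothetical names (COUNTED_SKETCH, SUM_FIRST_N, SUM_UNTIL_NEGATIVE). In the first function the number of iterations is fixed on entry to the loop. In the second it depends on the data encountered along the way.

```fortran
MODULE COUNTED_SKETCH
  IMPLICIT NONE
CONTAINS
  ! Counted loop: the iteration count N is known before the loop starts.
  FUNCTION SUM_FIRST_N(X, N) RESULT(T)
    REAL, INTENT(IN) :: X(:)
    INTEGER, INTENT(IN) :: N
    REAL :: T
    INTEGER :: I
    T = 0.0
    DO I = 1, N
      T = T + X(I)
    END DO
  END FUNCTION SUM_FIRST_N

  ! Uncounted loop: it stops at the first negative element, so the
  ! iteration count cannot be determined on entry.
  FUNCTION SUM_UNTIL_NEGATIVE(X) RESULT(T)
    REAL, INTENT(IN) :: X(:)
    REAL :: T
    INTEGER :: I
    T = 0.0
    I = 1
    DO WHILE (I <= SIZE(X))
      IF (X(I) < 0.0) EXIT
      T = T + X(I)
      I = I + 1
    END DO
  END FUNCTION SUM_UNTIL_NEGATIVE
END MODULE COUNTED_SKETCH
```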
Conditions that typically prevent the loop transformation optimizations from occurring include subprogram references that are not inlined (such as an external function call), complicated exit conditions, and uncounted loops.
The types of optimizations associated with /OPTIMIZE=LOOPS include the following:
On the interaction of command-line options and timing programs compiled
with the loop transformation optimizations, see Section 5.7.
5.8.2 Software Pipelining
Software pipelining and additional software dependence analysis are enabled by using the /OPTIMIZE=PIPELINE qualifier or by the /OPTIMIZE=LEVEL=4 qualifier. Software pipelining in certain cases improves run-time performance.
The software pipelining optimization applies instruction scheduling to certain innermost loops, allowing instructions within a loop to "wrap around" and execute in a different iteration of the loop. This can reduce the impact of long-latency operations, resulting in faster loop execution.
Loop unrolling (enabled at /OPTIMIZE=LEVEL=3 or above) cannot schedule across iterations of a loop. Because software pipelining can schedule across loop iterations, it can perform more efficient scheduling to eliminate instruction stalls within loops.
For instance, if software dependence analysis of data flow reveals that certain calculations can be done before or after that iteration of the loop, software pipelining reschedules those instructions ahead of or behind that loop iteration, at places where their execution can prevent instruction stalls or otherwise improve performance.
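The idea of moving work into a neighboring iteration can be sketched at the source level. In POLY_PIPELINED below, the load needed by iteration I+1 is issued during iteration I, with a prologue before the loop and an epilogue after it. This is a hand-written analogy with hypothetical names and a hypothetical loop body; the compiler reschedules machine instructions, not source statements. Y is assumed to be at least as large as X.

```fortran
MODULE PIPELINE_SKETCH
  IMPLICIT NONE
CONTAINS
  ! Straightforward form: each iteration loads, then computes and stores.
  SUBROUTINE POLY_PLAIN(X, Y)
    REAL, INTENT(IN) :: X(:)
    REAL, INTENT(OUT) :: Y(:)
    REAL :: T
    INTEGER :: I
    DO I = 1, SIZE(X)
      T = X(I)
      Y(I) = T * T + 1.0
    END DO
  END SUBROUTINE POLY_PLAIN

  ! Pipelined form: the load for the next iteration is started while
  ! the current iteration's value is being used.
  SUBROUTINE POLY_PIPELINED(X, Y)
    REAL, INTENT(IN) :: X(:)
    REAL, INTENT(OUT) :: Y(:)
    REAL :: T, TNEXT
    INTEGER :: I, N
    N = SIZE(X)
    IF (N < 1) RETURN
    T = X(1)                   ! prologue: load for iteration 1
    DO I = 1, N - 1
      TNEXT = X(I+1)           ! load belonging to iteration I+1
      Y(I) = T * T + 1.0       ! compute and store for iteration I
      T = TNEXT
    END DO
    Y(N) = T * T + 1.0         ! epilogue: finish the last iteration
  END SUBROUTINE POLY_PIPELINED
END MODULE PIPELINE_SKETCH
```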
Software pipelining also enables the prefetching of data to reduce the impact of cache misses.
On Alpha systems, software pipelining can be more effective when you combine /OPTIMIZE=PIPELINE (or /OPTIMIZE=LEVEL=4) with the appropriate /OPTIMIZE=TUNE=keyword (Alpha only) for the target Alpha processor generation (see Section 5.8.6).
To specify software pipelining without loop transformation optimizations, do one of the following:
For this version of HP Fortran, loops chosen for software pipelining:
By modifying the unrolled loop and inserting instructions as needed before and/or after the unrolled loop, software pipelining generally improves run-time performance, except where the loops contain a large number of instructions with many existing overlapped operations. In this case, software pipelining may not have enough registers available to effectively improve execution performance, and using /OPTIMIZE=LEVEL=4 (or /OPTIMIZE=PIPELINE) may not improve run-time performance as compared to using /OPTIMIZE=(LEVEL=4,NOPIPELINE).
For programs that contain loops that exhaust available registers, longer execution times may result with /OPTIMIZE=LEVEL=4 or /OPTIMIZE=PIPELINE. In cases where performance does not improve, consider compiling with the /OPTIMIZE=UNROLL=1 qualifier along with /OPTIMIZE=LEVEL=4 or /OPTIMIZE=PIPELINE, to possibly improve the effects of software pipelining.
On the interaction of command-line options and timing programs compiled
with software pipelining, see Section 5.7.
5.8.3 Setting Multiple Qualifiers with the /FAST Qualifier
Specifying the /FAST qualifier sets the following qualifiers:
You can specify individual qualifiers on the command line to override
the /FAST defaults. Note that /FAST/ALIGNMENT=COMMONS=PACKED sets
/ALIGNMENT=NOSEQUENCE.
5.8.4 Controlling Loop Unrolling
You can specify the number of times a loop is unrolled by using the /OPTIMIZE=UNROLL=n qualifier (see Section 2.3.35).
Using /OPTIMIZE=UNROLL=n can also influence the run-time results of software pipelining optimizations performed when you specify /OPTIMIZE=LEVEL=5.
Although unrolling loops usually improves run-time performance, the size of the executable program may increase.
On loop unrolling, see Section 5.7.4.1.