Compaq Fortran
User Manual for
Tru64 UNIX and
Linux Alpha Systems

Chapter 6
Parallel Compiler Directives and Their Programming Environment

Note

The information in this chapter pertains only to Compaq Fortran on Tru64 UNIX systems.

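This chapter describes two sets of parallel compiler directives: the OpenMP Fortran API compiler directives (see Section 6.1) and the Compaq Fortran parallel compiler directives.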

The following topics apply to both the OpenMP Fortran API and the Compaq Fortran parallel compiler directives:

Note

The compiler can recognize either OpenMP directives or Compaq Fortran directives in a program, but not directives from both sets in the same program.

For reference material on both sets of parallel compiler directives, see the Compaq Fortran Language Reference Manual.

6.1 OpenMP Fortran API Compiler Directives

Note

These directives comply with OpenMP Fortran 1.1 Application Program Interface, as described in the specification at:

http://www.openmp.org/specs/

These topics are described:

Command-line option and directives format (see Section 6.1.1)
Directive summary descriptions (see Section 6.1.2)
Parallel processing thread model (see Section 6.1.3)
Privatizing named common blocks (see Section 6.1.4)
Controlling data scope attributes (see Section 6.1.5)
Parallel region (see Section 6.1.6)
Worksharing constructs (see Section 6.1.7)
Combined parallel/worksharing constructs (see Section 6.1.8)
Synchronization constructs (see Section 6.1.9)
Specifying schedule type and chunk size (see Section 6.1.10)

6.1.1 Command-Line Option and Directives Format

To use OpenMP Fortran API compiler directives in your program, you must include the -omp compiler option on your f90 command:

% f90 -omp prog.f -o prog

Directives are structured so that they appear to be Compaq Fortran comments. The format of an OpenMP Fortran API compiler directive is:

prefix directive_name [clause[[,] clause]...]

All OpenMP Fortran API compiler directives must begin with a directive prefix. Directives are not case-sensitive. Clauses can appear in any order after the directive name and can be repeated as needed, subject to the restrictions of individual clauses.

Directives cannot be embedded within continued statements, and statements cannot be embedded within directives. Comments can appear on the same line as a directive.

6.1.1.1 Directive Prefixes

The directive prefix you use depends on the source form you use in your program:

Use the !$OMP prefix when compiling either fixed source form or free source form programs.
Use the C$OMP and the *$OMP prefixes only when compiling fixed source form programs.

Fixed Source Form

For fixed source form programs, the prefix is one of the following:

!$OMP
C$OMP
*$OMP

Prefixes must start in column 1 and appear as a single string with no intervening white space. Fixed-form source rules apply to the directive line.

Initial directive lines must have a space or zero in column 6, and continuation directive lines must have a character other than a space or a zero in column 6. For example, the following formats for specifying directives are equivalent:

c23456789
!$OMP PARALLEL DO SHARED(A,B,C)

!Is the same as...

c$OMP PARALLEL DO
c$OMP+SHARED(A,B,C)

!Which is the same as...

c$OMP PARALLEL DO SHARED(A,B,C)

Free Source Form

For free source form programs, the prefix is !$OMP.

The prefix can appear in any column as long as it is preceded only by white space. It must appear as a single string with no intervening white space. Free-form source rules apply to the directive line.

Initial directive lines must have a space after the prefix. Continued directive lines must have an ampersand as the last nonblank character on the line. Continuation directive lines can have an ampersand after the directive prefix with optional white space before and after the ampersand. For example, the following formats for specifying directives are equivalent:

!$OMP PARALLEL DO &
!$OMP SHARED(A,B,C)

!The same as...

!$OMP PARALLEL &
!$OMP&DO SHARED(A,B,C)

!Which is the same as...

!$OMP PARALLEL DO SHARED(A,B,C)

6.1.1.2 Directive Prefixes for Conditional Compilation

OpenMP Fortran API allows you to conditionally compile Compaq Fortran statements. The directive prefix you use for conditional compilation statements depends on the source form you use in your program:

Use the !$ prefix when compiling either fixed source form or free source form programs.
Use the C$ (or c$) and the *$ prefixes only when compiling fixed source form programs.

The prefix must be followed by a legal Compaq Fortran statement on the same line. When you use the -omp compiler option, the prefix is replaced by two spaces and the rest of the line is treated as a normal Compaq Fortran statement during compilation. Without the -omp option, the line remains a comment. You can also use the C preprocessor macro _OPENMP for conditional compilation.

Fixed Source Form

For fixed source form programs, the conditional compilation prefix is one of the following: !$ , C$ (or c$), or *$.

The prefix must start in column 1 and appear as a single string with no intervening white space. Fixed-form source rules apply to the directive line.

Initial lines must have a space or zero in column 6, and continuation lines must have a character other than a space or zero in column 6. For example, the following forms for specifying conditional compilation are equivalent:

c23456789
!$    IAM = OMP_GET_THREAD_NUM() +
!$   *      INDEX

#IFDEF _OPENMP
      IAM = OMP_GET_THREAD_NUM() +
     *      INDEX
#ENDIF

Free Source Form

The free source form conditional compilation prefix is !$. This prefix can appear in any column as long as it is preceded only by white space. It must appear as a single word with no intervening white space. Free-form source rules apply to the directive line.

Initial lines must have a space after the prefix. Continued lines must have an ampersand as the last nonblank character on the line. Continuation lines can have an ampersand after the prefix with optional white space before and after the ampersand.
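The free source form rules above can be illustrated with a minimal sketch. The program name and the variables `id` and `offset` are hypothetical, chosen only for illustration. The sketch assumes a free-form source file compiled either with or without the OpenMP option (`-omp` on the f90 command described in Section 6.1.1, or an equivalent such as gfortran's `-fopenmp`).

```fortran
program cond_compile_demo
  implicit none
  integer :: id, offset
  ! The next line is compiled only when OpenMP is enabled;
  ! otherwise it is an ordinary comment.
!$ integer, external :: omp_get_thread_num

  offset = 100
  id = -1                      ! value kept when OpenMP is not enabled

  ! Initial line: a space follows the !$ prefix and the line ends with
  ! an ampersand. Continuation line: an ampersand follows the prefix.
!$ id = omp_get_thread_num() &
!$   & + offset

  if (id == -1) then
    print *, 'compiled without OpenMP'
  else
    print *, 'compiled with OpenMP, id =', id
  end if
end program cond_compile_demo
```

Because the statement runs outside any parallel region, `omp_get_thread_num()` returns 0 there, so an OpenMP build reports the value of `offset`.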

6.1.2 Summary Descriptions of OpenMP Fortran API Compiler Directives

Table 6-1 provides summary descriptions of the OpenMP Fortran API compiler directives. For complete information about the OpenMP Fortran API compiler directives, see the Compaq Fortran Language Reference Manual.

Table 6-1 OpenMP Fortran API Compiler Directives
Directive Format
Description

prefix ATOMIC

This directive defines a synchronization construct that ensures that a specific memory location is updated atomically.
See Section 6.1.9.1, ATOMIC Directive.

prefix BARRIER

This directive defines a synchronization construct that synchronizes all the threads in a team.
See Section 6.1.9.2, BARRIER Directive.

prefix CRITICAL [(name)]

block

prefix END CRITICAL [(name)]

These directives define a synchronization construct that restricts access to the contained code to only one thread at a time.
See Section 6.1.9.3, CRITICAL and END CRITICAL Directives.

prefix DO [clause[[,] clause] ...]

do_loop

[prefix END DO [NOWAIT]]

These directives define a worksharing construct that specifies that the iterations of the DO loop are executed in parallel.
See Section 6.1.7.1, DO and END DO directives.

prefix FLUSH [(var[,var]...)]

This directive defines a synchronization construct that identifies the precise point at which a consistent view of memory is provided.
See Section 6.1.9.4, FLUSH Directive.

prefix MASTER

block

prefix END MASTER

These directives define a synchronization construct that specifies that the contained block of code is to be executed only by the master thread of the team.
See Section 6.1.9.5, MASTER and END MASTER Directives.

prefix ORDERED

block

prefix END ORDERED

These directives define a synchronization construct that specifies that the contained block of code is executed in the order in which iterations would be executed during a sequential execution of the loop.
See Section 6.1.9.6, ORDERED and END ORDERED Directives.

prefix PARALLEL [clause[[,] clause] ...]

block

prefix END PARALLEL

These directives define a parallel construct that is a region of a program that must be executed by a team of threads until the END PARALLEL directive is encountered.
See Section 6.1.6, Parallel Region: PARALLEL and END PARALLEL Directives.

prefix PARALLEL DO [clause[[,] clause] ...]

do_loop

prefix END PARALLEL DO

These directives define a combined parallel/worksharing construct that is an abbreviated form of specifying a parallel region that contains a single DO directive.
See Section 6.1.8.1, PARALLEL DO and END PARALLEL DO Directives.

prefix PARALLEL SECTIONS [clause[[,] clause] ...]

block

prefix END PARALLEL SECTIONS

These directives define a combined parallel/worksharing construct that is an abbreviated form of specifying a parallel region that contains a single SECTIONS directive.
See Section 6.1.8.2, PARALLEL SECTIONS and END PARALLEL SECTIONS Directives.

prefix SECTIONS [clause[[,] clause] ...]

[prefix SECTION]

block

[prefix SECTION

block]
.
.
.
prefix END SECTIONS [NOWAIT]

These directives define a worksharing construct that specifies that the enclosed sections of code are to be divided among threads in the team. Each section is executed once by some thread in the team.
See Section 6.1.7.2, SECTIONS, SECTION, and END SECTIONS Directives.

prefix SINGLE [clause[[,] clause] ...]

block

prefix END SINGLE [NOWAIT]

These directives define a worksharing construct that specifies that the enclosed code is to be executed by only one thread in the team.
See Section 6.1.7.3, SINGLE and END SINGLE Directives.

prefix THREADPRIVATE(/cb/[,/cb/] ...)

This data environment directive makes named common blocks private to a thread, but global within the thread.
See Section 6.1.4, Privatizing Named Common Blocks: THREADPRIVATE Directive.

6.1.3 Parallel Processing Thread Model

A program containing OpenMP Fortran API compiler directives begins execution as a single process, called the master thread of execution. The master thread executes sequentially until the first parallel construct is encountered.

In OpenMP Fortran API, the PARALLEL and END PARALLEL directives define the parallel construct. When the master thread encounters a parallel construct, it creates a team of threads, with the master thread becoming the master of the team. The program statements enclosed by the parallel construct are executed in parallel by each thread in the team. These statements include routines called from within the enclosed statements.

The statements enclosed lexically within a construct define the static extent of the construct. The dynamic extent includes the static extent as well as the routines called from within the construct. When the END PARALLEL directive is encountered, the threads in the team synchronize at that point, the team is dissolved, and only the master thread continues execution. The other threads in the team enter a wait state.

You can specify any number of parallel constructs in a single program. As a result, thread teams can be created and dissolved many times during program execution.

In routines called from within parallel constructs, you can also use directives. Directives that are not in the lexical extent of the parallel construct, but are in the dynamic extent, are called orphaned directives. Orphaned directives allow you to execute major portions of your program in parallel with only minimal changes to the sequential version of the program. Using this functionality, you can code parallel constructs at the top levels of your program call tree and use directives to control execution in any of the called routines.

For example:

      subroutine F
      ...
!$OMP parallel...
      ...
      call G
      ...
      subroutine G
      ...
!$OMP DO...
      ...

The !$OMP DO is an orphaned directive because the parallel region it will execute in is not lexically present in G.
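A complete version of this pattern might look like the following sketch. The module, routine, and variable names (`work_mod`, `scale_array`, `a`) are hypothetical and not from the manual. The parallel region is opened in the main program, and the worksharing `DO` directive appears only in the called routine.

```fortran
module work_mod
  implicit none
contains
  subroutine scale_array(a, n)
    integer, intent(in)    :: n
    real,    intent(inout) :: a(n)
    integer :: i
    ! Orphaned directive: no PARALLEL directive is lexically visible here.
    ! The loop binds to whatever parallel region is active in the caller.
!$omp do
    do i = 1, n
      a(i) = 2.0 * a(i)
    end do
!$omp end do
  end subroutine scale_array
end module work_mod

program orphan_demo
  use work_mod
  implicit none
  integer, parameter :: n = 1000
  real :: a(n)

  a = 1.0
  ! Every thread in the team calls scale_array; the orphaned DO
  ! divides the iterations among them, so each element is doubled once.
!$omp parallel shared(a)
  call scale_array(a, n)
!$omp end parallel

  if (all(a == 2.0)) then
    print *, 'all elements doubled exactly once'
  else
    print *, 'unexpected result'
  end if
end program orphan_demo
```

If `scale_array` is called from serial code instead, the same routine still works: the `DO` directive then executes with a team of one thread.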

A parallel region is a block of code that must be executed by a team of threads in parallel.

A worksharing construct is the heart of parallel processing. A worksharing construct divides the execution of the enclosed code region among the members of the team created on entering the enclosing parallel region.

A combined parallel/worksharing construct denotes a parallel region that contains only one worksharing construct.

Synchronization is the interthread communication that ensures the consistency of shared data and coordinates parallel execution among threads. Shared data is consistent within a team of threads when all threads obtain the identical value when the data is accessed. A synchronization construct is used to assure this consistency of shared data.

A data environment directive controls the data environment during the execution of parallel constructs.

You can control the data environment within parallel and worksharing constructs. Using directives and data environment clauses on directives, you can:

Privatize named common blocks (see Section 6.1.4)
Control data scope attributes (see Section 6.1.5)

6.1.4 Privatizing Named Common Blocks: THREADPRIVATE Directive

You can make named common blocks private to a thread, but global within the thread, by using the THREADPRIVATE directive.

Each thread gets its own copy of the common block with the result that data written to the common block by one thread is not directly visible to other threads. During serial portions and MASTER sections of the program, accesses are to the master thread copy of the common block.

You cannot use a thread private common block or its constituent variables in any clause other than the COPYIN clause.

In the following example, common blocks BLK1 and FIELDS are specified as thread private:

      COMMON /BLK1/ SCRATCH
      COMMON /FIELDS/ XFIELD, YFIELD, ZFIELD
!$OMP THREADPRIVATE(/BLK1/,/FIELDS/)

6.1.5 Controlling Data Scope Attributes

You can use several directive clauses to control the data scope attributes of variables for the duration of the construct in which you specify them. If you do not specify a data scope attribute clause on a directive, the default is SHARED for those variables affected by the directive.

Each of the data scope attribute clauses accepts a list, which is a comma-separated list of named variables or named common blocks that are accessible in the scoping unit. When you specify named common blocks, they must appear between slashes ( /name/ ).

Not all of the clauses are allowed on all directives, but the directives to which each clause applies are listed in the clause descriptions.

The data scope attribute clauses are:

COPYIN
DEFAULT
PRIVATE
FIRSTPRIVATE
LASTPRIVATE
REDUCTION
SHARED

COPYIN Clause

Use the COPYIN clause on the PARALLEL, PARALLEL DO, and PARALLEL SECTIONS directives to copy the data in the master thread common block to the thread private copies of the common block. The copy occurs at the beginning of the parallel region. The COPYIN clause applies only to common blocks that have been declared THREADPRIVATE (see Section 6.1.4).

You do not have to specify a whole common block to be copied in; you can specify named variables that appear in the THREADPRIVATE common block. In the following example, the common blocks BLK1 and FIELDS are specified as thread private, but only one of the variables in common block FIELDS is specified to be copied in:

      COMMON /BLK1/ SCRATCH
      COMMON /FIELDS/ XFIELD, YFIELD, ZFIELD
!$OMP THREADPRIVATE(/BLK1/, /FIELDS/)
!$OMP PARALLEL DEFAULT(PRIVATE),COPYIN(/BLK1/,ZFIELD)
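The interaction between THREADPRIVATE and COPYIN can be checked with a small self-contained program. This is a sketch only: the counter `nbad` and the value 42 are hypothetical, and an integer `SCRATCH` is assumed for simplicity.

```fortran
program copyin_demo
  implicit none
  integer :: scratch, nbad
  common /blk1/ scratch
!$omp threadprivate(/blk1/)

  scratch = 42          ! written to the master thread's copy
  nbad = 0

  ! COPYIN broadcasts the master copy to each thread's private copy
  ! at the start of the region.
!$omp parallel copyin(/blk1/) reduction(+: nbad)
  if (scratch /= 42) nbad = nbad + 1
  scratch = scratch + 1 ! changes only this thread's copy
!$omp end parallel

  print *, 'threads that did not see 42:', nbad
  print *, 'master copy after the region:', scratch
end program copyin_demo
```

Under the rules described above, the first count should be 0, and the master copy should be 43 regardless of team size, because each thread increments only its own copy of the common block.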

DEFAULT Clause

Use the DEFAULT clause on the PARALLEL, PARALLEL DO, and PARALLEL SECTIONS directives to specify a default data scope attribute for all variables within the lexical extent of a parallel region. Variables in THREADPRIVATE common blocks are not affected by this clause. You can specify only one DEFAULT clause on a directive. The default data scope attribute can be one of the following:

PRIVATE
Makes all named objects in the lexical extent of the parallel region private to a thread. The objects include common block variables, but exclude THREADPRIVATE variables.
SHARED
Makes all named objects in the lexical extent of the parallel region shared among all the threads in the team.
NONE
Declares that there is no implicit default as to whether variables are PRIVATE or SHARED. You must explicitly specify the scope attribute for each variable in the lexical extent of the parallel region.

If you do not specify the DEFAULT clause, the default is DEFAULT(SHARED). However, loop control variables are always PRIVATE by default.

You can exempt variables from the default data scope attribute by using other scope attribute clauses on the parallel region as shown in the following example:

!$OMP PARALLEL DO DEFAULT(PRIVATE), FIRSTPRIVATE(I),SHARED(X),
!$OMP& SHARED(R) LASTPRIVATE(I)

PRIVATE Clause

Use the PRIVATE clause on the PARALLEL, DO, SECTIONS, SINGLE, PARALLEL DO, and PARALLEL SECTIONS directives to declare variables to be private to each thread in the team.

The behavior of variables declared PRIVATE is as follows:

A new object of the same type and size is declared once for each thread in the team, and the new object is no longer storage associated with the original object.
All references to the original object in the lexical extent of the directive construct are replaced with references to the private object.
Variables defined as PRIVATE are undefined for each thread on entering the construct, and the corresponding shared variable is undefined on exit from a parallel construct.
Contents, allocation state, and association status of variables defined as PRIVATE are undefined when they are referenced outside the lexical extent, but inside the dynamic extent, of the construct unless they are passed as actual arguments to called routines.

In the following example, the values of I and J are undefined on exit from the parallel region:

      INTEGER I,J
      I = 1
      J = 2
!$OMP PARALLEL PRIVATE(I) FIRSTPRIVATE(J)
      I = 3
      J = J + 2
!$OMP END PARALLEL
      PRINT *, I, J

FIRSTPRIVATE Clause

Use the FIRSTPRIVATE clause on the PARALLEL, DO, SECTIONS, SINGLE, PARALLEL DO, and PARALLEL SECTIONS directives to provide a superset of the PRIVATE clause functionality.

In addition to the PRIVATE clause functionality, private copies of the variables are initialized from the original object existing before the parallel construct.

LASTPRIVATE Clause

Use the LASTPRIVATE clause on the DO, SECTIONS, PARALLEL DO, and PARALLEL SECTIONS directives to provide a superset of the PRIVATE clause functionality.

When the LASTPRIVATE clause appears on a DO or PARALLEL DO directive, the thread that executes the sequentially last iteration updates the version of the object it had before the construct.

When the LASTPRIVATE clause appears on a SECTIONS or PARALLEL SECTIONS directive, the thread that executes the lexically last section updates the version of the object it had before the construct.

Subobjects that are not assigned a value by the last iteration of the DO loop or the lexically last SECTION directive are undefined after the construct.

Correct execution sometimes depends on the value that the last iteration of a loop assigns to a variable. You must list all such variables as arguments to a LASTPRIVATE clause so that the values of the variables are the same as when the loop is executed sequentially. As shown in the following example, the value of I at the end of the parallel region is equal to N+1, as it would be with sequential execution.

!$OMP PARALLEL
!$OMP DO LASTPRIVATE(I)
      DO I=1,N
         A(I) = B(I) + C(I)
      END DO
!$OMP END PARALLEL
      CALL REVERSE(I)

REDUCTION Clause

Use the REDUCTION clause on the PARALLEL, DO, SECTIONS, PARALLEL DO, and PARALLEL SECTIONS directives to perform a reduction on the specified variables by using an operator or intrinsic as shown:

REDUCTION ({operator | intrinsic} : list)

Operator can be one of the following: +, *, -, .AND., .OR., .EQV., or .NEQV..

Intrinsic can be one of the following: MAX, MIN, IAND, IOR, or IEOR.

The specified variables must be named scalar variables of intrinsic type and must be SHARED in the enclosing context. A private copy of each specified variable is created for each thread as if you had used the PRIVATE clause. The private copy is initialized to a value that depends on the operator or intrinsic as shown in Table 6-2. The actual initialization value will be consistent with the data type of the reduction variable.

Table 6-2 Operators/Intrinsics and Initialization Values for Reduction Variables
Operator/Intrinsic    Initialization Value
+                     0
*                     1
-                     0
.AND.                 .TRUE.
.OR.                  .FALSE.
.EQV.                 .TRUE.
.NEQV.                .FALSE.
MAX                   Smallest representable number
MIN                   Largest representable number
IAND                  All bits on
IOR                   0
IEOR                  0

At the end of the construct to which the reduction applies, the shared variable is updated to reflect the result of combining the original value of the SHARED reduction variable with the final value of each of the private copies using the specified operator.

Except for subtraction, all of the reduction operators are associative and the compiler can freely reassociate the computation of the final value. The partial results of a subtraction reduction are added to form the final value.

The value of the shared variable becomes undefined when the first thread reaches the clause containing the reduction, and it remains undefined until the reduction computation is complete. Normally, the computation is complete at the end of the REDUCTION construct. However, if you use the REDUCTION clause on a construct to which NOWAIT is also applied, the shared variable remains undefined until a barrier synchronization has been performed. This ensures that all of the threads have completed the REDUCTION clause.

The REDUCTION clause is intended to be used on a region or worksharing construct in which the reduction variable is used only in reduction statements having one of the following forms:

x = x operator expr
x = expr operator x (except for subtraction)
x = intrinsic (x, expr)
x = intrinsic (expr, x)

Some reductions can be expressed in other forms. For instance, a MAX reduction might be expressed as follows:

IF (x .LT. expr) x = expr

Alternatively, the reduction might be hidden inside a subroutine call. Be careful that the operator specified in the REDUCTION clause matches the reduction operation.

Any number of reduction clauses can be specified on the directive, but a variable can appear only once in a REDUCTION clause for that directive as shown in the following example:

!$OMP DO REDUCTION(+: A, Y),REDUCTION(.OR.: AM)

The following example shows how to use the REDUCTION clause:

!$OMP PARALLEL DO DEFAULT(PRIVATE),SHARED(A,B),REDUCTION(+: A,B)
      DO I=1,N
         CALL WORK(ALOCAL,BLOCAL)
         A = A + ALOCAL
         B = B + BLOCAL
      END DO
!$OMP END PARALLEL DO
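As a runnable illustration of the reduction forms listed earlier, the following sketch applies one operator reduction (`+`) and one intrinsic reduction (`MAX`) in the same loop, then compares the results with the serial intrinsics `SUM` and `MAXVAL`. The array `v`, its contents, and the variable names are hypothetical.

```fortran
program reduction_demo
  implicit none
  integer, parameter :: n = 100
  integer :: i, total, biggest
  integer :: v(n)

  ! Arbitrary test data in the range 0..100.
  do i = 1, n
    v(i) = mod(37 * i, 101)
  end do

  total   = 0
  biggest = -huge(biggest)

  ! Each thread accumulates into private copies of total and biggest;
  ! the copies are combined with the original values at the end.
!$omp parallel do shared(v) reduction(+: total) reduction(max: biggest)
  do i = 1, n
    total   = total + v(i)          ! x = x operator expr
    biggest = max(biggest, v(i))    ! x = intrinsic (x, expr)
  end do
!$omp end parallel do

  if (total == sum(v) .and. biggest == maxval(v)) then
    print *, 'reduction results match the serial values'
  else
    print *, 'mismatch:', total, sum(v), biggest, maxval(v)
  end if
end program reduction_demo
```

Integer data is used here so that the order in which partial results are combined cannot change the answer. With floating-point data, reassociation by the compiler can produce small differences from the sequential result.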

SHARED Clause

Use the SHARED clause on the PARALLEL, PARALLEL DO, and PARALLEL SECTIONS directives to make variables shared among all the threads in a team.

In the following example, the variables X and NPOINTS are shared among all the threads in the team:

!$OMP PARALLEL DEFAULT(PRIVATE),SHARED(X,NPOINTS)
      IAM = OMP_GET_THREAD_NUM()
      NP = OMP_GET_NUM_THREADS()
      IPOINTS = NPOINTS/NP
      CALL SUBDOMAIN(X,IAM,IPOINTS)
!$OMP END PARALLEL

6.1.6 Parallel Region: PARALLEL and END PARALLEL Directives

Note

For overview information, see Section 6.1.3, Parallel Processing Thread Model.

The PARALLEL and END PARALLEL directives define a parallel region as follows:

!$OMP PARALLEL
      !parallel region
!$OMP END PARALLEL

When a thread encounters a parallel region, it creates a team of threads and becomes the master of the team. You can control the number of threads in a team by the use of an environment variable or a run-time library call, or both.

The PARALLEL directive takes an optional comma-separated list of clauses that specifies:

Whether the statements in the parallel region are executed in parallel by a team of threads or serially by a single thread (IF clause)
Whether variables are PRIVATE, FIRSTPRIVATE, SHARED, or REDUCTION
Whether variables have a DEFAULT data scope attribute
Whether master thread common block values are copied to THREADPRIVATE copies of the common block (COPYIN clause)

Once created, the number of threads in the team remains constant for the duration of that parallel region. However, you can explicitly change the number of threads used in the next parallel region by calling the OMP_SET_NUM_THREADS run-time library routine from a serial portion of the program. This routine overrides any value you may have set using the OMP_NUM_THREADS environment variable.

Assuming you have used the OMP_NUM_THREADS environment variable to set the number of threads to 6, you can change the number of threads between parallel regions as follows:

      CALL OMP_SET_NUM_THREADS(3)
!$OMP PARALLEL
      .
      .
      .
!$OMP END PARALLEL
      CALL OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL DO
      .
      .
      .
!$OMP END PARALLEL DO
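To observe the effect of these calls, a program can record the team size inside each region. The following is a minimal sketch under stated assumptions: the variables `nt1` and `nt2` are hypothetical, and the run-time system is free to supply fewer threads than requested, so the printed values are not guaranteed to be 3 and 4.

```fortran
program set_threads_demo
  implicit none
  integer :: nt1, nt2
!$ integer, external :: omp_get_num_threads

  nt1 = 1                 ! team size reported if OpenMP is not enabled
  nt2 = 1

  ! Request three threads for the next parallel region.
!$ call omp_set_num_threads(3)
!$omp parallel shared(nt1)
!$omp master
!$ nt1 = omp_get_num_threads()
!$omp end master
!$omp end parallel

  ! Back in the serial portion: request four threads for the next region.
!$ call omp_set_num_threads(4)
!$omp parallel shared(nt2)
!$omp master
!$ nt2 = omp_get_num_threads()
!$omp end master
!$omp end parallel

  print *, 'team size in first region: ', nt1
  print *, 'team size in second region:', nt2
end program set_threads_demo
```

Only the master thread writes `nt1` and `nt2`, which avoids concurrent writes to the shared variables. The `!$` prefix keeps the program valid when it is compiled without the OpenMP option.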

Use the worksharing directives such as DO, SECTIONS, and SINGLE to divide the statements in the parallel region into units of work and to distribute those units so that each unit is executed by one thread.

In the following example, the !$OMP DO and !$OMP END DO directives and all the statements enclosed by them comprise the static extent of the parallel region:

!$OMP PARALLEL
!$OMP DO
      DO I=1,N
         B(I) = (A(I) + A(I-1)) / 2.0
      END DO
!$OMP END DO
!$OMP END PARALLEL

In the following example, the !$OMP DO and !$OMP END DO directives and all the statements enclosed by them, including all statements contained in the WORK subroutine, comprise the dynamic extent of the parallel region:

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO
      DO I = 1, N
         CALL WORK(I,N)
      END DO
!$OMP END DO
!$OMP END PARALLEL

When an IF clause is present on the PARALLEL directive, the enclosed code region is executed in parallel only if the scalar logical expression evaluates to TRUE. Otherwise, the parallel region is serialized. When there is no IF clause, the region is executed in parallel by default.

In the following example, the statements enclosed within the !$OMP DO and !$OMP END DO directives are executed in parallel only if there are more than three processors available. Otherwise the statements are executed serially:

!$OMP PARALLEL IF (OMP_GET_NUM_PROCS() .GT. 3)
!$OMP DO
      DO I=1,N
         Y(I) = SQRT(Z(I))
      END DO
!$OMP END DO
!$OMP END PARALLEL

If a thread executing a parallel region encounters another parallel region, it creates a new team and becomes the master of that new team. By default, nested parallel regions are always executed by a team of one thread.

To achieve better performance than sequential execution, a parallel region must contain one or more worksharing constructs so that the team of threads can execute work in parallel. It is the contained worksharing constructs that lead to the performance enhancements offered by parallel processing.

6.1.7 Worksharing Constructs

A worksharing construct must be enclosed dynamically within a parallel region if the worksharing directive is to execute in parallel. No new threads are launched and there is no implied barrier on entry to a worksharing construct.

The worksharing constructs are:

DO and END DO directives (see Section 6.1.7.1)
SECTIONS, SECTION, and END SECTIONS directives (see Section 6.1.7.2)
SINGLE and END SINGLE directives (see Section 6.1.7.3)

6.1.7.1 DO and END DO directives

The DO directive specifies that the iterations of the immediately following DO loop must be dispatched across the team of threads so that each iteration is executed by a single thread. The loop that follows a DO directive cannot be a DO WHILE or a DO loop that does not have loop control. The iterations of the DO loop are dispatched among the existing team of threads.

You cannot use a GOTO statement, or any other statement, to transfer control into or out of the DO construct.

If you specify the optional END DO directive, it must appear immediately after the end of the DO loop. If you do not specify the END DO directive, an END DO directive is assumed at the end of the DO loop, and threads synchronize at that point.

The loop iteration variable is private by default, so it is not necessary to declare it explicitly.

The clauses for the DO directive specify:

Whether variables are PRIVATE, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION
How loop iterations are SCHEDULEd onto threads

In addition, the ORDERED clause must be specified if the ORDERED directive appears in the dynamic extent of the DO directive.

If you do not specify the optional NOWAIT clause on the END DO directive, threads synchronize at the END DO directive. If you specify NOWAIT, threads do not synchronize, and threads that finish early proceed directly to the instructions following the END DO directive.

The DO directive optionally lets you:

Control data scope attributes (see Section 6.1.5, Controlling Data Scope Attributes).
Use the SCHEDULE clause to specify schedule type and chunk size (see Section 6.1.10, Specifying Schedule Type and Chunk Size).
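The effect of NOWAIT can be sketched with two loops that touch different arrays. Because the second loop does not read anything the first loop writes, threads that finish their share of the first loop can safely start the second without waiting. The array names and sizes are hypothetical.

```fortran
program nowait_demo
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: a(n), b(n)

!$omp parallel shared(a, b)

  ! No barrier at the end of this loop: threads proceed immediately.
!$omp do
  do i = 1, n
    a(i) = real(i)
  end do
!$omp end do nowait

  ! Independent of array a, so skipping the barrier above is safe.
!$omp do
  do i = 1, n
    b(i) = 2.0 * real(i)
  end do
!$omp end do

!$omp end parallel

  if (a(n) == real(n) .and. b(n) == 2.0 * real(n)) then
    print *, 'both loops completed'
  else
    print *, 'unexpected result'
  end if
end program nowait_demo
```

If the second loop read `a`, the NOWAIT would have to be removed (or a BARRIER directive added), because a thread could otherwise read elements that another thread has not yet written.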

6.1.7.2 SECTIONS, SECTION, and END SECTIONS Directives

Use the noniterative worksharing SECTIONS directive to divide the enclosed sections of code among the team. Each section is executed just one time by one thread.

Each section should be preceded with a SECTION directive, except for the first section, in which the SECTION directive is optional. The SECTION directive must appear within the lexical extent of the SECTIONS and END SECTIONS directives.

The last section ends at the END SECTIONS directive. When a thread completes its section and there are no undispatched sections, it waits at the END SECTIONS directive unless you specify NOWAIT.

The SECTIONS directive takes an optional comma-separated list of clauses that specifies which variables are PRIVATE, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION.

The following example shows how to use the SECTIONS and SECTION directives to execute subroutines XAXIS, YAXIS, and ZAXIS in parallel. The first SECTION directive is optional:

!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
      CALL XAXIS
!$OMP SECTION
      CALL YAXIS
!$OMP SECTION
      CALL ZAXIS
!$OMP END SECTIONS
!$OMP END PARALLEL

For More Information:

See Section 6.1.5, Controlling Data Scope Attributes.

6.1.7.3 SINGLE and END SINGLE Directives

Use the SINGLE directive when you want just one thread of the team to execute the enclosed block of code.

Threads that are not executing the SINGLE directive wait at the END SINGLE directive unless you specify NOWAIT.

The SINGLE directive takes an optional comma-separated list of clauses that specifies which variables are PRIVATE or FIRSTPRIVATE.

When the END SINGLE directive is encountered, an implicit barrier is erected and threads wait until all threads have finished. This can be overridden by using the NOWAIT option.

In the following example, the first thread that encounters the SINGLE directive executes subroutines OUTPUT and INPUT:

!$OMP PARALLEL DEFAULT(SHARED)
      CALL WORK(X)
!$OMP BARRIER
!$OMP SINGLE
      CALL OUTPUT(X)
      CALL INPUT(Y)
!$OMP END SINGLE
      CALL WORK(Y)
!$OMP END PARALLEL

For More Information:

See Section 6.1.5, Controlling Data Scope Attributes.

6.1.8 Combined Parallel/Worksharing Constructs

The combined parallel/worksharing constructs provide an abbreviated way to specify a parallel region that contains a single worksharing construct. The combined parallel/worksharing constructs are:

PARALLEL DO (see Section 6.1.8.1)
PARALLEL SECTIONS (see Section 6.1.8.2)

6.1.8.1 PARALLEL DO and END PARALLEL DO Directives

Use the PARALLEL DO directive to specify a parallel region that implicitly contains a single DO directive.

You can specify one or more of the clauses for the PARALLEL and the DO directives.

The following example shows how to parallelize a simple loop. The loop iteration variable is private by default, so it is not necessary to declare it explicitly. The END PARALLEL DO directive is optional:

!$OMP PARALLEL DO
      DO I=1,N
        B(I) = (A(I) + A(I-1)) / 2.0
      END DO
!$OMP END PARALLEL DO
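
The following sketch shows one way to combine clauses from both directives on a single PARALLEL DO directive. It is an illustration under stated assumptions, not code from the product documentation: the program name PARDO_DEMO, the array bounds, and the variable TOTAL are hypothetical, and A is declared with a lower bound of 0 so that A(I-1) is defined when I is 1:

```fortran
      PROGRAM PARDO_DEMO
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000
      INTEGER :: I
      REAL :: A(0:N), B(N), TOTAL
!     Serial setup: A(I) = I for I = 0..N
      DO I = 0, N
        A(I) = REAL(I)
      END DO
      TOTAL = 0.0
!     SHARED is a PARALLEL clause; SCHEDULE is a DO clause
!$OMP PARALLEL DO SHARED(A,B) REDUCTION(+:TOTAL) SCHEDULE(STATIC)
      DO I = 1, N
        B(I) = (A(I) + A(I-1)) / 2.0
        TOTAL = TOTAL + B(I)
      END DO
!$OMP END PARALLEL DO
      PRINT *, 'TOTAL =', TOTAL
      END PROGRAM PARDO_DEMO
```

With these inputs each B(I) equals I - 0.5, so TOTAL should be 500000 for any number of threads.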

6.1.8.2 PARALLEL SECTIONS and END PARALLEL SECTIONS Directives

Use the PARALLEL SECTIONS directive to specify a parallel region that implicitly contains a single SECTIONS directive.

You can specify one or more of the clauses for the PARALLEL and the SECTIONS directives.

The last section ends at the END PARALLEL SECTIONS directive.

In the following example, subroutines XAXIS, YAXIS, and ZAXIS can be executed concurrently. The first SECTION directive is optional. Note that all SECTION directives must appear in the lexical extent of the PARALLEL SECTIONS/END PARALLEL SECTIONS construct:

!$OMP PARALLEL SECTIONS
!$OMP SECTION
      CALL XAXIS
!$OMP SECTION
      CALL YAXIS
!$OMP SECTION
      CALL ZAXIS
!$OMP END PARALLEL SECTIONS

6.1.9 Synchronization Constructs

Synchronization constructs are used to assure the consistency of shared data and to coordinate parallel execution among threads.

The synchronization constructs are:

ATOMIC directive (see Section 6.1.9.1)
BARRIER directive (see Section 6.1.9.2)
CRITICAL directive (see Section 6.1.9.3)
FLUSH directive (see Section 6.1.9.4)
MASTER directive (see Section 6.1.9.5)
ORDERED directive (see Section 6.1.9.6)

6.1.9.1 ATOMIC Directive

Use the ATOMIC directive to ensure that a specific memory location is updated atomically instead of exposing the location to the possibility of multiple, simultaneously writing threads.

This directive applies only to the immediately following statement, which must have one of the following forms:

x = x operator expr
x = expr operator x
x = intrinsic (x, expr)
x = intrinsic (expr, x)

In the preceding statements:

x is a scalar variable of intrinsic type
expr is a scalar expression that does not reference x
intrinsic is either MAX, MIN, IAND, IOR, or IEOR
operator is either +, *, -, /, .AND., .OR., .EQV., or .NEQV.

This directive permits optimization beyond that of a critical section around the assignment. An implementation can replace all ATOMIC directives by enclosing the statement in a critical section. All of these critical sections must use the same unique name.

Only the load and store of x are atomic; the evaluation of expr is not atomic. To avoid race conditions, all updates of the location in parallel must be protected by using the ATOMIC directive, except those that are known to be free of race conditions. The function intrinsic, the operator operator, and the assignment must be the intrinsic function, operator, and assignment.

This restriction applies to the ATOMIC directive: All references to storage location x must have the same type and type parameters.

In the following example, the update of Y is performed atomically:

!$OMP ATOMIC
      Y = Y + B(I)
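
To show the directive in a complete setting, the following minimal sketch counts loop iterations into a small shared histogram. The program name ATOMIC_DEMO and the variables HIST, NBINS, and K are hypothetical. Several threads can pick the same bin at the same time, so each update of the form x = x operator expr is protected by ATOMIC:

```fortran
      PROGRAM ATOMIC_DEMO
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000, NBINS = 4
      INTEGER :: I, K
      INTEGER :: HIST(NBINS)
      HIST = 0
!$OMP PARALLEL DO PRIVATE(K) SHARED(HIST)
      DO I = 1, N
!       Choose a bin; different iterations can choose the same one
        K = MOD(I, NBINS) + 1
!       ATOMIC applies only to the statement that follows it
!$OMP ATOMIC
        HIST(K) = HIST(K) + 1
      END DO
!$OMP END PARALLEL DO
      PRINT *, 'HIST  =', HIST
      PRINT *, 'TOTAL =', SUM(HIST)
      END PROGRAM ATOMIC_DEMO
```

Each of the four bins should end up with 250 counts, for a total of 1000. Without the ATOMIC directive, concurrent updates of the same bin could be lost.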

6.1.9.2 BARRIER Directive

To synchronize all threads within a parallel region, use the BARRIER directive. You can use this directive only within a parallel region defined by using the PARALLEL directive. You cannot use the BARRIER directive within the DO, PARALLEL DO, SECTIONS, PARALLEL SECTIONS, or SINGLE directives.

When encountered, each thread waits at the BARRIER directive until all threads have reached the directive.

In the following example, the BARRIER directive ensures that all threads have executed the first loop and that it is safe to execute the second loop:

c$OMP PARALLEL
c$OMP DO PRIVATE(i)
      DO i = 1, 100
        b(i) = i
      END DO
c$OMP BARRIER
c$OMP DO PRIVATE(i)
      DO i = 1, 100
        a(i) = b(101-i)
      END DO
c$OMP END PARALLEL

6.1.9.3 CRITICAL and END CRITICAL Directives

Use the CRITICAL and END CRITICAL directives to restrict access to a block of code, referred to as a critical section, to one thread at a time.

A thread waits at the beginning of a critical section until no other thread in the team is executing a critical section having the same name.

When a thread enters the critical section, a latch variable is set to closed and all other threads are locked out. When the thread exits the critical section at the END CRITICAL directive, the latch variable is set to open, allowing another thread access to the critical section.

If you specify a critical section name in the CRITICAL directive, you must specify the same name in the END CRITICAL directive. If you do not specify a name for the CRITICAL directive, you cannot specify a name for the END CRITICAL directive.

All unnamed CRITICAL directives map to the same name. Critical section names are global to the program.

The following example includes several CRITICAL directives, and illustrates a queuing model in which a task is dequeued and worked on. To guard against multiple threads dequeuing the same task, the dequeuing operation must be in a critical section. Because there are two independent queues in this example, each queue is protected by CRITICAL directives having different names, XAXIS and YAXIS, respectively:

!$OMP PARALLEL DEFAULT(PRIVATE),SHARED(X,Y)
!$OMP CRITICAL(XAXIS)
      CALL DEQUEUE(IX_NEXT, X)
!$OMP END CRITICAL(XAXIS)
      CALL WORK(IX_NEXT, X)
!$OMP CRITICAL(YAXIS)
      CALL DEQUEUE(IY_NEXT,Y)
!$OMP END CRITICAL(YAXIS)
      CALL WORK(IY_NEXT, Y)
!$OMP END PARALLEL

Unnamed critical sections use the global lock from the Pthread package. This allows you to synchronize with other code by using the same lock. Named locks are created and maintained by the compiler and can be significantly more efficient.

6.1.9.4 FLUSH Directive

Use the FLUSH directive to identify a synchronization point at which a consistent view of memory is provided. Thread-visible variables are written back to memory at this point.

To avoid flushing all thread-visible variables at this point, include a list of comma-separated named variables to be flushed.

The following example uses the FLUSH directive for point-to-point synchronization between thread 0 and thread 1 for the variable ISYNC:

!$OMP PARALLEL DEFAULT(PRIVATE),SHARED(ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
!$OMP BARRIER
      CALL WORK()
! I Am Done With My Work, Synchronize With My Neighbor
      ISYNC(IAM) = 1
!$OMP FLUSH(ISYNC)
! Wait Till Neighbor Is Done
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
!$OMP FLUSH(ISYNC)
      END DO
!$OMP END PARALLEL

6.1.9.5 MASTER and END MASTER Directives

Use the MASTER and END MASTER directives to identify a block of code that is executed only by the master thread.

The other threads of the team skip the code and continue execution. There is no implied barrier at the END MASTER directive.

In the following example, only the master thread executes the routines OUTPUT and INPUT:

!$OMP PARALLEL DEFAULT(SHARED)
      CALL WORK(X)
!$OMP MASTER
      CALL OUTPUT(X)
      CALL INPUT(Y)
!$OMP END MASTER
      CALL WORK(Y)
!$OMP END PARALLEL

6.1.9.6 ORDERED and END ORDERED Directives

Use the ORDERED and END ORDERED directives within a DO construct to allow work within an ordered section to execute sequentially while allowing work outside the section to execute in parallel.

When you use the ORDERED directive, you must also specify the ORDERED clause on the DO directive.

Only one thread at a time is allowed to enter the ordered section, and then only in the order of loop iterations.

In the following example, the code prints out the indexes in sequential order:

!$OMP DO ORDERED,SCHEDULE(DYNAMIC)
      DO I=LB,UB,ST
        CALL WORK(I)
      END DO

      SUBROUTINE WORK(K)
!$OMP ORDERED
      WRITE(*,*) K
!$OMP END ORDERED
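
The same idea can be sketched as a self-contained program in which the ordered section sits directly inside the loop body. The program name ORDERED_DEMO and the variable SQ are hypothetical. The sketch uses the combined PARALLEL DO form with the ORDERED clause:

```fortran
      PROGRAM ORDERED_DEMO
      IMPLICIT NONE
      INTEGER :: I, SQ
!$OMP PARALLEL DO ORDERED SCHEDULE(DYNAMIC) PRIVATE(SQ)
      DO I = 1, 8
!       Work outside the ordered section can run in any order
        SQ = I * I
!       Threads enter here one at a time, in iteration order
!$OMP ORDERED
        PRINT *, I, SQ
!$OMP END ORDERED
      END DO
!$OMP END PARALLEL DO
      END PROGRAM ORDERED_DEMO
```

The output lines should appear in increasing order of I, from 1 through 8, whatever the number of threads.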

6.1.10 Specifying Schedule Type and Chunk Size

The SCHEDULE clause of the DO or PARALLEL DO directive specifies a scheduling algorithm that determines how iterations of the DO loop are divided among and dispatched to the threads of the team. The SCHEDULE clause applies only to the current DO or PARALLEL DO directive.

Within the SCHEDULE clause, you must specify a schedule type and, optionally, a chunk size. A chunk is a contiguous group of iterations dispatched to a thread. Chunk size must be a scalar integer expression.

The following list describes the schedule types and how the chunk size affects scheduling:

STATIC
The iterations are divided into pieces having a size specified by chunk. The pieces are statically dispatched to threads in the team in a round-robin manner in the order of thread number.
When chunk is not specified, the iterations are first divided into contiguous pieces by dividing the number of iterations by the number of threads in the team. Each piece is then dispatched to a thread before loop execution begins.
DYNAMIC
The iterations are divided into pieces having a size specified by chunk. As each thread finishes its currently dispatched piece of the iteration space, the next piece is dynamically dispatched to the thread.
When no chunk is specified, the default is 1.
GUIDED
The chunk size is decreased exponentially with each succeeding dispatch. Chunk specifies the minimum number of iterations to dispatch each time. If fewer than chunk iterations remain, the rest are dispatched.
When no chunk is specified, the default is 1.
RUNTIME
The decision regarding scheduling is deferred until run time. The schedule type and chunk size can be chosen at run time by using the OMP_SCHEDULE environment variable (see Table 6-4).
When you specify RUNTIME, you cannot specify a chunk size.
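
The schedule types above can be written on a directive as in the following minimal sketch. The program name SCHED_DEMO, the arrays, and the chunk values are hypothetical and chosen only for illustration. The schedule changes how iterations are divided among threads, not the result, so all four loops should produce the same values:

```fortran
      PROGRAM SCHED_DEMO
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100
      INTEGER :: I
      INTEGER :: A(N), B(N), C(N), D(N)
      LOGICAL :: SAME
!     STATIC, chunk 4: pieces of 4 iterations dealt round-robin
!$OMP PARALLEL DO SCHEDULE(STATIC,4)
      DO I = 1, N
        A(I) = 2 * I
      END DO
!$OMP END PARALLEL DO
!     DYNAMIC, no chunk: pieces of 1 iteration handed out on demand
!$OMP PARALLEL DO SCHEDULE(DYNAMIC)
      DO I = 1, N
        B(I) = 2 * I
      END DO
!$OMP END PARALLEL DO
!     GUIDED, chunk 2: shrinking pieces, normally at least 2 long
!$OMP PARALLEL DO SCHEDULE(GUIDED,2)
      DO I = 1, N
        C(I) = 2 * I
      END DO
!$OMP END PARALLEL DO
!     RUNTIME: type and chunk come from OMP_SCHEDULE at run time
!$OMP PARALLEL DO SCHEDULE(RUNTIME)
      DO I = 1, N
        D(I) = 2 * I
      END DO
!$OMP END PARALLEL DO
      SAME = ALL(A == B) .AND. ALL(B == C) .AND. ALL(C == D)
      PRINT *, 'all schedules agree:', SAME
      END PROGRAM SCHED_DEMO
```

Because the loop bodies here are identical and cheap, the sketch only demonstrates syntax. Any performance difference between schedule types depends on how uneven the work per iteration is.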

The following list shows which schedule type is used, in priority order:

The schedule type specified in the SCHEDULE clause of the current DO or PARALLEL DO directive
If the schedule type for the current DO or PARALLEL DO directive is RUNTIME, the default value specified in the OMP_SCHEDULE environment variable
The compiler default schedule type of STATIC

The following list shows which chunk size is used, in priority order:

The chunk size specified in the SCHEDULE clause of the current DO or PARALLEL DO directive
For RUNTIME schedule type, the value specified in the OMP_SCHEDULE environment variable
For DYNAMIC and GUIDED schedule types, the default value 1
If the schedule type for the current DO or PARALLEL DO directive is STATIC, the loop iteration space divided by the number of threads in the team

6.2 Compaq Fortran Parallel Compiler Directives

These directives are provided for compatibility with older programs that were written for parallel execution.

These topics are described:

Command-line option and directives format (see Section 6.2.1)
Directive summary descriptions (see Section 6.2.2)
Parallel processing thread model (see Section 6.2.3)
Privatizing named common blocks (see Section 6.2.4)
Controlling data scope attributes (see Section 6.2.5)
Parallel region (see Section 6.2.6)
Worksharing constructs (see Section 6.2.7)
Combined parallel/worksharing constructs (see Section 6.2.8)
Synchronization constructs (see Section 6.2.9)
Specifying a default chunk size (see Section 6.2.10)
Specifying a default schedule type (see Section 6.2.11)
Terminating loop execution early (see Section 6.2.12)

6.2.1 Command-Line Option and Directives Format

To use Compaq Fortran parallel compiler directives in your program, you must include the -mp compiler option on your f90 command:

% f90 -mp prog.f -o prog

The format of a Compaq Fortran parallel compiler directive is:

prefix directive_name [option[[,] option]...]

All Compaq Fortran parallel compiler directives must begin with a directive prefix. Directives are not case-sensitive. Options can appear in any order after the directive name and can be repeated as needed, subject to the restrictions of individual options.

Directives cannot be embedded within continued statements, and statements cannot be embedded within directives. Trailing comments are allowed.

6.2.1.1 Directive Prefixes

The directive prefix you use depends on the source form you use in your program:

Use the !$PAR prefix when compiling either fixed source form or free source form programs.
Use the C$PAR (or c$PAR) and the *$PAR prefixes only when compiling fixed source form programs.

Fixed Source Form

For fixed source form programs, the prefix is one of the following:

!$PAR
C$PAR (or c$PAR)
*$PAR

For four directives, there is another form for fixed source form programs. This nonpreferred form is accepted by the compiler for compatibility reasons. The four directives are CHUNK, COPYIN, DOACROSS, and MP_SCHEDTYPE, and the prefix is c$. Thus, these four directive forms are acceptable:

Preferred Directive Name	Acceptable Directive Name
!$PAR CHUNK	c$CHUNK
!$PAR COPYIN	c$COPYIN
!$PAR DOACROSS	c$DOACROSS
!$PAR MP_SCHEDTYPE	c$MP_SCHEDTYPE

For More Information:

See Section 6.1.1.1, Directive Prefixes.

Free Source Form

For free source form programs, the prefix is !$PAR.

For More Information:

See Section 6.1.1.1, Directive Prefixes.

6.2.2 Summary Descriptions of Compaq Fortran Parallel Compiler Directives

Table 6-3 provides summary descriptions of the Compaq Fortran parallel compiler directives. For complete information about the Compaq Fortran parallel compiler directives, see the Compaq Fortran Language Reference Manual.

Table 6-3 Compaq Fortran Parallel Compiler Directives
Directive Format	Description

prefix BARRIER

This directive defines a synchronization construct that synchronizes all the threads in a team.
See Section 6.2.9.1, BARRIER Directive.

prefix CHUNK = chunksize

This directive sets a default chunk size used to divide iterations among the threads of the team.
See Section 6.2.10, Specifying a Default Chunk Size.

prefix COPYIN object[, object]...

This data environment directive specifies that the listed variables, single array elements, and common blocks be copied from the master thread to the PRIVATE data objects having the same name.
Single array elements can be copied, but array sections cannot be copied.
Shared variables cannot be copied.
When an allocatable array is to be copied, it must be allocated when the COPYIN directive is encountered.
This directive is allowed only within PARALLEL and PARALLEL DO directives.

prefix CRITICAL SECTION [(latch-var)]

code

prefix END CRITICAL SECTION

These directives define a synchronization construct that specifies a block of code that is executed by one thread at a time.
See Section 6.2.9.2, CRITICAL SECTION and END CRITICAL SECTION Directives.

prefix INSTANCE {SINGLE | PARALLEL} /com-blk-name/[[,]/com-blk-name/]...

This data environment directive makes named common blocks available to threads.
See Section 6.2.4, Privatizing Named Common Blocks: TASKCOMMON or INSTANCE Directives.

prefix MP_SCHEDTYPE = mode

This directive sets a default run-time schedule type.
See Section 6.2.11, Specifying a Default Schedule Type.

prefix PARALLEL [region-option[[,]region-option]...]

code

prefix END PARALLEL

These directives define a parallel construct that is a region of a program that must be executed by a team of threads in parallel until the END PARALLEL directive is encountered.
See Section 6.2.6, Parallel Region: PARALLEL and END PARALLEL Directives.

prefix {PARALLEL DO | DOACROSS} [par-do-option[[,]par-do-option]...]

do_loop

[prefix END PARALLEL DO]

These directives define a combined parallel/worksharing construct that provides an abbreviated way to specify a parallel region that contains a single PDO directive.
See Section 6.2.8.1, PARALLEL DO and END PARALLEL DO Directives.

prefix PARALLEL SECTIONS [par-sect-option[[,]par-sect-option]...]

code

prefix END PARALLEL SECTIONS

These directives define a combined parallel/worksharing construct that provides an abbreviated way to specify a parallel region that contains a single PSECTIONS directive. The semantics are identical to explicitly specifying the PARALLEL directive immediately followed by a PSECTIONS directive.
See Section 6.2.8.2, PARALLEL SECTIONS and END PARALLEL SECTIONS Directives.

prefix PDO [pdo-option[[,]pdo-option]...]

do_loop

[prefix END PDO [NOWAIT]]

These directives define a worksharing construct that specifies that each set of iterations of the contained DO loop is a unit of work that can be scheduled on a single thread.
See Section 6.2.7.1, PDO and END PDO Directives.

prefix PDONE

This directive specifies that the DO loop in which the PDONE directive is contained should be terminated early.
See Section 6.2.12, Terminating Loop Execution Early: PDONE Directive.

prefix PSECTION[S] [sect-option[[,]sect-option]...]

[prefix SECTION]

code

[prefix SECTION

code ]

prefix END PSECTION[S] [NOWAIT]

These directives define a worksharing construct that specifies that the enclosed sections of code are to be divided among threads in the team.
See Section 6.2.7.2, PSECTIONS, SECTION, and END PSECTIONS Directives.

prefix SINGLE PROCESS [proc-option[[,]proc-option] ...]

code

prefix END SINGLE PROCESS [NOWAIT]

These directives define a worksharing construct that specifies a block of code that is executed by only one thread.
See Section 6.2.7.3, SINGLE PROCESS and END SINGLE PROCESS Directives.

prefix TASKCOMMON com-blk-name[,com-blk-name]...

This data environment directive makes named common blocks private to a thread, but global within the thread.
See Section 6.2.4, Privatizing Named Common Blocks: TASKCOMMON or INSTANCE Directives.

6.2.3 Parallel Processing Thread Model

The concepts of the parallel processing thread model are the same as those for OpenMP Fortran API with one exception: orphaned directives are not possible with Compaq Fortran parallel compiler directives.

For More Information:

See Section 6.1.3, Parallel Processing Thread Model.

You can control the data environment within parallel and worksharing constructs. Using directives and data environment options on directives, you can:

Privatize named common blocks (see Section 6.2.4)
Control data scope attributes (see Section 6.2.5)

6.2.4 Privatizing Named Common Blocks: TASKCOMMON or INSTANCE Directives

You can make named common blocks private to a thread, but global within the thread by using the TASKCOMMON or the INSTANCE PARALLEL directive:

For TASKCOMMON, specify a comma-separated list of common block names
For INSTANCE PARALLEL, specify a comma-separated list of common block names, each enclosed by slashes ( /name/ )

The TASKCOMMON and INSTANCE PARALLEL directives are semantically equivalent and differ only in form.

Only named common blocks can be made thread private.

Each thread gets its own copy of the common block, with the result that data written to the common block by one thread is not directly visible to other threads. During serial portions of the program, accesses are to the master thread copy of the common block.

You should assume that the data in thread private common blocks is undefined on entry into the first parallel region unless you specified the COPYIN option in the PARALLEL directive (see COPYIN Option in Section 6.2.5).

When you make thread private a common block that is initialized using DATA statements, the copy of the common block for each thread has that initial value. If no initial value is provided, the variables in the common block are assigned the value of zero.

You can also specify INSTANCE SINGLE, which is the default in the absence of any attribute for the directive. In this case, all threads share the same copy of the common block in the master thread. Assignments made by one thread affect the copy in all other threads.

When you specify INSTANCE PARALLEL, the named common blocks are made private to a thread, but global within the thread.

The TASKCOMMON directive is the same as the OpenMP Fortran API THREADPRIVATE directive except that slashes ( // ) do not have to be used to delimit named common blocks.

For More Information:

See Section 6.1.4, Privatizing Named Common Blocks: THREADPRIVATE Directive.

6.2.5 Controlling Data Scope Attributes

You can use several options to control the data scope attributes of variables for the duration of the construct in which you specify them. If you do not specify a data scope attribute option on a directive, the default is SHARED for those variables affected by the directive.

Each of the data scope attribute options accepts a list, which is a comma-separated list of named variables or named common blocks that are accessible in the scoping unit. When you specify named common blocks, they must appear between slashes ( /name/ ).

Not all of the options are allowed on all directives. The directives to which each option applies are listed in the option descriptions.

The data scope attribute options are:

COPYIN
DEFAULT
FIRSTPRIVATE
LASTLOCAL or LAST LOCAL
PRIVATE or LOCAL
REDUCTION
SHARED or SHARE

COPYIN Option

Use the COPYIN option on the PARALLEL, PARALLEL DO, and PARALLEL SECTIONS directives to copy named common block values from the master thread copy to threads at the beginning of a parallel region. The COPYIN option applies only to named common blocks that have been previously declared thread private using the TASKCOMMON or the INSTANCE PARALLEL directive (see Section 6.2.4).

Use a comma-separated list to name the common blocks and variables in common blocks you want to copy.

DEFAULT Option

This option is the same as the OpenMP Fortran API DEFAULT clause (see Section 6.1.5).

FIRSTPRIVATE Option

This option is the same as the OpenMP Fortran API FIRSTPRIVATE clause (see Section 6.1.5).

LASTLOCAL or LAST LOCAL Option

Except for differences in directive name spelling, the LASTLOCAL or LAST LOCAL option is the same as the OpenMP Fortran API LASTPRIVATE clause (see Section 6.1.5).

PRIVATE or LOCAL Option

Except for the alternate directive spelling of LOCAL, the PRIVATE or LOCAL option is the same as the OpenMP Fortran API PRIVATE clause (see Section 6.1.5).

REDUCTION Option

Use the REDUCTION option on the PDO directive to declare variables that are to be the object of a reduction operation. Use a comma-separated list to name the variables you want to declare as objects of a reduction.

The REDUCTION option in the Compaq Fortran parallel compiler directive set is different from the REDUCTION clause in the OpenMP Fortran API directive set. In the OpenMP Fortran API directive set, both a variable and an operator type are given. In the Compaq Fortran parallel compiler directive set, the operator is not given. The compiler must be able to determine the reduction operation from the source code. The REDUCTION option can be applied to a variable in a DO loop only if the variable meets the following criteria:

It must be scalar.
It must be assigned to exactly once in the DO loop.
It must be read from exactly once in the DO loop and only in the right side of the assignment.
The assignment must be one of the following forms:
x = x operator expr
x = expr operator x (except for subtraction)
x = operator(x, expr)
x = operator(expr, x)
where operator is one of the following supported reduction operations: +, -, *, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, or IOR.

The compiler rewrites the reduction operation by computing partial results into local variables and then combining the results into the reduction variable. The reduction variable must be SHARED in the enclosing context.
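
The rewrite described above can be sketched by hand. The following minimal example is written with the OpenMP Fortran API directives of Section 6.1 rather than the Compaq Fortran directives, and it is only an approximation of the idea: the code the compiler actually generates is not shown in this manual. The program name REDUCE_SKETCH and the variables TOTAL and PARTIAL are hypothetical:

```fortran
      PROGRAM REDUCE_SKETCH
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000
      INTEGER :: I, TOTAL, PARTIAL
      TOTAL = 0
!$OMP PARALLEL PRIVATE(PARTIAL) SHARED(TOTAL)
!     Each thread accumulates into its own local partial result
      PARTIAL = 0
!$OMP DO
      DO I = 1, N
        PARTIAL = PARTIAL + I
      END DO
!$OMP END DO NOWAIT
!     Partial results are combined into the shared variable,
!     one thread at a time
!$OMP CRITICAL
      TOTAL = TOTAL + PARTIAL
!$OMP END CRITICAL
!$OMP END PARALLEL
      PRINT *, 'TOTAL =', TOTAL
      END PROGRAM REDUCE_SKETCH
```

TOTAL should be 500500 for any number of threads. The sketch also shows why the reduction variable must be SHARED in the enclosing context: every thread has to combine its partial result into the same location.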

SHARED or SHARE Option

Except for the alternate directive spelling of SHARE, the SHARED or SHARE option is the same as the OpenMP Fortran API SHARED clause (see Section 6.1.5).

6.2.6 Parallel Region: PARALLEL and END PARALLEL Directives

The concepts of using a parallel region are the same as those for OpenMP Fortran API (see Section 6.1.6), with these differences:

Use the worksharing directives such as PDO, PSECTIONS, and SINGLE PROCESS to divide the statements in the parallel region into units of work and to distribute those units so that each unit is executed by one thread.
The environment variable you use to set the default number of threads is MP_THREAD_COUNT and the run-time library routine is OtsSetNumThreads.

6.2.7 Worksharing Constructs

A worksharing construct must be enclosed lexically (not dynamically, as with OpenMP Fortran API directives) within a parallel region if the worksharing directive is to execute in parallel. No new threads are launched and there is no implied barrier on entry to a worksharing construct.

The worksharing constructs are:

PDO and END PDO directives (see Section 6.2.7.1)
PSECTIONS, SECTION, and END PSECTIONS directives (see Section 6.2.7.2)
SINGLE PROCESS and END SINGLE PROCESS directives (see Section 6.2.7.3)

6.2.7.1 PDO and END PDO Directives

The PDO directive specifies that the iterations of the immediately following DO loop must be dispatched across the team of threads so that each iteration is executed in parallel by a single thread.

The loop that follows a PDO directive cannot be a DO WHILE or a DO loop that does not have loop control. The iterations of the DO loop are divided among and dispatched to the existing threads in the team.

You cannot use a GOTO statement, or any other statement, to transfer control into or out of the PDO construct.

This directive must be nested within the lexical extent of a PARALLEL directive. The PDO directive takes an optional comma-separated list of options that specifies:

Whether variables are PRIVATE, FIRSTPRIVATE, LASTLOCAL, or REDUCTION
How iterations are scheduled onto threads and whether this is deferred until run time (MP_SCHEDTYPE option)
How many iterations each thread is assigned (CHUNK or BLOCKED option)
Whether iterations are in an ordered sequence (ORDERED option)

If you specify the optional END PDO directive, it must appear immediately after the end of the DO loop. If you do not specify the END PDO directive, an END PDO directive is assumed at the end of the DO loop.

If you do not specify the optional NOWAIT clause on the END PDO directive, threads synchronize at the END PDO directive. If you specify NOWAIT, threads do not synchronize at the END PDO directive. Threads that finish early proceed directly to the instructions following the END PDO directive.

You can use the ORDERED option to affect the way threads are dispatched. When you specify this option, iterations are dispatched to threads in the same order they would be for sequential execution.

The PDO directive optionally lets you:

Control data scope attributes (see Section 6.2.5)
Specify chunk size (see Section 6.2.10)
Specify schedule type (see Section 6.2.11)
Terminate loop execution early (see Section 6.2.12)
Override implicit synchronization

Overriding Implicit Synchronization

Whether or not you include the END PDO directive at the end of the DO loop, by default an implicit synchronization point exists immediately after the last statement in the loop. Threads reaching this point wait until all threads complete their work and reach this synchronization point.

If there are no data dependences between the variables inside the loop and those outside the loop, there may be no reason to make threads wait. In this case, use the NOWAIT clause on the END PDO directive to override synchronization and allow threads to continue.
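
The following minimal sketch shows the NOWAIT form on two independent loops, using the directive formats listed in Table 6-3. The program name PDO_NOWAIT and the arrays are hypothetical. The sketch assumes that the SHARED and PRIVATE options take a parenthesized list in the same way as the corresponding OpenMP Fortran API clauses (see Section 6.2.5):

```fortran
      PROGRAM PDO_NOWAIT
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100
      INTEGER :: I
      REAL :: A(N), B(N)
!     Option syntax below is assumed to match the OpenMP clauses
!$PAR PARALLEL SHARED(A,B) PRIVATE(I)
!$PAR PDO
      DO I = 1, N
        A(I) = REAL(I)
      END DO
!$PAR END PDO NOWAIT
!     The second loop never reads A, so no thread needs to wait
!$PAR PDO
      DO I = 1, N
        B(I) = 2.0 * REAL(I)
      END DO
!$PAR END PDO
!$PAR END PARALLEL
      PRINT *, SUM(A), SUM(B)
      END PROGRAM PDO_NOWAIT
```

The sums should be 5050 and 10100. If the second loop read elements of A, the NOWAIT would have to be removed so that all of A is written before any thread continues.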

6.2.7.2 PSECTIONS, SECTION, and END PSECTIONS Directives

This directive is the same as the OpenMP Fortran API SECTIONS directive with the following exceptions:

The names are PSECTIONS and END PSECTIONS.
No REDUCTION clause or LASTPRIVATE clause is permitted.
LOCAL is permitted as an alternative spelling for the PRIVATE clause.

For More Information:

See Section 6.1.7.2, SECTIONS, SECTION, and END SECTIONS Directives.

6.2.7.3 SINGLE PROCESS and END SINGLE PROCESS Directives

This directive is the same as the OpenMP Fortran API SINGLE directive with the following exceptions:

The names are SINGLE PROCESS and END SINGLE PROCESS.
LOCAL is permitted as an alternative spelling for the PRIVATE clause.

For More Information:

See Section 6.1.7.3, SINGLE and END SINGLE Directives.

6.2.8 Combined Parallel/Worksharing Constructs

The combined parallel/worksharing constructs provide an abbreviated way to specify a parallel region that contains a single worksharing construct. The combined parallel/worksharing constructs are:

PARALLEL DO (see Section 6.2.8.1)
PARALLEL SECTIONS (see Section 6.2.8.2)

6.2.8.1 PARALLEL DO and END PARALLEL DO Directives

This directive is the same as the OpenMP Fortran API PARALLEL DO directive with the following exceptions:

You can use the alternate DOACROSS directive name instead of PARALLEL DO. For compatibility, the following form is also allowed (for fixed source form only): c$DOACROSS.
The options can be one or more of the options for the PARALLEL and PDO directives.

The PARALLEL DO directive optionally lets you:

Control data scope attributes (see Section 6.2.5)
Specify chunk size (see Section 6.2.10)
Specify schedule type (see Section 6.2.11)
Terminate loop execution early (see Section 6.2.12)

For More Information:

See Section 6.1.8.1, PARALLEL DO and END PARALLEL DO Directives.

6.2.8.2 PARALLEL SECTIONS and END PARALLEL SECTIONS Directives

This directive is the same as the OpenMP Fortran API PARALLEL SECTIONS directive with the following exception: The options can be one or more of the options for the PARALLEL and PSECTIONS directives, instead of the PARALLEL and SECTIONS directives.

The semantics are identical to explicitly specifying the PARALLEL directive immediately followed by a PSECTIONS directive.

For More Information:

See Section 6.1.8.2, PARALLEL SECTIONS and END PARALLEL SECTIONS Directives.

6.2.9 Synchronization Constructs

Synchronization constructs are used to assure the consistency of shared data and to coordinate parallel execution among threads.

The synchronization constructs are:

BARRIER directive (see Section 6.2.9.1)
CRITICAL SECTION directive (see Section 6.2.9.2)

6.2.9.1 BARRIER Directive

The BARRIER directive is the same as the OpenMP Fortran API BARRIER directive (see Section 6.1.9.2).

6.2.9.2 CRITICAL SECTION and END CRITICAL SECTION Directives

The CRITICAL SECTION and END CRITICAL SECTION directives are the same as the OpenMP Fortran API CRITICAL and END CRITICAL directives with the following exceptions:

The directive names are CRITICAL SECTION and END CRITICAL SECTION.
You can specify an optional latch variable name.
If you do not specify a latch variable name, the compiler assigns a unique name.
The END CRITICAL SECTION directive does not take a latch variable name.
You must explicitly initialize a latch variable to zero before any critical section using that latch variable is executed.
You must not reuse that latch variable in anything other than a critical section until all uses as a latch variable are complete.

For More Information:

See Section 6.1.9.3, CRITICAL and END CRITICAL Directives.

6.2.10 Specifying a Default Chunk Size

A chunk is a contiguous group of iterations dispatched to a thread. You can explicitly define a chunk size for a PDO or PARALLEL DO directive by using the CHUNK or BLOCKED option. Chunk size must be a scalar integer expression. The specified chunk size applies only to the current PDO or PARALLEL DO directive.

The following list shows which chunk size is used, in priority order:

The chunk size specified in the CHUNK or BLOCKED option of the current PDO or PARALLEL DO directive.
The value specified in the most recent CHUNK directive. (The CHUNK directive is provided for compatibility reasons.)
If the schedule type for the current PDO or PARALLEL DO directive is INTERLEAVED, DYNAMIC, GUIDED, or RUNTIME, the chunk size default value specified in the MP_CHUNK_SIZE environment variable.
The compiler default chunk size value of 1.

The chunk size and the schedule type interact as follows:

For the DYNAMIC and INTERLEAVED schedule types, iterations are always dispatched to threads in chunk size groups. If the total number of iterations is not evenly divisible by chunk size, the last group dispatched has fewer iterations.
For the GUIDED schedule type, chunk size is the minimum number of iterations that can be dispatched to a thread. If fewer than chunk size iterations remain, the remaining iterations are dispatched to the next available thread.
For the STATIC schedule type, chunk size is ignored.
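
To make the first rule concrete, the following sketch lists the groups that result when 10 iterations are divided with a chunk size of 4. It is plain serial Fortran that only mimics the arithmetic; it does not call the run-time system, and the names and values (n, chunk, first, last) are illustrative assumptions rather than part of either directive set.

```fortran
program chunk_groups
  implicit none
  integer, parameter :: n = 10      ! total iterations (illustrative value)
  integer, parameter :: chunk = 4   ! chunk size (illustrative value)
  integer :: first, last

  ! Walk the iteration space one chunk at a time. The last group is
  ! shorter when n is not evenly divisible by chunk.
  do first = 1, n, chunk
     last = min(first + chunk - 1, n)
     print '(a,i0,a,i0,a,i0,a)', 'iterations ', first, '-', last, &
           ' (', last - first + 1, ' iterations)'
  end do
end program chunk_groups
```

The program prints three groups: 1-4, 5-8, and 9-10. The last group holds only two iterations because 10 is not evenly divisible by 4.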

6.2.11 Specifying a Default Schedule Type

The schedule type specifies a scheduling algorithm that determines how chunks of loop iterations are dispatched to the threads of a team. The schedule type does not affect the semantics of the program, but might affect performance. You can explicitly define a run-time schedule type for the current PDO or PARALLEL DO directive by using the MP_SCHEDTYPE option. The specified schedule type applies to the current PDO or PARALLEL DO directive only.

The following list shows which schedule type is used, in priority order:

The schedule type specified in the MP_SCHEDTYPE option of the current PDO or PARALLEL DO directive.
The schedule type specified in the most recent MP_SCHEDTYPE directive. (The MP_SCHEDTYPE directive is provided for compatibility reasons.)
If the schedule type for the current PDO or PARALLEL DO directive is RUNTIME, the default value specified in the MP_SCHEDTYPE environment variable.
The compiler default schedule type of STATIC.

The following list describes the schedule types and how the chunk size affects scheduling:

For the STATIC or SIMPLE schedule types, one contiguous group of iterations is dispatched to each thread, with each group having approximately the same number of iterations.
For the INTERLEAVED or INTERLEAVE schedule types, a chunk-sized group of iterations is dispatched to each thread in a round-robin manner.
For the DYNAMIC schedule type, a chunk-sized group of the remaining iterations is dispatched to the next available thread. If fewer iterations than one chunk remain, all the remaining iterations are dispatched.
For the GUIDED or GSS schedule types (similar to the DYNAMIC schedule type), the number of iterations dispatched is relatively large at the beginning of the loop and decreases exponentially. The number of iterations dispatched is not necessarily evenly divisible by chunk size.
The specified chunk size is the minimum number of iterations that can be dispatched when a thread becomes available. When the number of remaining iterations is less than or equal to chunk size, all of the remaining iterations are dispatched to the next available thread.
In some cases, setting a chunk size greater than 1 improves execution efficiency as the loop nears termination. This is because contention between threads for the small number of remaining iterations is reduced.
For the RUNTIME schedule type, the schedule type and the chunk size are those specified in the MP_SCHEDTYPE environment variable.

The DYNAMIC and GUIDED schedule types introduce some amount of overhead required to manage the continuing dispatching of iterations to threads. However, this overhead is sometimes offset by better load balancing when the average execution time of iterations is not uniform throughout the loop.

The STATIC and INTERLEAVED schedule types dispatch all of the iterations to the threads in advance, with each thread receiving approximately equal numbers of iterations. One of these types is usually the most efficient schedule type when the average execution time of iterations is uniform throughout the loop.

6.2.12 Terminating Loop Execution Early: PDONE Directive

If you want to terminate loop execution early because a specified condition has been satisfied, use the PDONE directive. This is an executable directive and any undispatched iterations are not executed. However, all previously dispatched iterations are completed.

This directive must be nested within the lexical extent of a PDO or PARALLEL DO directive.

When the schedule type is STATIC or INTERLEAVED, this directive has no effect because all loop iterations are dispatched before the DO loop executes.

6.3 Decomposing Loops for Parallel Processing

Note

This section contains information that applies to both the OpenMP Fortran API and the Compaq Fortran parallel compiler directives. The code examples use the OpenMP API directive format.

To run in parallel, the source code in iterated DO loops must be decomposed by the user, and adequate system resources must be made available. Decomposition is the process of analyzing code for data dependences, dividing up the workload, and ensuring correct results when iterations run concurrently.

The term loop decomposition is used to specify the process of dividing the iterations of an iterated DO loop and running them on two or more threads of a shared-memory multi-processor computer system.

The only type of decomposition available with Compaq Fortran is directed decomposition using a set of parallel compiler directives.

The following sections describe how to decompose loops and how to use the OpenMP Fortran API and the Compaq Fortran parallel compiler directives to achieve parallel processing.

6.3.1 Steps in Using Directed Decomposition

When a program is compiled using the -omp or the -mp option, the compiler parses the parallel compiler directives. However, you must transform the source code to resolve any loop-carried dependences and improve run-time performance. (Another method of supporting parallel processing does not involve iterated DO loops. Instead, it allows large amounts of independent code to be run in parallel using the SECTIONS and SECTION directives.)

To use directed decomposition effectively, take the following steps:

Identify the loops that benefit most from parallel processing:
- Consider whether another algorithm might achieve more parallelism in general.
- Evaluate any caller or called loops and decompose the most CPU-intensive loops in the application (as long as there are no interfering dependences).
  If a parallel DO loop invokes a subprogram containing another parallel DO loop, only the parallel DO loop of the calling program will be run in parallel. Each of the threads executing the outermost parallel DO loop will execute all of the iterations in the innermost parallel DO loop in a serial, nonparallel fashion.
- Make sure the loop contains enough CPU work to outweigh the parallel-processing startup overhead.
Analyze the loop and resolve dependences as needed. (See Section 6.3.2, Resolving Dependences Manually.) If you cannot resolve loop-carried dependences, you cannot safely decompose the loop.
Make sure the shared or private attributes inside the loop are consistent with corresponding use outside the loop. By default, common blocks and individual variables are shared, except for the loop control variable and variables referenced in a subprogram called from within a parallel loop (in which case they are private by default).
Precede the loop with the PARALLEL directive followed by the DO directive. You can combine the two directives by using the PARALLEL DO directive.
As needed, manually optimize the loop.
Make sure the loop complies with restrictions of the parallel-processing environment.
Without using the -omp option or the -mp option, compile, test, and debug the program.
Using -omp (or -mp ), repeat the previous step.
Evaluate the parallel run:
- If you reach an acceptable level of performance and if the results are correct, stop.
- If the results are inaccurate, analyze the manually decomposed loops for dependences, apply a method to resolve them, and retest the parallel run.
- If performance is inadequate, consider adjusting the run-time environment (see Section 6.4) or performing other manual optimizations, or consider other alternatives discussed in this manual. Then reenter the cycle by retesting the parallel program.
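
As a minimal sketch of the end result of these steps, the following program decomposes a loop that has no loop-carried dependences. The program name, array names, and sizes are illustrative assumptions. The source is free-form, so if the program is compiled without the -omp option the directives are treated as comments and the loop runs serially, which matches the serial test step described above.

```fortran
program decompose_demo
  implicit none
  integer, parameter :: n = 1000     ! illustrative size
  integer :: i
  integer :: a(n), b(n)

  do i = 1, n
     b(i) = i
  end do

  ! Each iteration reads only b(i) and writes only a(i), so there is
  ! no loop-carried dependence and the loop can be decomposed as is.
  !$OMP PARALLEL DO PRIVATE(i) SHARED(a, b)
  do i = 1, n
     a(i) = 2*b(i) + 1
  end do
  !$OMP END PARALLEL DO

  ! Compare against the values a serial run would produce.
  if (a(1) == 3 .and. a(n) == 2*n + 1) then
     print '(a)', 'parallel results match the serial expectation'
  else
     print '(a)', 'MISMATCH'
  end if
end program decompose_demo
```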

6.3.2 Resolving Dependences Manually

In directed decomposition, you must resolve loop-carried dependences and dependences involving temporary variables to ensure safe parallel execution. Only cycles of dependences are nearly impossible to resolve.

If a loop contains a cycle of dependences, do one of the following:

Let the loop execute serially (possibly decompose an outer loop level)
Use a lock (CRITICAL) to force the critical section to execute serially
Recode or restructure the loop
Find another algorithm that does not have cycles of dependences

There are several methods for resolving dependences manually:

For dependences on variables used as temporaries, declare them PRIVATE; this effectively makes separate copies of temporary values for each thread.
Recode the loop so that the loop-carried dependence becomes loop independent, with each thread having the involved store and fetch operation contained in a single iteration.
Insert locks (CRITICAL) around the critical section containing the dependence.
Use this technique only for very CPU-intensive loops, when no other method is possible, and for the smallest amount of code possible. The locks extend processing time by making individual threads wait while only one executes the critical region at a time.
Recode loops with cycles of dependences (these are typically linear recurrences).

6.3.2.1 Resolving Dependences Involving Temporary Variables

Declare temporary variables PRIVATE to resolve dependences involving them. Temporary variables are used in intermediate calculations. If they are used in more than one iteration of a parallel loop, the program can produce incorrect results.

One thread might define a value and another thread use that value instead of the one it defined for a particular iteration. Loop control variables are prime examples of temporary variables that are declared PRIVATE by default within a parallel region. For example:

DO I = 1,100
   TVAR = A(I) + 2
   D(I) = TVAR + Y(I-1)
END DO

As long as certain criteria are met, you can resolve this kind of dependence by declaring the temporary variable (TVAR, in the example) PRIVATE. That way, each thread keeps its own copy of the variable.

For the criteria to be met, the values of the temporary variable must be all of the following:

Defined in each iteration, inside the loop
Meant to be used inside the same iteration that established it
Used nowhere outside the loop unless it is redefined outside the loop before subsequent use

The default for variables in a parallel loop is SHARED, so you must explicitly declare these variables PRIVATE to resolve this kind of dependence.
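
The following sketch applies this rule to the TVAR loop. It assumes integer data and a Y array declared with a lower bound of zero so that Y(I-1) is defined for I = 1; the array contents and the serial reference check are illustrative additions.

```fortran
program private_temp
  implicit none
  integer, parameter :: n = 100
  integer :: i, tvar
  integer :: a(n), d(n), expected(n)
  integer :: y(0:n)                  ! lower bound 0 so y(i-1) is valid for i = 1

  do i = 1, n
     a(i) = i
  end do
  do i = 0, n
     y(i) = 10*i
  end do

  ! Serial reference result.
  do i = 1, n
     tvar = a(i) + 2
     expected(i) = tvar + y(i-1)
  end do

  ! tvar is defined and used within the same iteration, so giving each
  ! thread its own copy removes the dependence on the shared temporary.
  !$OMP PARALLEL DO PRIVATE(i, tvar) SHARED(a, d, y)
  do i = 1, n
     tvar = a(i) + 2
     d(i) = tvar + y(i-1)
  end do
  !$OMP END PARALLEL DO

  if (all(d == expected)) then
     print '(a)', 'private temporary gives the serial result'
  else
     print '(a)', 'MISMATCH'
  end if
end program private_temp
```

If TVAR were left SHARED, one thread could overwrite it between another thread's assignment and use, and the comparison could fail intermittently.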

6.3.2.2 Resolving Loop-Carried Dependences

You can often resolve loop-carried dependences using one or more of the following loop transformations:

Loop alignment
Code replication
Loop distribution
Restructure the loop into an inner and outer loop

These techniques also resolve dependences that inhibit autodecomposition.

6.3.2.3 Loop Alignment

Loop alignment offsets memory references in the loop so that the dependence is no longer loop carried. The following example shows a loop that is aligned to resolve the dependence in array A:

Loop with Dependence:

DO I = 2,N
   A(I) = B(I)
   C(I) = A(I+1)
END DO

Aligned Statements:

C(I-1) = A(I)
A(I) = B(I)

To compensate for the alignment and achieve the same calculations as the original loop, you probably have to perform one or more of the following:

Change the loop control variable.
Add IF constructs.
Switch the order of the statements (this preserves the relative store-fetch order of the original loop).

Example 6-1 shows two possible forms of the final loop.

Example 6-1 Aligned Loop

! First possible form:
!$OMP PARALLEL PRIVATE (I)
!$OMP DO
DO I = 2,N+1
   IF (I .GT. 2) C(I-1) = A(I)
   IF (I .LE. N) A(I) = B(I)
END DO
!$OMP END DO
!$OMP END PARALLEL
!
! Second possible form; more efficient because the tests are
! performed outside the loop:
!
!$OMP PARALLEL
!$OMP DO
DO I = 3,N
   C(I-1) = A(I)
   A(I) = B(I)
END DO
!$OMP END DO
!$OMP END PARALLEL
IF (N .GE. 2) THEN
   A(2) = B(2)
   C(N) = A(N+1)
END IF
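
One way to gain confidence in an aligned loop is to compare it with the original serial loop on test data. The sketch below does this for the second form; the array sizes, initial values, and the reference arrays a0 and c0 are illustrative assumptions.

```fortran
program aligned_loop
  implicit none
  integer, parameter :: n = 10       ! illustrative size
  integer :: i
  integer :: a(n+1), b(n+1), c(n+1)  ! used by the aligned loop
  integer :: a0(n+1), c0(n+1)        ! used by the original serial loop

  do i = 1, n + 1
     a(i) = i
     b(i) = 100*i
  end do
  c = 0
  a0 = a
  c0 = 0

  ! Original loop: c0(i) must see a0(i+1) before iteration i+1 stores it.
  do i = 2, n
     a0(i) = b(i)
     c0(i) = a0(i+1)
  end do

  ! Aligned loop: the fetch and the store of a(i) share one iteration.
  !$OMP PARALLEL DO PRIVATE(i) SHARED(a, b, c)
  do i = 3, n
     c(i-1) = a(i)
     a(i) = b(i)
  end do
  !$OMP END PARALLEL DO
  if (n >= 2) then
     a(2) = b(2)
     c(n) = a(n+1)
  end if

  if (all(a == a0) .and. all(c == c0)) then
     print '(a)', 'aligned loop matches the original loop'
  else
     print '(a)', 'MISMATCH'
  end if
end program aligned_loop
```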

6.3.2.4 Code Replication

When a loop contains a loop-independent dependence as well as a loop-carried dependence, loop alignment alone is usually not adequate. By resolving the loop-carried dependence, you often misalign another dependence. Code replication creates temporary variables that duplicate operations and keep the loop-independent dependences inside each iteration.

In S₂ of the following loop, aligning the A(I-1) reference without code replication would misalign the A(I) reference:

Loop with Multiple Dependences:

DO I = 2,100
S₁   A(I) = B(I) + C(I)
S₂   D(I) = A(I) + A(I-1)
END DO

Misaligned Dependence:

D(I-1) = A(I-1) + A(I)
A(I) = B(I) + C(I)

Example 6-2 uses code replication to keep the loop-independent dependence inside each iteration. The temporary variable, TA, must be declared PRIVATE.

Example 6-2 Transformed Loop Using Code Replication

!$OMP PARALLEL PRIVATE (I,TA)
A(2) = B(2) + C(2)
D(2) = A(2) + A(1)
!$OMP DO
DO I = 3,100
   A(I) = B(I) + C(I)
   TA = B(I-1) + C(I-1)
   D(I) = A(I) + TA
END DO
!$OMP END DO
!$OMP END PARALLEL
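
The following sketch checks the replicated form against the original serial loop. It is an illustration under assumed integer test data, not part of the example above. For simplicity it executes the two statements for I = 2 before the parallel region rather than inside it.

```fortran
program code_replication
  implicit none
  integer, parameter :: n = 100
  integer :: i, ta
  integer :: a(n), b(n), c(n), d(n)  ! used by the transformed loop
  integer :: a0(n), d0(n)            ! used by the original serial loop

  do i = 1, n
     a(i) = -i
     b(i) = 3*i
     c(i) = 7*i + 1
  end do
  d = 0
  a0 = a
  d0 = 0

  ! Original loop: S2 reads a0(i-1), which S1 stored one iteration earlier.
  do i = 2, n
     a0(i) = b(i) + c(i)
     d0(i) = a0(i) + a0(i-1)
  end do

  ! Transformed loop: ta recomputes the value of a(i-1) locally, so no
  ! iteration depends on a store made by another iteration.
  a(2) = b(2) + c(2)
  d(2) = a(2) + a(1)
  !$OMP PARALLEL DO PRIVATE(i, ta) SHARED(a, b, c, d)
  do i = 3, n
     a(i) = b(i) + c(i)
     ta = b(i-1) + c(i-1)
     d(i) = a(i) + ta
  end do
  !$OMP END PARALLEL DO

  if (all(a == a0) .and. all(d == d0)) then
     print '(a)', 'replicated loop matches the original loop'
  else
     print '(a)', 'MISMATCH'
  end if
end program code_replication
```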

6.3.2.5 Loop Distribution

Loop distribution allows more parallelism when neither loop alignment nor code replication can resolve the dependences. Loop distribution divides the contents of loops into multiple loops so that dependences cross between two separate loops. The loops run serially in relation to each other, even if they both run in parallel.

The following loop contains multiple dependences that cannot be resolved by either loop alignment or code replication:

DO I = 1,100
S₁   A(I) = A(I-1) + B(I)
S₂   C(I) = B(I) - A(I)
END DO

Example 6-3 resolves the dependences by distributing the loop. S₂ can run in parallel despite the data recurrence in S₁.

Example 6-3 Distributed Loop

DO I = 1,100
S₁   A(I) = A(I-1) + B(I)
END DO
DO I = 1,100
S₂   C(I) = B(I) - A(I)
END DO
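
As a sketch of how the distributed loops might then be decomposed, the program below leaves the recurrence loop serial and places a PARALLEL DO directive on the second loop only. The integer test data, the lower bound of zero on A, and the reference arrays are illustrative assumptions.

```fortran
program distributed_loop
  implicit none
  integer, parameter :: n = 100
  integer :: i
  integer :: a(0:n), b(n), c(n)      ! a(0) is needed for a(i-1) when i = 1
  integer :: a0(0:n), c0(n)          ! used by the original fused loop

  do i = 1, n
     b(i) = i
  end do
  a = 0
  a0 = 0

  ! Original fused loop, run serially as a reference.
  do i = 1, n
     a0(i) = a0(i-1) + b(i)
     c0(i) = b(i) - a0(i)
  end do

  ! First loop: the recurrence on a stays serial.
  do i = 1, n
     a(i) = a(i-1) + b(i)
  end do

  ! Second loop: every a(i) is already final, so iterations are independent.
  !$OMP PARALLEL DO PRIVATE(i) SHARED(a, b, c)
  do i = 1, n
     c(i) = b(i) - a(i)
  end do
  !$OMP END PARALLEL DO

  if (all(a == a0) .and. all(c == c0)) then
     print '(a)', 'distributed loops match the original loop'
  else
     print '(a)', 'MISMATCH'
  end if
end program distributed_loop
```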

6.3.2.6 Restructuring a Loop into an Inner and Outer Nest

Restructuring a loop into an inner and outer loop nest can resolve some recurrences that are used as rapid approximations of a function of the loop control variable. For example, the following loop uses sines and cosines:

THETA = 2.*PI/N
DO I=0,N-1
   S = SIN(I*THETA)
   C = COS(I*THETA)
   .
   .   ! use S and C
   .
END DO

Using a recurrence to approximate the sines and cosines can make the serial loop run faster (with some loss of accuracy), but it prevents the loop from running in parallel:

THETA = 2.*PI/N
STH = SIN(THETA)
CTH = COS(THETA)
S = 0.0
C = 1.0
DO I=0,N-1
   .
   .   ! use S and C
   .
   S = S*CTH + C*STH
   C = C*CTH - S*STH
END DO

To resolve the dependences, substitute the SIN and COS calls. (However, this loses the performance improvement gained from using the recurrence.) You can also restructure the loop into an outer parallel loop and an inner serial loop. Each iteration of the outer loop reinitializes the recurrence, and the inner loop uses the value:

!$OMP PARALLEL SHARED (THETA,STH,CTH,LCHUNK) PRIVATE (ISTART,I,S,C)
THETA = 2.*PI/N
STH = SIN(THETA)
CTH = COS(THETA)
LCHUNK = (N + NWORKERS()-1) / NWORKERS()
!$OMP DO
DO ISTART = 0,N-1,LCHUNK
   S = SIN(ISTART*THETA)
   C = COS(ISTART*THETA)
   DO I = ISTART, MIN(N-1,ISTART+LCHUNK-1)
      .
      .   ! use S and C
      .
      S = S*CTH + C*STH
      C = C*CTH - S*STH
   END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
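
The sketch below is a self-contained variation on this restructuring that measures how far the recurrence drifts from the SIN and COS values. Several details are assumptions made for illustration: the constant nchunks stands in for the number of workers, the data is double precision, the array err records the error at each iteration, and a temporary (snew) holds the updated sine so that both updates use the values from the previous step.

```fortran
program nested_recurrence
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000
  integer, parameter :: nchunks = 4  ! hypothetical stand-in for the worker count
  real(dp), parameter :: pi = 3.141592653589793_dp
  real(dp) :: theta, sth, cth, s, c, snew
  real(dp) :: err(0:n-1)
  integer :: lchunk, istart, i

  theta = 2.0_dp*pi/n
  sth = sin(theta)
  cth = cos(theta)
  lchunk = (n + nchunks - 1)/nchunks

  ! Outer loop: parallel, one chunk per iteration, recurrence restarted
  ! from exact values. Inner loop: serial recurrence within the chunk.
  !$OMP PARALLEL DO PRIVATE(istart, i, s, c, snew) &
  !$OMP SHARED(theta, sth, cth, lchunk, err)
  do istart = 0, n - 1, lchunk
     s = sin(istart*theta)
     c = cos(istart*theta)
     do i = istart, min(n - 1, istart + lchunk - 1)
        ! "Use" s and c: record their distance from the exact values.
        err(i) = max(abs(s - sin(i*theta)), abs(c - cos(i*theta)))
        snew = s*cth + c*sth
        c = c*cth - s*sth
        s = snew
     end do
  end do
  !$OMP END PARALLEL DO

  if (maxval(err) < 1.0e-9_dp) then
     print '(a)', 'recurrence stays within tolerance in every chunk'
  else
     print '(a)', 'tolerance exceeded'
  end if
end program nested_recurrence
```

Because each chunk restarts the recurrence from exact values, the accumulated rounding error is bounded by the chunk length rather than by the full loop length.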

6.3.2.7 Dependences Requiring Locks

When no other method can resolve a dependence, you can put locks around the critical section that contains it. Locks force threads to execute the critical section serially, while allowing the rest of the loop to run in parallel.

However, locks degrade performance because they force the critical section to run serially and increase the overhead. They are best used only when no other technique resolves the dependence, and only in CPU-intensive loops.

To create locks in a loop, enclose the critical section between the CRITICAL and END CRITICAL directives. When a thread executes the CRITICAL directive and the latch variable is open, it takes possession of the latch variable, and other threads must wait to execute the section. The latch variable becomes open when the thread executing the section executes the END CRITICAL directive.

The latch variable is closed when a thread has possession of it and open when the latch variable is free.

In Example 6-4, the statement updating the sum is locked for safe parallel execution of the loop.

Example 6-4 Decomposed Loop Using Locks

INTEGER(4) LCK
!$OMP PARALLEL PRIVATE (I,Y) SHARED (LCK,SUM)
LCK = 0
   .
   .
   .
!$OMP DO
DO I = 1,1000
   Y = some_calculation
!$OMP CRITICAL (LCK)
   SUM = SUM + Y
!$OMP END CRITICAL (LCK)
END DO
!$OMP END DO
!$OMP END PARALLEL

This particular example is better solved using a REDUCTION clause as shown in Example 6-5.

Example 6-5 Decomposed Loop Using a REDUCTION Clause

INTEGER(4) LCK
!$OMP PARALLEL PRIVATE (I,Y) SHARED (LCK,SUM)
LCK = 0
   .
   .
   .
!$OMP DO REDUCTION (+:SUM)
DO I = 1,1000
   Y = some_calculation
   SUM = SUM + Y
END DO
!$OMP END DO
!$OMP END PARALLEL
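
A complete, minimal version of the reduction approach might look like the following sketch. The variable name total and the choice of Y = I as the per-iteration calculation are illustrative assumptions, picked so that the result is easy to check by hand.

```fortran
program reduction_demo
  implicit none
  integer :: i, y, total

  total = 0

  ! Each thread accumulates into its own private copy of total. The
  ! copies are combined with + when the loop completes, so no lock is
  ! needed around the update.
  !$OMP PARALLEL DO PRIVATE(i, y) REDUCTION(+:total)
  do i = 1, 1000
     y = i                  ! stands in for some_calculation
     total = total + y
  end do
  !$OMP END PARALLEL DO

  print '(a,i0)', 'total = ', total
end program reduction_demo
```

The sum of the integers from 1 through 1000 is 500500, so the program prints total = 500500 regardless of the number of threads.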

6.3.3 Coding Restrictions

Because iterations in a parallel DO loop execute in an indeterminate order and in different threads, certain constructs in these loops can cause unpredictable run-time behavior.

The following restrictions are flagged:

The loop control variable for a parallel loop must be declared an integer.
Only comment lines and blank lines can exist between a DO directive and the DO loop statement.
The loop body must not contain any RETURN statements.
A loop with a branch (GOTO) into or out of its body, or with an EXIT statement, cannot be run in parallel.

The following restrictions are not flagged:

Loop-carried dependences involving shared variables must not exist between iterations of a parallel loop.
Dependences involving private variables must not exist between code within a parallel loop and code executed before entry into or after the completion of the loop.
System services or run-time library routines that change the context of a thread (such as a change in privileges, priority, access mode, or environment variables) must not be called from within a parallel loop.
I/O statements and the control statements PAUSE and STOP must not be used in a routine called at any call level from within a parallel loop.
Private symbols must not be referenced in a SAVE statement in a routine called at any call level from within a parallel loop.
If a dummy argument is referenced within a parallel DO loop, the corresponding actual argument must reside in shared memory.
Random number generators must be used carefully inside parallel loops, because parallel processing affects how numbers are generated.

6.3.4 Manual Optimization

To manually optimize structures containing parallel loops:

Interchange loops so that the parallel loop has the most CPU work and the caches can perform efficiently.
Balance the parallel work among threads when it is unusually unbalanced.

6.3.4.1 Interchanging Loops

The following example shows a case in which an inner loop can run in parallel and an outer loop cannot, because of a loop-carried dependence. The inner loop also has a more effective memory-referencing pattern for parallel processing than the outer loop. By interchanging the loops, more work executes in parallel and the cache can perform more efficiently.

Original Structure:

DO I = 1,100
   DO J = 1,300
      A(I,J) = A(I+1,J) + 1
   END DO
END DO

Interchanged Structure:

!$OMP PARALLEL PRIVATE (J,I) SHARED (A)
!$OMP DO
DO J = 1,300
   DO I = 1,100
      A(I,J) = A(I+1,J) + 1
   END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
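
The sketch below runs both structures on the same test data and confirms that the interchange preserves the results. The array bounds (one extra row so that A(I+1,J) is defined), the initial values, and the reference array ref are illustrative assumptions.

```fortran
program interchange_demo
  implicit none
  integer, parameter :: ni = 100, nj = 300
  integer :: i, j
  integer :: a(ni+1, nj)             ! extra row so a(i+1,j) is valid for i = ni
  integer :: ref(ni+1, nj)           ! result of the original structure

  do j = 1, nj
     do i = 1, ni + 1
        a(i, j) = i + 1000*j
     end do
  end do
  ref = a

  ! Original structure, run serially: the dependence is carried by i.
  do i = 1, ni
     do j = 1, nj
        ref(i, j) = ref(i+1, j) + 1
     end do
  end do

  ! Interchanged structure: columns are independent, so j is parallel.
  ! The inner i loop stays serial and also walks memory contiguously.
  !$OMP PARALLEL DO PRIVATE(j, i) SHARED(a)
  do j = 1, nj
     do i = 1, ni
        a(i, j) = a(i+1, j) + 1
     end do
  end do
  !$OMP END PARALLEL DO

  if (all(a == ref)) then
     print '(a)', 'interchanged loops match the original loops'
  else
     print '(a)', 'MISMATCH'
  end if
end program interchange_demo
```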

6.3.4.2 Balancing the Workload

On the DO directive, you can specify the SCHEDULE(GUIDED) clause to use guided self-scheduling in manually decomposed loops, which is effective for most loops. However, when the iterations contain a predictably unbalanced workload, you can obtain better performance by manually balancing the workload. To do this, specify the chunk size in the SCHEDULE clause of the DO directive.

In the following loop, it might be very inefficient to divide the iterations into chunks of 50. A chunk size of 25 would probably be much more efficient on a system with two processors, depending on the amount of work being done by the routine SUB.

DO I = 1,100
   .
   .
   .
   IF (I .LT. 50) THEN
      CALL SUB(I)
   END IF
   .
   .
   .
END DO
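
The following sketch shows one way the chunk size of 25 might be specified. It assumes the STATIC schedule type, under which chunks are assigned to threads in round-robin order. The internal routine sub and the array work are hypothetical stand-ins for the real workload.

```fortran
program balance_demo
  implicit none
  integer :: i
  integer :: work(100)

  work = 0

  ! With two threads and a chunk size of 25, one thread receives
  ! iterations 1-25 and 51-75 and the other receives 26-50 and 76-100,
  ! so the expensive iterations (i < 50) are split 25 to 24. With a
  ! chunk size of 50, one thread would receive all 49 of them.
  !$OMP PARALLEL DO PRIVATE(i) SHARED(work) SCHEDULE(STATIC, 25)
  do i = 1, 100
     if (i < 50) then
        call sub(i, work(i))
     end if
  end do
  !$OMP END PARALLEL DO

  print '(a,i0)', 'iterations that called SUB: ', count(work /= 0)

contains

  ! Hypothetical stand-in for the expensive routine SUB.
  subroutine sub(k, w)
    integer, intent(in)  :: k
    integer, intent(out) :: w
    w = k*k
  end subroutine sub

end program balance_demo
```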

6.4 Environment Variables for Adjusting the Run-Time Environment

Note

This section contains information that applies to both the OpenMP Fortran API and the Compaq Fortran parallel compiler directives.

The OpenMP Fortran API and the Compaq Fortran parallel compiler directive sets also provide environment variables that adjust the run-time environment in unusual situations.

Regardless of whether you used the -omp or the -mp compiler option, when the compiler needs information supplied by an environment variable, the compiler first looks for an OpenMP Fortran API environment variable and then for a Compaq Fortran parallel compiler environment variable. If neither one is found, the compiler uses a default.

The compiler looks for environment variable information in the following situations:

When entering a parallel region, it looks for the number of threads ( OMP_NUM_THREADS or MP_THREAD_COUNT ), the spin count ( MP_SPIN_COUNT ), the yield count ( MP_YIELD_COUNT ), and the stack size ( MP_STACK_SIZE ).
When entering a DO or PARALLEL DO directive that has RUNTIME specified, it looks at schedule type ( OMP_SCHEDULE ).
When entering a worksharing directive, it looks at chunk size ( MP_CHUNK_SIZE ).

The OpenMP Fortran API environment variables are listed in Table 6-4.

Table 6-4 OpenMP Fortran API Environment Variables
Environment Variable¹ Interpretation

OMP_SCHEDULE This variable applies only to DO and PARALLEL DO directives that have the schedule type of RUNTIME. You can set the schedule type and an optional chunk size for these loops at run time. The schedule types are STATIC, DYNAMIC, and GUIDED.
For directives that have a schedule type other than RUNTIME, this variable is ignored. The compiler default schedule type is STATIC. If the optional chunk size is not set, a chunk size of one is assumed, except for the STATIC schedule type. For this schedule type, the default chunk size is set to the loop iteration space divided by the number of threads applied to the loop.

OMP_NUM_THREADS Use this environment variable to set the number of threads to use during execution. This number applies unless you explicitly change it by calling the OMP_SET_NUM_THREADS run-time library routine.
When you have enabled dynamic thread adjustment, the value assigned to this environment variable represents the maximum number of threads that can be used. The default value is the number of processors in the current system.

OMP_DYNAMIC Use this environment variable to enable or disable dynamic thread adjustment for the execution of parallel regions. When set to TRUE, the number of threads used can be adjusted by the run-time environment to best utilize system resources. When set to FALSE, dynamic adjustment is disabled. The default is FALSE.

OMP_NESTED Use this environment variable to enable or disable nested parallelism. When set to TRUE, nested parallelism is enabled. When set to FALSE, it is disabled. The default is FALSE.

¹Environment variable names must be in uppercase; the assigned values are not case-sensitive.

The Compaq Fortran parallel compiler environment variables are listed in Table 6-5.

Table 6-5 Compaq Fortran Parallel Environment Variables
Environment Variable¹ Interpretation

MP_THREAD_COUNT Specifies the number of threads the run-time system is to create. The default is the number of processors available to your process.

MP_CHUNK_SIZE Specifies the chunk size the run-time system uses when dispatching loop iterations to threads if the program specified the RUNTIME schedule type or specified another schedule type requiring a chunk size, but omitted the chunk size. The default chunk size is 1.

MP_STACK_SIZE Specifies how many bytes of stack space the run-time system allocates for each thread when creating it. If you specify zero, the run-time system uses the default, which is very small. Therefore, if a program declares any large arrays to be PRIVATE, specify a value large enough to allocate them. If you do not use this environment variable at all, the run-time system allocates 5 MB.

MP_SPIN_COUNT Specifies how many times the run-time system spins while waiting for a condition to become true. The default is 16,000,000, which is approximately one second of CPU time.

MP_YIELD_COUNT Specifies how many times the run-time system alternates between calling sched_yield and testing the condition before going to sleep by waiting for a thread condition variable. The default is 10.

¹Environment variable names must be in uppercase; the assigned values are not case-sensitive.

6.5 Calls to Programs Written in Other Languages

Note

This section contains information that applies to both the OpenMP Fortran API and the Compaq Fortran parallel compiler directives.

Only programs written in Compaq Fortran support parallel directives. When you call procedures or routines from within a parallel region in a Compaq Fortran program, consider the following:

Compile any Compaq Fortran programs containing parallel directives using the -mp or the -omp option.
Called procedures or routines must be thread safe.
It is the programmer's responsibility to ensure that all data objects in the called procedures or routines are shared or allocated on each thread's private stack.

6.6 Compiling, Linking, and Running Parallelized Programs on SMP Systems

Note

This section contains information that applies to both the OpenMP Fortran API and the Compaq Fortran parallel compiler directives.

Whether you compile and link your program in one step or in separate steps, you must include the name of the f90 Compaq Fortran driver (and the -omp or -mp option if you want to use parallel compiler directives) on each command line. For example, to compile and link the program prog.f with its OpenMP Fortran API directives in one step, use the command:

% f90 -omp prog.f -o prog

To separately compile and link the program prog.f , use these commands:

% f90 -c -omp prog.f
% f90 -omp prog.o -o prog

To run your program, use the command:

% prog

When you use the -omp (or -mp ) option, the driver sets the -reentrancy threaded and the -automatic options for the compiler if you did not specify them on the command line. The options are not set if you used the negated forms of the options on the command line. The driver also sets the -pthread and library options for the linker.

6.7 Debugging Parallelized Programs

Note

This section contains information that applies to both the OpenMP Fortran API and the Compaq Fortran parallel compiler directives.

When a Compaq Fortran program uses parallel decomposition directives, there are some special considerations concerning how that program can be debugged.

When a bug occurs in a Compaq Fortran program that uses parallel decomposition directives, the bug might be caused by incorrect Compaq Fortran statements, or it might be caused by incorrect parallel decomposition directives. In either case, the program to be debugged can be executed by multiple threads simultaneously.

6.7.1 Debugger Limitations for Parallelized Programs

Debuggers such as the Compaq Ladebug debugger provide features that support the debugging of programs that are executed by multiple threads. However, the currently available versions of Ladebug do not directly support the debugging of parallel decomposition directives, and therefore, there are limitations on the debugging features.

Other debuggers are available for use on UNIX. Before attempting to debug programs containing parallel decomposition directives, determine what level of support the debugger provides for these directives by reading the documentation or by contacting the supplier of the debugger.

Some of the new features used in OpenMP are not yet fully supported by the debuggers, so it is important to understand how these features work to understand how to debug them. The two problem areas are:

Outlining of parallel regions (see Section 6.7.2)
Shared variables (see Section 6.7.3)

6.7.2 Debugging Parallel Regions

The compiler implements a parallel region by taking the code in the region and putting it into a separate, compiler-created subroutine. This process is called outlining because it is the inverse of inlining a subroutine into its call site.

In place of the parallel region, the compiler inserts a call to a run-time library routine, which starts up threads and causes them to call the outlined routine. As threads return from the outlined routine, they return to the run-time library, which waits for all threads to finish before returning to the master thread in the original program.

Example 6-6 contains a section of the source listing with machine code (produced using f90 -omp -v -machine_code ). Note that the original program unit was named outline_example and the parallel region was at line 2. The compiler created an outlined routine called _2_outline_example_ . In general, the outlined routine is named _line-number_original-routine-name .

Example 6-6 Code Using Parallel Region

OUTLINE_EXAMPLE                 Source Listing

      1 program outline_example
      2 !$omp parallel
      3     print *, 'hello world'
      4 !$omp end parallel
      5     print *, 'done'
      6 end

OUTLINE_EXAMPLE                 Machine Code Listing

                        .text
                        .ent    _2_outline_example_
                        .eflag  16
             0000   _2_outline_example_:
27BB0001     0000       ldah    gp, _2_outline_example_
23BD8180     0004       lda     gp, _2_outline_example_
23DEFFA0     0008       lda     sp, -96(sp)
B75E0000     000C       stq     r26, (sp)
                        .mask   0x04000000,-96
                        .fmask  0x00000000,0
                        .frame  $sp, 96, $26
                        .prologue 1
A45D8040     0010       ldq     r2, 48(gp)
A77D8020     0014       ldq     r27, for_write_seq_lis
63FF0000     0018       trapb
47E17400     001C       mov     11, r0
265F0385     0020       ldah    r18, 901(r31)
A67D8018     0024       ldq     r19, 8(gp)
B3FE0008     0028       stl     r31, var$0001
221E0008     002C       lda     r16, var$0001
B41E0048     0030       stq     r0, 72(sp)
47E0D411     0034       mov     6, r17
B45E0050     0038       stq     r2, 80(sp)
2252FF00     003C       lda     r18, -256(r18)
229E0048     0040       lda     r20, 72(sp)
6B5B4000     0044       jsr     r26, for_write_seq_lis
27BA0001     0048       ldah    gp, _2_outline_example_
23BD8180     004C       lda     gp, _2_outline_example_
A75E0000     0050       ldq
63FF0000     0054       trapb
23DE0060     0058       lda     sp, 96(sp)
6BFA8001     005C       ret     (r26)
                        .end    _2_outline_example_

Routine Size: 96 bytes,    Routine Base: $CODE$ + 0000

                        .globl  outline_example_
                        .ent    outline_example_
                        .eflag  16
             0060   outline_example_:
27BB0001     0060       ldah    gp, outline_example_
23BD8180     0064       lda     gp, outline_example_
A77D8038     0068       ldq     r27, for_set_reentrancy
23DEFFA0     006C       lda     sp, -96(sp)
A61D8010     0070       ldq     r16, (gp)
B75E0000     0074       stq     r26, (sp)
                        .mask   0x04000000,-96
                        .fmask  0x00000000,0
                        .frame  $sp, 96, $26
                        .prologue 1
6B5B4000     0078       jsr     r26, for_set_reentrancy
27BA0001     007C       ldah    gp, outline_example_
23BD8180     0080       lda     gp, outline_example_
47FE0411     0084       mov     sp, r17
A77D8028     0088       ldq     r27, _OtsEnterParallelOpenMP
A61D8030     008C       ldq     r16, _2_outline_example_
47FF0412     0090       clr     r18
6B5B4000     0094       jsr     r26, _OtsEnterParallelOpenMP
27BA0001     0098       ldah    gp, outline_example_
47E09401     009C       mov     4, r1
23BD8180     00A0       lda     gp, outline_example_
265F0385     00A4       ldah    r18, 901(r31)
A47D8018     00A8       ldq     r3, 8(gp)
A77D8020     00AC       ldq     r27, for_write_seq_lis
A67D8018     00B0       ldq     r19, 8(gp)
221E0008     00B4       lda     r16, var$0001
20630008     00B8       lda     r3, 8(r3)
B3FE0008     00BC       stl     r31, var$0001
B43E0048     00C0       stq     r1, 72(sp)
47E0D411     00C4       mov     6, r17
B47E0050     00C8       stq     r3, 80(sp)
2252FF00     00CC       lda     r18, -256(r18)
229E0048     00D0       lda     r20, 72(sp)
6B5B4000     00D4       jsr     r26, for_write_seq_lis
27BA0001     00D8       ldah    gp, outline_example_
A75E0000     00DC       ldq     r26, (sp)
23BD8180     00E0       lda     gp, outline_example_
47E03400     00E4       mov     1, r0
23DE0060     00E8       lda     sp, 96(sp)
6BFA8001     00EC       ret     (r26)
                        .end    outline_example_

In the preceding example, the run-time library routine _OtsEnterParallelOpenMP is responsible for creating threads (if they have not already been created) and causing them to call the outlined routine. The outlined routine is called once by each thread.

Debugging the program at this level is just like debugging a program that uses POSIX threads directly. Breakpoints can be set in the outlined routine just like any other routine. When you name the routine, leave off the trailing underscore: all Compaq Fortran routine names are appended with a trailing underscore, so the debugger inserts it automatically.

6.7.3 Debugging Shared Variables

When a variable appears in a PRIVATE, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION clause on some block, the variable is made private to the parallel region by redeclaring it in the block. SHARED data, however, is not declared in the outlined routine. Instead, it gets its declaration from the parent routine.
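To make the distinction concrete, here is a small sketch (an illustration, not taken from the manual) in which tmp is PRIVATE, total is a REDUCTION variable, and squares is SHARED. Under the scheme described above, one would expect tmp and total to be redeclared for the parallel region, while squares keeps its declaration in the parent routine. The program and variable names are hypothetical.

```fortran
! Sketch: private, reduction, and shared variables in one loop.
! All names here are illustrative assumptions.
program scope_sketch
  implicit none
  integer, parameter :: n = 10
  integer :: squares(n)        ! SHARED: one copy, declared in the parent
  integer :: i, tmp, total     ! tmp is PRIVATE, total is a REDUCTION variable

  total = 0
!$omp parallel do private(tmp) shared(squares) reduction(+:total)
  do i = 1, n
    tmp = i * i                ! each thread works on its own tmp
    squares(i) = tmp           ! distinct elements of the shared array
    total = total + tmp        ! per-thread partial sums, combined at the end
  end do
!$omp end parallel do

  print *, total               ! sum of the squares of 1 through n
end program scope_sketch
```

When compiled without the parallel option, the directives are treated as comments and the loop runs serially with the same result.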

When in a debugger, you can switch from one thread to another. Each thread has its own program counter so each thread can be in a different place in the code. Example 6-7 shows a Ladebug session.

Example 6-7 Code Using Multiple Threads

% ladebug a.out
Welcome to the Ladebug Debugger Version 4.0-xx
------------------
object file name: a.out
Reading symbolic information ...done
(ladebug) stop in _2_outline_example
[#1: stop in subroutine _2_outline_example() ]
(ladebug) run
[1] stopped at [_2_outline_example:2 0x120002d14]
      2 !$omp parallel
(ladebug) show thread
  Thread State      Substate        Policy     Priority Name
  ------ ---------- --------------- ---------- -------- -------------
>*     1 running                    throughput 11       default thread
      -1 blocked    kernel          fifo       32       manager thread
      -2 ready                      idle       0        null thread for VP 0x0
       2 ready      not started     throughput 11       <anonymous>
       3 ready      not started     throughput 11       <anonymous>
       4 ready      not started     throughput 11       <anonymous>
       5 ready      not started     throughput 11       <anonymous>
       6 ready      not started     throughput 11       <anonymous>
(ladebug)

Thread 1 is the master thread. Do not confuse debugger thread numbers with OpenMP thread numbers. The compiler numbers threads beginning at zero, but the debugger numbers threads beginning at 1. There are also two extra threads in the debugging process, numbered -1 and -2, for use by the kernel.

Thread 1 has started running and is currently stopped just inside the outlined routine. The other threads have not started running because the example session was run on a uniprocessor workstation. On a multiprocessor, the other threads can run on different processors. To inspect another thread, switch to it with the thread command and examine its stack, as shown in Example 6-8.

Example 6-8 Code Using Multiple Processors

(ladebug) thread 2
  Thread State      Substate        Policy     Priority Name
  ------ ---------- --------------- ---------- -------- -------------
>      2 ready      not started     throughput 11       <anonymous>
(ladebug) where
>0  0x3ff805739e0 in thdBase(0x14005d7d0, 0x0, 0x0, 0x120003c20, 0x4, 0x0)
(ladebug) thread 1
  Thread State      Substate        Policy     Priority Name
  ------ ---------- --------------- ---------- -------- -------------
>*     1 running                    throughput 11       default thread
(ladebug) where
>0  0x120002d14 in _2_outline_example() omp_hello.f:2
#1  0x12000495c in _OtsEnterParallelOpenMP()
#2  0x120002d98 in outline_example() omp_hello.f:1
#3  0x120002ccc in main() for_main.c:203
(ladebug)

Thread 2 has not yet started and is reported as being in thdBase, a POSIX run-time support routine that threads run when they are created. Thread 1 is the master thread and is currently executing the outlined routine, called from the run-time library, which was called from the original program.

Note that only the master thread (thread 1) has a full call tree. The other threads have thdBase(), from which they call the outlined routine. If you want to look at variables higher on the call stack than the parallel region, you must first tell the debugger to switch to thread 1, and then use the up command to climb the call stack.

If SHARED data is in common blocks, the outlined routine accesses it the same way any other routine would. If the SHARED data is automatic storage associated with the routine where the parallel region appears, however, each thread has a pointer to the master thread stack when the parallel region is reached.

Variables on the master stack can be accessed through the pointer. The compiler handles this automatically and does describe the access in the symbol table, but Ladebug and TotalView currently do not support this uplevel access mechanism.

Example 6-9 makes this clearer.

Example 6-9 Code Using Shared Variables

UPLEVEL Source Listing

      1 program uplevel
      2 implicit none
      3 integer i
      4
      5 !$omp parallel
      6 !$omp atomic
      7 i = i + 1
      8 !$omp end parallel
      9
     10 print *, i
     11 end

UPLEVEL Machine Code Listing

                        .text
                        .ent _5_uplevel_
                        .eflag 16
              0000  _5_uplevel_:
    23DEFFC0  0000      lda sp, -64(sp)
                        .frame $sp, 64, $26
                        .prologue 0
    47E10402  0004      mov r1, __StaticLink.1           # r1, r2
    63FF0000  0008      trapb
    20620010  000C      lda r3, 16(r2)
              0010  L$3:
    A8230000  0010      ldl_l r1, (r3)
    40203000  0014      addl r1, 1, r0
    B8030000  0018      stl_c r0, (r3)
    E4000003  001C      beq r0, L$4
    63FF0000  0020      trapb
    23DE0040  0024      lda sp, 64(sp)
    6BFA8001  0028      ret (r26)
              002C  L$4:
    C3FFFFF8  002C      br L$3
                        .end _5_uplevel_

Routine Size: 48 bytes, Routine Base: $CODE$ + 0000

                        .globl uplevel_
                        .ent uplevel_
                        .eflag 16
              0030  uplevel_:
    27BB0001  0030      ldah gp, uplevel_                # gp, (r27)
    23BD8130  0034      lda gp, uplevel_                 # gp, (gp)
    23DEFFA0  0038      lda sp, -96(sp)
    B75E0000  003C      stq r26, (sp)
                        .mask 0x04000000,-96
                        .fmask 0x00000000,0
                        .frame $sp, 96, $26
                        .prologue 1
    A61D8010  0040      ldq r16, (gp)
    A77D8038  0044      ldq r27, for_set_reentrancy      # r27, 40(gp)
    6B5B4000  0048      jsr r26, for_set_reentrancy      # r26, (r27)
    27BA0001  004C      ldah gp, uplevel_                # gp, (r26)
    23BD8130  0050      lda gp, uplevel_                 # gp, (gp)
    A61D8030  0054      ldq r16, _5_uplevel_             # r16, 32(gp)
    47FE0411  0058      mov sp, r17
    47FF0412  005C      clr r18
    A77D8028  0060      ldq r27, _OtsEnterParallelOpenMP # r27, 24(gp)
    6B5B4000  0064      jsr r26, _OtsEnterParallelOpenMP # r26, (r27)
    27BA0001  0068      ldah gp, uplevel_                # gp, (r26)
    23BD8130  006C      lda gp, uplevel_                 # gp, (gp)
    B3FE0018  0070      stl r31, var$0001                # r31, 24(sp)
    A67D8018  0074      ldq r19, 8(gp)
    203E0010  0078      lda r1, I                        # r1, 16(sp)
    B43E0058  007C      stq r1, 88(sp)
    221E0018  0080      lda r16, var$0001                # r16, 24(sp)
    47E0D411  0084      mov 6, r17
    265F0385  0088      ldah r18, 901(r31)
    2252FF00  008C      lda r18, -256(r18)
    229E0058  0090      lda r20, 88(sp)
    A77D8020  0094      ldq r27, for_write_seq_lis       # r27, 16(gp)
    6B5B4000  0098      jsr r26, for_write_seq_lis       # r26, (r27)
    27BA0001  009C      ldah gp, uplevel_                # gp, (r26)
    23BD8130  00A0      lda gp, uplevel_                 # gp, (gp)
    47E03400  00A4      mov 1, r0
    A75E0000  00A8      ldq r26, (sp)
    23DE0060  00AC      lda sp, 96(sp)
    6BFA8001  00B0      ret (r26)
                        .end uplevel_

Routine Size: 132 bytes, Routine Base: $CODE$ + 0030

Note that in the main routine of this example, the variable i is kept at offset 16 from the stack pointer. The stack pointer is passed into _OtsEnterParallelOpenMP, which puts it into register r1 before calling _5_uplevel_. Each thread uses indirect addressing through this pointer to reach the shared i.
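The uplevel mechanism can be pictured at the source level as passing the parent's variable to the outlined routine by reference. The sketch below is only an analogy under that assumption: outlined_region and shared_i are hypothetical names, the two serial calls stand in for two threads, and the atomic update that the generated code performs with ldl_l and stl_c is not modeled. Unlike the listing above, the sketch initializes i so that its result is well defined.

```fortran
! Analogy only: the shared variable lives in the parent routine and the
! outlined routine reaches it indirectly. Names are illustrative.
program uplevel_sketch
  implicit none
  integer :: i                 ! lives in the parent's storage

  i = 0

  ! Each call stands in for one thread entering the outlined routine.
  call outlined_region(i)
  call outlined_region(i)

  print *, i                   ! incremented once per call

contains

  ! The dummy argument refers to the parent's variable, much as each
  ! thread reaches the shared i through the pointer it is given.
  subroutine outlined_region(shared_i)
    integer, intent(inout) :: shared_i
    shared_i = shared_i + 1
  end subroutine outlined_region

end program uplevel_sketch
```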

Because the debuggers have not yet been adjusted to understand uplevel addressing, the variable i does not appear to be declared in the outlined region, only in the original routine. To look at the value of the shared variable, we have to switch threads to the master thread and then get into the appropriate context. This is shown in Example 6-10.

Example 6-10 Code Looking at a Shared Variable Value

% ladebug a.out
Welcome to the Ladebug Debugger Version 4.0-xx
------------------
object file name: a.out
Reading symbolic information ...done
(ladebug) stop in _5_uplevel
[#1: stop in subroutine _5_uplevel() ]
(ladebug) run
[1] stopped at [_5_uplevel:5 0x120002cd8]
      5 !$omp parallel
(ladebug) where
>0  0x120002cd8 in _5_uplevel() omp_uplevel.f:5
#1  0x1200048ec in _OtsEnterParallelOpenMP
#2  0x120002d34 in uplevel() omp_uplevel.f:1
#3  0x120002c9c in main() for_main.c:203
(ladebug) p i
0
(ladebug) c
[1] stopped at [_5_uplevel:5 0x120002cd8]
      5 !$omp parallel
(ladebug) show thread
  Thread State      Substate        Policy     Priority Name
  ------ ---------- --------------- ---------- -------- -------------
       1 ready                      throughput 11       default thread
      -1 blocked    kernel          fifo       32       manager thread
      -2 ready                      idle       0        null thread for VP 0x0
>*     2 running                    throughput 11       <anonymous>
       3 ready      not started     throughput 11       <anonymous>
       4 ready      not started     throughput 11       <anonymous>
       5 ready      not started     throughput 11       <anonymous>
       6 ready      not started     throughput 11       <anonymous>
(ladebug) p i
Error: no value for symbol I
Error: no value for i
(ladebug) thread 1
  Thread State      Substate        Policy     Priority Name
  ------ ---------- --------------- ---------- -------- -------------
>      1 ready                      throughput 11       default thread
(ladebug) where
>0  0x12000493c in _OtsEnterParallelOpenMP
#1  0x120002d34 in uplevel() omp_uplevel.f:1
#2  0x120002c9c in main() for_main.c:203
(ladebug) p i
1
(ladebug) c
[1] stopped at [_5_uplevel:5 0x120002cd8]
      5 !$omp parallel
(ladebug) show thread
  Thread State      Substate        Policy     Priority Name
  ------ ---------- --------------- ---------- -------- -------------
       1 ready                      throughput 11       default thread
      -1 blocked    kernel          fifo       32       manager thread
      -2 ready                      idle       0        null thread for VP 0x0
       2 ready                      throughput 11       <anonymous>
>*     3 running                    throughput 11       <anonymous>
       4 ready      not started     throughput 11       <anonymous>
       5 ready      not started     throughput 11       <anonymous>
       6 ready      not started     throughput 11       <anonymous>
(ladebug) where
>0  0x120002cd8 in _5_uplevel() omp_uplevel.f:5
#1  0x120003d90 in slave_main(arg=2) ots_parallel.bli:859
#2  0x3ff80573ea4 in thdBase(0x0, 0x0, 0x0, 0x1, 0x45586732, 0x3) DebugInformationStrippedFromFile101
(ladebug) p i
Error: no value for symbol I
Error: no value for i
(ladebug) thread 1
  Thread State      Substate        Policy     Priority Name
  ------ ---------- --------------- ---------- -------- -------------
>      1 ready                      throughput 11       default thread
(ladebug) up
>1  0x120002d34 in uplevel() omp_uplevel.f:1
      1 program uplevel
(ladebug) p i
2
(ladebug) q
%


Directive Format	Description
prefix ATOMIC
	This directive defines a synchronization construct that ensures that a specific memory location is updated atomically. See Section 6.1.9.1, ATOMIC Directive.

prefix BARRIER
	This directive defines a synchronization construct that synchronizes all the threads in a team. See Section 6.1.9.2, BARRIER Directive.

prefix CRITICAL [(name)] block prefix END CRITICAL [(name)]
	These directives define a synchronization construct that restricts access to the contained code to only one thread at a time. See Section 6.1.9.3, CRITICAL and END CRITICAL Directives.

prefix DO [clause[[,] clause] ...] do_loop [prefix END DO [NOWAIT]]
	These directives define a worksharing construct that specifies that the iterations of the DO loop are executed in parallel. See Section 6.1.7.1, DO and END DO directives.

prefix FLUSH [(var[,var]...)]
	This directive defines a synchronization construct that identifies the precise point at which a consistent view of memory is provided. See Section 6.1.9.4, FLUSH Directive.

prefix MASTER block prefix END MASTER
	These directives define a synchronization construct that specifies that the contained block of code is to be executed only by the master thread of the team. See Section 6.1.9.5, MASTER and END MASTER Directives.

prefix ORDERED block prefix END ORDERED
	These directives define a synchronization construct that specifies that the contained block of code is executed in the order in which iterations would be executed during a sequential execution of the loop. See Section 6.1.9.6, ORDERED and END ORDERED Directives.

prefix PARALLEL [clause[[,] clause] ...] block prefix END PARALLEL
	These directives define a parallel construct that is a region of a program that must be executed by a team of threads until the END PARALLEL directive is encountered. See Section 6.1.6, Parallel Region: PARALLEL and END PARALLEL Directives.

prefix PARALLEL DO [clause[[,] clause] ...] do_loop prefix END PARALLEL DO
	These directives define a combined parallel/worksharing construct that is an abbreviated form of specifying a parallel region that contains a single DO directive. See Section 6.1.8.1, PARALLEL DO and END PARALLEL DO Directives.

prefix PARALLEL SECTIONS [clause[[,] clause] ...] block prefix END PARALLEL SECTIONS
	These directives define a combined parallel/worksharing construct that is an abbreviated form of specifying a parallel region that contains a single SECTIONS directive. See Section 6.1.8.2, PARALLEL SECTIONS and END PARALLEL SECTIONS Directives.

prefix SECTIONS [clause[[,] clause] ...] [prefix SECTION] block [prefix SECTION block ] . . . prefix END SECTIONS [NOWAIT]
	These directives define a worksharing construct that specifies that the enclosed sections of code are to be divided among threads in the team. Each section is executed once by some thread in the team. See Section 6.1.7.2, SECTIONS, SECTION, and END SECTIONS Directives.

prefix SINGLE [clause[[,] clause] ...] block prefix END SINGLE [NOWAIT]
	These directives define a worksharing construct that specifies that the enclosed code is to be executed by only one thread in the team. See Section 6.1.7.3, SINGLE and END SINGLE Directives.

prefix THREADPRIVATE(/cb/[,/cb/] ...)
	This data environment directive makes named common blocks private to a thread, but global within the thread. See Section 6.1.4, Privatizing Named Common Blocks: THREADPRIVATE Directive.

Directive Format	Description
prefix BARRIER
	This directive defines a synchronization construct that synchronizes all the threads in a team. See Section 6.2.9.1, BARRIER Directive.

prefix CHUNK = chunksize
	This directive sets a default chunk size used to divide iterations among the threads of the team. See Section 6.2.10, Specifying a Default Chunk Size.

prefix COPYIN object[, object]...
	This data environment directive specifies that the listed variables, single array elements, and common blocks be copied from the master thread to the PRIVATE data objects having the same name. Single array elements can be copied, but array sections cannot be copied. Shared variables cannot be copied. When an allocatable array is to be copied, it must be allocated when the COPYIN directive is encountered. This directive is allowed only within PARALLEL and PARALLEL DO directives.

prefix CRITICAL SECTION [(latch-var)] code prefix END CRITICAL SECTION
	These directives define a synchronization construct that specifies a block of code that is executed by one thread at a time. See Section 6.2.9.2, CRITICAL SECTION and END CRITICAL SECTION Directives.

prefix INSTANCE SINGLE PARALLEL /com-blk-name/[[,]/com-blk-name/]...
	This data environment directive makes named common blocks available to threads. See Section 6.2.4, Privatizing Named Common Blocks: TASKCOMMON or INSTANCE Directives.

prefix MP_SCHEDTYPE = mode
	This directive sets a default run-time schedule type. See Section 6.2.11, Specifying a Default Schedule Type.

prefix PARALLEL [region-option[[,]region-option]...] code prefix END PARALLEL
	These directives define a parallel construct that is a region of a program that must be executed by a team of threads in parallel until the END PARALLEL directive is encountered. See Section 6.2.6, Parallel Region: PARALLEL and END PARALLEL Directives.

prefix PARALLEL DO DOACROSS [par-do-option[[,]par-do-option]...] do_loop [prefix END PARALLEL DO]
	These directives define a combined parallel/worksharing construct that is an abbreviated form of specifying a parallel region that contains a single PDO directive. See Section 6.2.8.1, PARALLEL DO and END PARALLEL DO Directives.

prefix PARALLEL SECTIONS [par-sect-option[[,]par-sect-option]...] code prefix END PARALLEL SECTIONS
	These directives define a combined parallel/worksharing construct that is an abbreviated form of specifying a parallel region that contains a single PSECTIONS directive. The semantics are identical to explicitly specifying the PARALLEL directive immediately followed by a PSECTIONS directive. See Section 6.2.8.2, PARALLEL SECTIONS and END PARALLEL SECTIONS Directives.

prefix PDO [pdo-option[[,]pdo-option]...] do_loop [prefix END PDO [NOWAIT]]
	These directives define a worksharing construct that specifies that each set of iterations of the contained DO loop is a unit of work that can be scheduled on a single thread. See Section 6.2.7.1, PDO and END PDO Directives.

prefix PDONE
	This directive specifies that the DO loop in which the PDONE directive is contained should be terminated early. See Section 6.2.12, Terminating Loop Execution Early: PDONE Directive.

prefix PSECTION[S] [sect-option[[,]sect-option]...] [prefix SECTION] code [prefix SECTION code ] prefix END PSECTION[S] [NOWAIT]
	These directives define a worksharing construct that specifies that the enclosed sections of code are to be divided among threads in the team. See Section 6.2.7.2, PSECTIONS, SECTION, and END PSECTIONS Directives.

prefix SINGLE PROCESS [proc-option[[,]proc-option] ...] code prefix END SINGLE PROCESS [NOWAIT]
	These directives define a worksharing construct that specifies a block of code that is executed by only one thread. See Section 6.2.7.3, SINGLE PROCESS and END SINGLE PROCESS Directives.

prefix TASKCOMMON com-blk-name[,com-blk-name]...
	This data environment directive makes named common blocks private to a thread, but global within the thread. See Section 6.2.4, Privatizing Named Common Blocks: TASKCOMMON or INSTANCE Directives.

Original Structure	Interchanged Structure
	`!$OMP PARALLEL PRIVATE (J,I) SHARED (A)`
	`!$OMP DO`
`DO I = 1,100`	`DO J = 1,300`
`DO J = 1,300`	`DO I = 1,100`
`A(I,J) = A(I+1,J) + 1`	`A(I,J) = A(I+1,J) + 1`
`END DO`	`END DO`
`END DO`	`END DO`
	`!$OMP END DO`
	`!$OMP END PARALLEL`
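
For reference, the interchanged structure from the table can be written as a complete program along the following lines. This is a sketch under assumptions: the array bounds, the initialization, and the program name are illustrative and do not come from the manual. The array needs 101 rows because the loop body reads A(I+1,J) for I up to 100.

```fortran
! Sketch of the interchanged loop nest as a complete program.
! Bounds, initial values, and names are illustrative assumptions.
program interchange_sketch
  implicit none
  integer, parameter :: ni = 100, nj = 300
  real :: a(ni + 1, nj)        ! extra row so that a(i+1,j) stays in bounds
  integer :: i, j

  a = 0.0

  ! The dependence runs along i, so the j loop is moved outward and
  ! divided among the threads, while the i loop stays sequential
  ! within each thread.
!$omp parallel private(j, i) shared(a)
!$omp do
  do j = 1, nj
    do i = 1, ni
      a(i, j) = a(i + 1, j) + 1
    end do
  end do
!$omp end do
!$omp end parallel

  print *, a(1, 1), a(ni, nj)
end program interchange_sketch
```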

Environment Variable¹	Interpretation
`OMP_SCHEDULE`	This variable applies only to DO and PARALLEL DO directives that have the schedule type of RUNTIME. You can set the schedule type and an optional chunk size for these loops at run time. The schedule types are STATIC, DYNAMIC, and GUIDED. For directives that have a schedule type other than RUNTIME, this variable is ignored. The compiler default schedule type is STATIC. If the optional chunk size is not set, a chunk size of one is assumed, except for the STATIC schedule type. For this schedule type, the default chunk size is set to the loop iteration space divided by the number of threads applied to the loop.
`OMP_NUM_THREADS`	Use this environment variable to set the number of threads to use during execution. This number applies unless you explicitly change it by calling the `OMP_SET_NUM_THREADS` run-time library routine. When you have enabled dynamic thread adjustment, the value assigned to this environment variable represents the maximum number of threads that can be used. The default value is the number of processors in the current system.
`OMP_DYNAMIC`	Use this environment variable to enable or disable dynamic thread adjustment for the execution of parallel regions. When set to TRUE, the number of threads used can be adjusted by the run-time environment to best utilize system resources. When set to FALSE, dynamic adjustment is disabled. The default is FALSE.
`OMP_NESTED`	Use this environment variable to enable or disable nested parallelism. When set to TRUE, nested parallelism is enabled. When set to FALSE, it is disabled. The default is FALSE.
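
As an illustration of how OMP_SCHEDULE interacts with a program, the sketch below marks a loop with the RUNTIME schedule type, so its schedule would be taken from the environment when the program starts. The program name, array, and loop body are hypothetical. The setenv line in the comment assumes a csh-style shell and is one example setting, not a recommendation.

```fortran
! Sketch: a loop whose schedule is chosen at run time.
! Before running, the schedule could be set from the shell, for example:
!   % setenv OMP_SCHEDULE "dynamic,4"
! Names and sizes below are illustrative assumptions.
program runtime_schedule_sketch
  implicit none
  integer, parameter :: n = 1000
  real :: x(n)
  integer :: i

  ! SCHEDULE(RUNTIME) defers the choice of schedule type and chunk
  ! size to the OMP_SCHEDULE environment variable.
!$omp parallel do schedule(runtime) shared(x) private(i)
  do i = 1, n
    x(i) = sqrt(real(i))
  end do
!$omp end parallel do

  print *, x(1), x(n)
end program runtime_schedule_sketch
```

Loops that name any other schedule type ignore OMP_SCHEDULE, as the table entry states.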

Environment Variable¹	Interpretation
`MP_THREAD_COUNT`	Specifies the number of threads the run-time system is to create. The default is the number of processors available to your process.
`MP_CHUNK_SIZE`	Specifies the chunk size the run-time system uses when dispatching loop iterations to threads if the program specified the RUNTIME schedule type or specified another schedule type requiring a chunk size, but omitted the chunk size. The default chunk size is 1.
`MP_STACK_SIZE`	Specifies how many bytes of stack space the run-time system allocates for each thread when creating it. If you specify zero, the run-time system uses the default, which is very small. Therefore, if a program declares any large arrays to be PRIVATE, specify a value large enough to allocate them. If you do not use this environment variable at all, the run-time system allocates 5 MB.
`MP_SPIN_COUNT`	Specifies how many times the run-time system spins while waiting for a condition to become true. The default is 16,000,000, which is approximately one second of CPU time.
`MP_YIELD_COUNT`	Specifies how many times the run-time system alternates between calling sched_yield and testing the condition before going to sleep by waiting for a thread condition variable. The default is 10.
