HP OpenVMS Systems Documentation

OpenVMS MACRO-32 Porting and User's Guide

The macros $LOCKED_PAGE_START and $LOCKED_PAGE_END mark the beginning and end of a code segment that may be locked. The code delineated by these macros must contain complete routines: execution cannot fall through either macro, nor can the locked code be branched into or out of. The compiler flags any attempt to branch into or out of the locked code section, or to fall through the macros, with the following error message:

%AMAC-E-MULTLKSEC, Routines which share code must use the same linkage psect.

$LOCKED_PAGE_END has an optional parameter, LINK_SECT, which is used to specify the linkage psect to return to after the routine is executed. It is only used if the linkage psect in effect when the $LOCKED_PAGE_START macro was executed was not the default linkage psect, $LINKAGE.

The macro $LOCK_PAGE_INIT must be executed in the initialization routines of an image which is using $LOCKED_PAGE_START and $LOCKED_PAGE_END to delineate areas to be locked. It creates the necessary psects and issues the $LKWSET calls to lock the code and linkage sections into the working set. R0 and R1 are destroyed by this macro.

$LOCK_PAGE_INIT has an optional parameter, ERROR, which is an error address to which to branch if one of the $LKWSET calls fails. If this address is reached, R0 reflects the status of the failed call, and R1 contains 0 if the call to lock the code failed, or 1 if that call succeeded but the call to lock the linkage section failed.
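
As an illustration only, an error path for $LOCK_PAGE_INIT might separate the two failure cases as sketched here. The labels LOCK_FAILED, CODE_LOCK_FAILED, and LINKAGE_LOCK_FAILED are hypothetical names chosen for this sketch; only the R0 and R1 conventions come from the description above.

        $LOCK_PAGE_INIT LOCK_FAILED     ; Branch to LOCK_FAILED if a $LKWSET call fails
        .
        .
        .
LOCK_FAILED:                            ; R0 = status of the failed $LKWSET call
        TSTL    R1                      ; R1 = 0: the call to lock the code failed
        BEQL    CODE_LOCK_FAILED        ; (hypothetical label)
        BRW     LINKAGE_LOCK_FAILED     ; R1 = 1: code locked, linkage section not locked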

Note that since psects are used to identify code to be locked, the $LOCK_PAGE_INIT macro need not be in the same module as the code delineated by the $LOCKED_PAGE_START and $LOCKED_PAGE_END macros. The invocation of $LOCK_PAGE_INIT locks all delineated code in the entire image.

Table 3-1 shows the code changes necessary for using these macros. The delineating labels are replaced by the $LOCKED_PAGE_START and $LOCKED_PAGE_END macros. The descriptor is eliminated, and the $LKWSET call in the initialization code is replaced by $LOCK_PAGE_INIT.

Table 3-1 Image Initialization-Time Lockdown

Code Section      On VAX Systems                  On Alpha Systems
----------------  ------------------------------  ------------------------------
Data declaration  LOCK_DESCRIPTOR:                Nothing. Eliminate the
                    .ADDRESS LOCK_START           descriptor altogether.
                    .ADDRESS LOCK_END

Initialization      $LKWSET_S LOCK_DESCRIPTOR       $LOCK_PAGE_INIT ERROR
                    BLBC R0,ERROR

Main code         LOCK_START:                       $LOCKED_PAGE_START
                  Routine_A:                      Routine_A:
                    .                               .
                    .                               .
                    .                               .
                    RSB                             RSB
                  LOCK_END:                         $LOCKED_PAGE_END

Locking Code Written in Other Languages

Code written in other programming languages can also be locked down by using the $LOCK_PAGE_INIT macro in a VAX MACRO module. Any code in any module written in any language will be locked by this macro if the psect $LOCK_PAGE_2 is used for the generated code and the psect $LOCK_LINKAGE_2 is used for the generated linkage section.

On-the-Fly Lockdown

For on-the-fly lockdown, $LOCK_PAGE and $UNLOCK_PAGE, respectively, mark the beginning and end of a section of code to be locked. The marked code becomes a separate routine in the locked psect, where all code locked anywhere in the image is placed.

$LOCK_PAGE locks the pages and linkage section of the locked routine into the working set and JSRs to it. This macro is placed inline in executable code. All code between this macro and the matching $UNLOCK_PAGE macro is included in the locked routine and is locked down.

$UNLOCK_PAGE returns from the locked routine and then unlocks the pages and linkage section from the working set. The macro is placed inline in executable code at some point after a $LOCK_PAGE macro.

$LOCK_PAGE and $UNLOCK_PAGE both have an optional parameter, ERROR, which is an error address to which to branch if the $LKWSET or $ULWSET calls fail. $UNLOCK_PAGE has a second optional parameter, LINK_SECT. LINK_SECT is a linkage psect to which to return if the linkage psect in effect when the $LOCK_PAGE macro was executed was not the default linkage psect, $LINKAGE.

All registers are preserved by both macros unless the error address parameter is present and one of the calls fails, in which case R0 reflects the status of the failed call. R1 then contains 0 if the call to lock or unlock the code failed, and 1 if that call succeeded but the call to lock or unlock the linkage section failed.
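
The following fragment sketches how the error addresses might be supplied. It assumes that ERROR is the first positional parameter of both macros, as the $LOCK_PAGE_INIT usage in Table 3-1 suggests; LOCK_ERR and UNLOCK_ERR are hypothetical labels defined elsewhere in the module.

Routine_A: .JSB_ENTRY
        .
        $LOCK_PAGE   LOCK_ERR           ; On failure: R0 = status, R1 = 0 (code) or 1 (linkage section)
        .                               ; Code placed here runs as a separate, locked routine
        .
        $UNLOCK_PAGE UNLOCK_ERR         ; On failure: same R0/R1 convention for the unlock calls
        RSB

When neither call fails, both macros preserve all registers, so a status left in R0 by the locked code is still intact at the RSB.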

Control must enter the code through the $LOCK_PAGE macro, and must leave through the $UNLOCK_PAGE macro. The local symbol block that is in effect when the $LOCK_PAGE macro is executed is restored when the $UNLOCK_PAGE macro is executed, but since the locked code becomes a separate routine, the locked code itself is a separate local symbol block. Even if named symbols are used, branches into or out of the locked code section are not allowed, and will be flagged by the compiler with the following error:

%AMAC-E-MULTLKSEC, Routines which share code must use the same linkage psect.

Note that since the locked code is made into a separate routine, any references to local stack storage within the routine will have to be changed, as the stack context is no longer the same.

Note

Because on-the-fly lockdown requires the overhead of four system service calls plus an extra subroutine call every time it is executed, it is recommended that this be changed to initialization-time lockdown if the lockdown is done for any performance-critical code. If other routines in the image use initialization-time lockdown, then you must change the on-the-fly lockdown to initialization-time lockdown.

Table 3-2 shows the code changes required to use these macros for on-the-fly lockdown. Note that the $UNLOCK_PAGE macro precedes the RSB, so that it is executed. Any status being passed by the routine in R0 and R1 remains intact because $UNLOCK_PAGE preserves these registers.

Table 3-2 On-the-Fly Lockdown

Code Section      On VAX Systems                  On Alpha Systems
----------------  ------------------------------  ------------------------------
Main code         Routine_A:                      Routine_A: .JSB_ENTRY
                    .                               .
                    .                               .
                    .                               .
                    SETIPL 100$                     $LOCK_PAGE
                    .                               .
                    .                               .
                    .                               .
                    RSB                             $UNLOCK_PAGE
                  100$: .LONG IPL$SYNCH             RSB

Table 3-3 shows the same original code and the changes necessary for initialization-time lockdown.

Table 3-3 Image Initialization-Time Lockdown with the Same Code

Code Section      On VAX Systems                  On Alpha Systems
----------------  ------------------------------  ------------------------------
Initialization    Nothing.                          $LOCK_PAGE_INIT

Main code         Routine_A:                        $LOCKED_PAGE_START
                    .                             Routine_A: .JSB_ENTRY
                    .                               .
                    .                               .
                    SETIPL 100$                     .
                    .                               RSB
                    .                               $LOCKED_PAGE_END
                    .
                    RSB
                  100$: .LONG IPL$SYNCH

3.11 Synchronization

The following statements and recommendations regarding synchronization are relevant to the porting of code from VAX systems to Alpha systems:

Code that issues longword operations to aligned longwords in memory continues to work on Alpha systems without additional synchronization required. This is architecturally guaranteed.
The Alpha architecture extends this guarantee to include quadword operations to aligned quadwords in memory. However, this is not backwards-compatible to VAX systems. Only Alpha code can depend upon this feature.
Interlocked instructions (BBSSI, BBCCI, and ADAWI) still work. However, keep the following in mind when you use them:
1. When compiling these instructions, the MACRO-32 compiler provides memory barrier functionality implicitly.
2. These instructions assume a byte granularity environment. If the data segment on which these instructions operate can be concurrently written by different threads, you may need to impose additional synchronization of the data segment using the MACRO-32 compiler's PRESERVE feature.
3. Another way to address the byte granularity problem and achieve greater performance at the same time is to restructure the data segments to be unpacked. That is, the bit that is changed by BBSSI or BBCCI, or the word that is modified by ADAWI, should reside in a longword where the other portions of the longword are not modified by an independent and concurrent instruction thread.
  Further separation of the data in question, such that independent and concurrent access to any location in the aligned 128-byte lock range that contains the data is not occurring, will result in additional performance gains on many Alpha implementations of the Load-locked/Store-conditional instructions.
The VAX interlocked queue instructions work unchanged on Alpha systems. They result in calls to the PALcode equivalents, which incorporate the necessary interlocks and memory barriers.
Note that the noninterlocked queue instructions are also compiled to their PALcode equivalents and that they are still atomic on a single processor.

The VAX synchronization tools work unchanged on Alpha. All of the following mechanisms use interlocked instructions directly or indirectly for synchronization. The interlocked instructions that are used provide memory barriers transparently.

Event Flags: all of the system services that manipulate them
Spin locks: all of the acquisition and release operators (LOCK and UNLOCK, FORKLOCK and FORKUNLOCK, DEVICELOCK and DEVICEUNLOCK)
Mutexes: protected by spin locks
Lock Manager: protected by spin locks

Note

This synchronization guarantee is only true for multiprocessing systems. The uniprocessing version of spin locks does not use interlocked instructions. As a result, memory barriers are not provided in uniprocessor spin lock, mutex, and lock manager synchronization.

Regarding ASTs, concurrent threads and atomicity, one must either redesign the code or force atomicity using features provided by the compilers. The MACRO-32 compiler provides the PRESERVE feature.

Code that modifies exception handlers may require changes if it is possible for an outstanding arithmetic trap or a machine abort or both to occur asynchronously. The TRAPB and DRAINA instructions provide the synchronization mechanisms that are required. If you want to force synchronization when changing handlers, you must manually add these to your program as shown in the following example:

        ADDL3   R1, R2, 4(R3)      ; Save total
        EVAX_TRAPB                 ; Ensure any arithmetic traps are handled by
                                   ; the existing handler
        MOVAB   HANDLER2, 0(FP)    ; Enable new condition handler

When writing OpenVMS Alpha assembly language code, make sure that you understand the read/write ordering of the Alpha architecture. Encode MB instructions where necessary.

Chapter 4
Improving the Performance of Ported Code

This chapter describes how you can improve the performance of your ported code. The topics described in this chapter follow:

Aligning data (Section 4.1)
Code flow and branch prediction (Section 4.2)
Code optimization (Section 4.3)
Common-based referencing (Section 4.4)

4.1 Aligning Data

An unaligned data reference will work but will be slow on OpenVMS Alpha, because the system must take an unaligned address fault to complete the unaligned reference. If it is known that a data reference is unaligned, the compiler can generate unaligned quadword loads and masks to manually extract the data. This is slower than an aligned load but much faster than taking an alignment fault. Global data labels that are not longword or quadword aligned are flagged with information-level messages.

In addition, unaligned memory modification references cannot be made atomic with /PRESERVE=ATOMICITY or .PRESERVE ATOMICITY. If this is attempted, it will cause a fatal reserved operand fault.

4.1.1 Alignment Assumptions

By default, the compiler assumes the following:

Addresses in registers used as base pointers are longword aligned at routine entry
External references are longword aligned
Addresses that result from certain types of instructions, such as DIVL, are unaligned

Every time a register is changed, the compiler determines whether the base address in the register is still aligned. If the register and specified offset result in an aligned address, the compiler uses an aligned load or store for a memory reference. The compiler attempts to track register usage in terms of whether the base address remains aligned. When a stored memory address is loaded, for instance, MOVL 4(R1),R0, or used indirectly, for instance, MOVL @4(R1),R0, the compiler assumes the resulting address is aligned.
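
The fragment below sketches how these default assumptions would apply to a few references. The routine name and register choices are hypothetical, and the comments describe the behavior expected from the rules above, not actual compiler output.

Routine_B: .JSB_ENTRY
        MOVL    4(R1),R0        ; R1 is assumed longword aligned at entry, and
                                ;  offset 4 keeps the address aligned: aligned load
        DIVL3   #3,R2,R4        ; Result of DIVL is assumed unaligned
        MOVL    (R4),R5         ; Base register R4 unaligned: unaligned load sequence
        MOVL    8(R1),R6        ; A stored address loaded from memory ...
        MOVL    (R6),R7         ;  ... is assumed aligned: aligned load
        RSB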

For quadword memory references such as MOVQ instructions, the compiler assumes the base address is quadword aligned, unless it has determined by means of its register tracking code that the address may not be longword aligned. In other words, quadword register alignment is not tracked; only longword alignment is.

Quadword references in Alpha built-ins, such as those in the following example, will be in new code, where alignment should be correct. Therefore all memory references in the following example will use aligned quadword load/stores:

EVAX_LDQ  R1, (R2)
EVAX_ADDQ  (R1), #1, (R3)

If an Alpha built-in (other than EVAX_LDQU or EVAX_STQU) is used on an address that is not quadword aligned, an alignment fault will occur at run time.

4.1.2 Directives and Qualifier for Changing Alignment Assumptions

The compiler provides two directives and one qualifier for changing the compiler's alignment assumptions. Both directives enable the compiler to produce more efficient code. The .SET_REGISTERS directive allows you to specify whether a register is aligned or unaligned. This directive should be used when the result of an operation is the reverse of what the compiler expects. It also allows you to declare registers that the compiler would not otherwise detect as input or output registers.

The .SYMBOL_ALIGNMENT directive allows you to specify the alignment of any memory reference that uses a symbolic offset. This directive should be used when you know the data will be aligned for every use of the symbolic offset.

These directives are described in detail in Appendix B. The examples in each description show how to use them.
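
As a brief sketch of how the two directives might appear in source code, consider the fragment below. The argument forms shown (ALIGNED= for .SET_REGISTERS, and the QUAD and NONE keywords for .SYMBOL_ALIGNMENT) are assumptions to be confirmed against Appendix B, and BLK_Q_COUNT is a hypothetical structure offset.

        DIVL3   R0,R2,R1                ; Compiler assumes R1 is unaligned after DIVL
        .SET_REGISTERS ALIGNED=<R1>     ; Programmer knows R1 is longword aligned
        MOVL    8(R1),R3                ; Compiler can now use an aligned load

        .SYMBOL_ALIGNMENT QUAD          ; Offsets defined from here mark quadword-aligned data
BLK_Q_COUNT = 16                        ; (hypothetical structure offset)
        .SYMBOL_ALIGNMENT NONE          ; Return to the compiler's normal assumptions
        CLRQ    BLK_Q_COUNT(R8)         ; Aligned quadword store, whatever is known about R8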

The /UNALIGN qualifier to the MACRO/MIGRATION command tells the compiler to assume unaligned all the time for all register-based memory references rather than try to track the alignment. This does not affect stack-based or static references where the compiler knows the alignment.

This qualifier is described in detail in Appendix A.

4.1.3 Precedence of Alignment Controls

The order of precedence of the compiler's alignment controls, from strongest (.SYMBOL_ALIGNMENT) to weakest (built-in assumptions and tracking mechanisms), follows:

.SYMBOL_ALIGNMENT directive
.SET_REGISTERS directive
/UNALIGN qualifier
Built-in assumptions and tracking mechanisms

4.1.4 Recommendations for Aligning Data

The following recommendations are provided for aligning data:

If references to the data must be made atomic with /PRESERVE=ATOMICITY or .PRESERVE ATOMICITY, the data must be aligned.
Do not fix alignment problems in public interfaces; this could break existing programs.
For data in internal or privileged interfaces, do not automatically make changes to improve data alignment. You should consider the frequency with which the data structure is accessed, the amount of work involved in realigning the structure, and the risk that things might go wrong. In judging the amount of work involved, make sure you know all accesses to the data; do not just guess. If you own all accesses in the code for which you are responsible and if you are making changes in the module (or modules) anyway, then it is safe to fix the alignment problem.
Do not routinely unpack byte and word data into longwords or quadwords. The time to do this is when you are fixing an alignment problem (word not on word boundary), subject to the aforementioned cautions and constraints, or if you know the data granularity is a problem.
If you do not own all the accesses to the data, there still may be circumstances under which fixing alignment is appropriate. If the data is frequently accessed, if performance is a real issue, and if you must unavoidably scramble the data structure anyway, it makes sense to align the structure at the same time.
It is important that you notify other programmers whose code may be affected. Do not assume in such cases that all related modules will recompile or that program documentation will help others detect errant data cell separation assumptions. Always assume that changes like this will reveal irregular programming practices and not go smoothly.

4.2 Code Flow and Branch Prediction

The Alpha architecture is pipelined, which means that before completing the current instruction, it starts to execute several instructions beyond it. By tailoring the code to keep the pipeline filled, you can make the code run significantly faster.

On each conditional branch, the Alpha architecture attempts to predict whether or not the branch is taken so that it can correctly fill the instruction pipeline with the next instruction to be executed. The architecture predicts that forward conditional branches will not be taken and backward conditional branches will be taken. A mispredicted branch costs extra time because the pipeline must be flushed, and, in addition, the instruction at the branch destination may not be in the instruction cache.
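
The fragment below illustrates the prediction rule on a simple loop followed by a test. It is a sketch in VAX MACRO source form and assumes the generated Alpha branches keep the same directions as the source branches, which is the usual case for code like this.

        MOVL    #16,R0                  ; Loop count
10$:    MOVL    (R1)+,(R2)+             ; Copy one longword
        SOBGTR  R0,10$                  ; Backward conditional branch: predicted taken
        TSTL    R3
        BEQL    20$                     ; Forward conditional branch: predicted not taken
        .                               ; Fall-through path, assumed to be the likely one
        .
20$:    .                               ; Branch destination, assumed to be the less likely path

The loop branch is taken on every iteration but the last, so the prediction is wrong only once per loop. The forward BEQL matches the prediction only if R3 is usually nonzero.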

The compiler tries to follow the flow of the VAX MACRO code to generate Alpha code that has the most common code path in a contiguous block, to allow the pipelined Alpha architecture to process the code with the greatest efficiency. However, in some situations, the compiler's default rules do not generate the most efficient code. In performance sensitive code sections, you can often improve the efficiency of the generated code by giving the compiler information about which code paths will most likely be taken.

4.2.1 Default Code Flow and Branch Prediction

Generally, the compiler generates Alpha code that follows unconditional VAX MACRO branches and falls through conditional VAX MACRO branches unless it is directed otherwise. For example, consider the following VAX MACRO code sequence:

        (Code block A)
        BLBS    R0,10$
        (Code block B)
10$:
        (Code block C)
        BRB     30$
20$:
        .
        (Code block D)
        .
30$:
        (Code block E)

The Alpha code generated for this sequence looks like the following:

        (Code block A)
        BLBS    R0,10$
        (Code block B)
10$:
        (Code block C)
30$:
        (Code block E)

Note that the compiler fell through the BLBS instruction, continuing with the instructions immediately following the BLBS. At the BRB instruction, it did not generate a branch instruction at all but followed the Alpha code generated from Code block C with the Alpha code generated from Code block E, at the branch destination. Code from Code block D at label 20$ will be generated at a later point in the routine. If there is no branch to the label 20$, the compiler will report the following informational message and will not generate Alpha code for Code block D:

UNRCHCODE, unreachable code

In most cases, this algorithm produces Alpha code that matches the assumptions of the architecture:

If a conditional branch is backward in the VAX MACRO code, then the destination likely has been generated already in the Alpha code, and so the generated branch will also be backward.
If the conditional branch is forward in the VAX MACRO code, then the destination will likely not have been generated yet in the Alpha code, and so the generated branch will also be forward.

However, because the compiler follows unconditional branches, the destination of a backward VAX MACRO branch may not have been generated yet. In this case, a conditional branch that was backward in the VAX MACRO source code may become a forward branch in the generated Alpha code. See Section 4.2.5 for a further discussion and resolution of this problem.
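
To illustrate, consider a loop that is entered by a forward jump. The block names and the generated layout below are a sketch derived from the default rules described above, not actual compiler output.

        (Code block A)
        BRB     20$
10$:
        (Code block B)
20$:
        (Code block C)
        BLBS    R0,10$          ; Backward conditional branch in the source
        (Code block D)

Because the compiler follows the BRB, it generates Code block C immediately after Code block A, before it has generated anything for label 10$:

        (Code block A)
20$:
        (Code block C)
        BLBS    R0,10$          ; Now a forward branch: 10$ has not been generated yet
        (Code block D)
        .
        .
10$:
        (Code block B)
        (branch back to 20$)

The loop branch, which is normally taken, is now a forward branch and is therefore predicted as not taken.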

There are some cases where the compiler may assume that a forward branch is taken. For example, consider the following common VAX MACRO coding practice:

       JSB   XYZ            ;Call a routine
       BLBS  R0,10$         ;Branch to continue on success
       BRW   ERROR          ;Destination too far for byte offset
10$:

In this case, and any case where the inline code following the branch is only a few lines and does not rejoin the code flow at the branch destination, the forward branch is considered taken. This eliminates the delay that occurs on Alpha systems for a mispredicted branch. The compiler will automatically change the sense of the branch, and will move the code between the branch and the label out of line to a point beyond the normal exit of the routine. For this example it would generate the following code:

        JSR     XYZ
        BLBC    R0,$L1
10$:
        .
        .
        .
        (routine exit)

$L1:    BRW     ERROR
