HP OpenVMS Systems Documentation

OpenVMS Programming Concepts Manual

6.3.2 Multiprocessor Operations

On multiprocessor systems, you must use special methods to ensure that a read-modify-write sequence is atomic. On VAX systems, interlocked instructions provide synchronization; on Alpha systems, load-locked and store-conditional instructions provide synchronization.

On VAX systems, a number of uninterruptible instructions are provided that both read and write memory with one instruction. When used with an operand type that is accessible in a single memory operation, each instruction provides an atomic read-modify-write sequence. The sequence is atomic with respect to threads of execution on the same VAX processor, but it is not atomic with respect to threads on other processors. For instance, when a VAX CPU executes the instruction INCL x, it issues two separate commands to memory: a read, followed by a write of the incremented value. Another thread of execution running concurrently on another processor could issue a command to memory that reads or writes location x between the INCL's read and write. Section 6.4.4 describes read-modify-write sequences that are atomic with respect to threads on all VAX CPUs in an SMP system.
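The interleaving described above can be modeled in a few lines. The following Python sketch is a toy model, not VAX code: it treats each INCL as a separate memory read and memory write, and the function name `run` and the CPU labels are illustrative assumptions.

```python
def run(schedule, initial=0):
    """Apply (cpu, op) steps to one shared location x and return its final value.

    Each increment is modeled as two memory commands, as the text describes:
    a "read" into a CPU-private register, then a "write" of register + 1.
    """
    memory = initial
    registers = {}                          # per-CPU private copy of x
    for cpu, op in schedule:
        if op == "read":
            registers[cpu] = memory         # read command
        else:
            memory = registers[cpu] + 1     # write of the incremented value
    return memory


# Each CPU completes its increment before the other starts.
serial = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
# CPU B reads x between CPU A's read and write.
interleaved = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

print(run(serial))        # 2
print(run(interleaved))   # 1: one increment is lost
```

In the interleaved schedule both CPUs read 0, so both write 1 and one update is lost. This is the failure that a memory interlock prevents.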

On a VAX multiprocessor system, an atomic update requires an interlock at the level of the memory subsystem. To perform that interlock, the VAX architecture provides a set of interlocked instructions that include Add Aligned Word Interlocked (ADAWI), Remove from Queue Head Interlocked (REMQHI), and Branch on Bit Set and Set Interlocked (BBSSI).

If you code in VAX MACRO, the assembler generates exactly the instructions you specify. If you code in a high-level language, you cannot assume that the compiler will compile a particular language statement into a specific code sequence. That is, you must tell the compiler explicitly to generate an atomic update. For further information, see the documentation for your high-level language.

On Alpha systems, there is no single instruction that performs an atomic read-modify-write operation. An atomic read-modify-write operation is possible only through a sequence that includes load-locked and store-conditional instructions (see Section 6.4.2). Use of these instructions provides a read-modify-write operation on data within one aligned longword or quadword that is atomic with respect to threads on all Alpha CPUs in an SMP system.

6.4 Hardware-Level Synchronization

On VAX systems, the following features assist with synchronization at the hardware level:

Atomic memory references
Noninterruptible instructions
Interrupt priority level (IPL)
Interlocked memory accesses

On VAX systems, many read-modify-write instructions, including queue manipulation instructions, are noninterruptible. These instructions provide an atomic update capability on a uniprocessor. A kernel-mode code thread can block interrupt and process-based threads of execution by raising the IPL. Hence, it can execute a sequence of instructions atomically with respect to the blocked threads on a uniprocessor. Threads of execution that run on multiple processors of an SMP system synchronize access to shared data with read-modify-write instructions that interlock memory.

On Alpha systems, some of these mechanisms are present, while others have been implemented in PALcode routines.

Alpha processors provide several features to assist with synchronization. Even though all instructions that access memory are noninterruptible, no single one performs an atomic read-modify-write. A kernel-mode thread of execution can raise the IPL in order to block other threads on that processor while it performs a read-modify-write sequence or while it executes any other group of instructions. Code that runs in any access mode can execute a sequence of instructions that contains load-locked (LDx_L) and store-conditional (STx_C) instructions to perform a read-modify-write sequence that appears atomic to other threads of execution. Memory barrier instructions order a CPU's memory reads and writes from the viewpoint of other CPUs and I/O processors. Other synchronization mechanisms are provided by PALcode routines.

The sections that follow describe the features of interrupt priority level, load-locked (LDx_L) and store-conditional (STx_C) instructions, memory barriers, interlocked instructions, and PALcode routines.

6.4.1 Interrupt Priority Level

The operating system in a uniprocessor system synchronizes access to systemwide data structures by requiring that all threads sharing data run at the IPL of the highest-priority interrupt that causes any of them to execute. Thus, a thread's access to data cannot be interrupted by any other thread that accesses the same data.

The IPL is a processor-specific mechanism. Raising the IPL on one processor has no effect on another processor. You must use a different synchronization technique on SMP systems where code threads run concurrently on different CPUs that must have synchronized access to shared system data.

On VAX systems, the code threads that run concurrently on different processors synchronize through instructions that interlock memory in addition to raising the IPL. Memory interlocks also synchronize access to data shared by an I/O processor and a code thread.

On Alpha systems, access to a data structure that is shared either by executive code running concurrently on different CPUs or by an I/O processor and a code thread must be synchronized through a load-locked/store-conditional sequence.

6.4.2 LDx_L and STx_C Instructions (Alpha Only)

Because Alpha systems do not provide a single instruction that both reads and writes memory, or a mechanism to interlock memory against other interlocked accesses, you must use other synchronization techniques. Alpha systems provide the load-locked/store-conditional mechanism, which allows a sequence of instructions to perform an atomic read-modify-write operation.

Load-locked (LDx_L) and store-conditional (STx_C) instructions guarantee atomicity that is functionally equivalent to that of VAX systems. The LDx_L and STx_C instructions can be used only on aligned longwords or aligned quadwords. The LDx_L and STx_C instructions do not provide atomicity by blocking access to shared data by competing threads. Instead, when the LDx_L instruction executes, a CPU-specific lock bit is set. Before the data can be stored, the CPU uses the STx_C instruction to check the lock bit. If another thread has accessed the data item in the time since the load operation began, the lock bit is cleared and the store is not performed. Clearing the lock bit signals the code thread to retry the load operation. That is, a load-locked/store-conditional sequence tests the lock bit to see whether the store succeeded. If it did not succeed, the sequence branches back to the beginning to start over. This loop repeats until the data is untouched by other threads during the operation.

By using the LDx_L and STx_C instructions together, you can construct a code sequence that performs an atomic read-modify-write operation to an aligned longword or quadword. Rather than blocking other threads' modifications of the target memory, the code sequence determines whether the memory locked by the LDx_L instruction could have been written by another thread during the sequence. If it is written, the sequence is repeated. If it is not written, the store is performed. If the store succeeds, the sequence is atomic with respect to other threads on the same processor and on other processors. The LDx_L and STx_C instructions can execute in any access mode.
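The lock-flag behavior and retry loop described above can be sketched as a small simulation. This is a toy Python model under stated assumptions, not Alpha code: the class `LockedMemory`, the function `atomic_add`, and the `interfere` hook are hypothetical names introduced for illustration, and the model tracks one locked address per CPU rather than the lock granularity of real hardware.

```python
class LockedMemory:
    """Toy model of per-CPU lock flags for load-locked/store-conditional."""

    def __init__(self):
        self.cells = {}   # address -> value
        self.locks = {}   # cpu id -> locked address (absent when the flag is clear)

    def load_locked(self, cpu, address):
        """Model of LDx_L: set this CPU's lock flag and return the data."""
        self.locks[cpu] = address
        return self.cells.get(address, 0)

    def store(self, cpu, address, value):
        """Ordinary store: clears every other CPU's lock flag on this address."""
        self.cells[address] = value
        for other in [c for c, a in self.locks.items() if a == address and c != cpu]:
            del self.locks[other]

    def store_conditional(self, cpu, address, value):
        """Model of STx_C: store only if the lock flag is still set."""
        if self.locks.pop(cpu, None) != address:
            return False                     # flag was cleared; nothing is stored
        self.store(cpu, address, value)
        return True


def atomic_add(memory, cpu, address, amount, interfere=None):
    """Retry until the store succeeds; return the number of attempts.

    `interfere` is a hypothetical test hook that runs between the load
    and the store, standing in for activity on another CPU.
    """
    attempts = 0
    while True:
        attempts += 1
        value = memory.load_locked(cpu, address)
        if interfere is not None:
            interfere(attempts)
        if memory.store_conditional(cpu, address, value + amount):
            return attempts
```

If another CPU writes the location during the first attempt, the conditional store fails, the loop reloads the new value, and the second attempt succeeds. No thread is ever blocked; the cost of contention is a repeated sequence.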

On VAX systems, interlocked instructions are traditionally used for multiprocessor synchronization. On Alpha systems, LDx_L and STx_C instructions implement interlocks and can be used for uniprocessor synchronization as well. To achieve protection similar to the VAX interlock protection, you need to use memory barriers along with the load-locked and store-conditional instructions.

Some Alpha system compilers make the LDx_L and STx_C instruction mechanism available as language built-in functions. For example, Compaq C on Alpha systems includes a set of built-in functions that provides for atomic addition and for logical AND and OR operations. Alpha system compilers also make the mechanism available implicitly, because they use the LDx_L and STx_C instructions to access data that is declared, in a language-specific way, as requiring atomic access.

6.4.3 Using Interlocked Memory Instructions (Alpha Only)

The Alpha Architecture Reference Manual, Third Edition (AARM) describes strict rules for using interlocked memory instructions. The new Alpha 21264 (EV6) processor and all future Alpha processors are more stringent than their predecessors in their requirement that these rules be followed. As a result, code that has worked in the past, despite noncompliance, could fail when executed on systems featuring the new 21264 processor. Occurrences of these noncompliant code sequences are believed to be rare. Note that the 21264 processor is not supported on versions prior to OpenVMS Alpha Version 7.1-2.

Noncompliant code can result in a loss of synchronization between processors when interprocessor locks are used, or can result in an infinite loop when an interlocked sequence always fails. Such behavior has occurred in some code sequences in programs compiled on old versions of the BLISS compiler, some versions of the MACRO-32 compiler and the MACRO-64 assembler, and in some Compaq C and Compaq C++ programs.

For recommended compiler versions, see Section 6.4.3.5.

The affected code sequences use LDx_L/STx_C instructions, either directly in assembly language sources or in code generated by a compiler. Applications most likely to use interlocked instructions are complex, multithreaded applications or device drivers using highly optimized, hand-crafted locking and synchronization techniques.

6.4.3.1 Required Code Checks

OpenVMS recommends that code that will run on the 21264 processor be checked for these sequences. Particular attention should be paid to any code that does interprocess locking, multithreading, or interprocessor communication.

The SRM_CHECK tool (named after the System Reference Manual, which defines the Alpha architecture) has been developed to analyze Alpha executables for noncompliant code sequences. The tool detects sequences that might fail, reports any errors, and displays the machine code of the failing sequence.

6.4.3.2 Using the Code Analysis Tool

The SRM_CHECK tool can be found in the following location on the OpenVMS Alpha Version 7.2 Operating System CD-ROM:

SYS$SYSTEM:SRM_CHECK.EXE

To run the SRM_CHECK tool, define it as a foreign command (or use the DCL$PATH mechanism) and invoke it with the name of the image to check. If a problem is found, the machine code is displayed and some image information is printed. The following example illustrates how to use the tool to analyze an image called myimage.exe:

$ define DCL$PATH []
$ srm_check myimage.exe

The tool supports wildcard searches. Use the following command line to initiate a wildcard search:

$ srm_check [*...]* -log

Use the -log qualifier to generate a list of images that have been checked. You can use the -output qualifier to write the output to a data file. For example, the following command directs output to a file named CHECK.DAT:

$ srm_check 'file' -output check.dat

You can use the output from the tool to find the module that generated the sequence by looking in the image's MAP file. The addresses shown correspond directly to the addresses that can be found in the MAP file.

The following example illustrates the output from using the analysis tool on an image named SYSTEM_SYNCHRONIZATION.EXE:


 ** Potential Alpha Architecture Violation(s) found in file...
 ** Found an unexpected ldq at 00003618
 0000360C   AD970130     ldq_l          R12, 0x130(R23)
 00003610   4596000A     and            R12, R22, R10
 00003614   F5400006     bne            R10, 00003630
 00003618   A54B0000     ldq            R10, (R11)
 Image Name:    SYSTEM_SYNCHRONIZATION
 Image Ident:   X-3
 Link Time:      5-NOV-1998 22:55:58.10
 Build Ident:   X6P7-SSB-0000
 Header Size:   584
 Image Section: 0, vbn: 3, va: 0x0, flags: RESIDENT EXE (0x880)

The MAP file for system_synchronization.exe contains the following:

   EXEC$NONPAGED_CODE       00000000 0000B317 0000B318 (      45848.) 2 **  5
   SMPROUT         00000000 000047BB 000047BC (      18364.) 2 **  5
   SMPINITIAL      000047C0 000061E7 00001A28 (       6696.) 2 **  5

The address 360C is in the SMPROUT module, which contains the addresses from 0 to 47BB. By looking at the machine code output from the module, you can locate the code and use the listing line number to identify the corresponding source code. If SMPROUT had a nonzero base, it would be necessary to subtract the base from the address (360C in this case) to find the relative address in the listing file.
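The lookup and base subtraction just described can be sketched as follows. This Python fragment is only an illustration: the function name `locate` and the table layout are assumptions, and the module ranges are copied by hand from the MAP excerpt above rather than parsed from a MAP file.

```python
# (module name, base address, end address) rows from the MAP excerpt.
MODULES = [
    ("SMPROUT", 0x00000000, 0x000047BB),
    ("SMPINITIAL", 0x000047C0, 0x000061E7),
]


def locate(address, modules=MODULES):
    """Return (module name, offset from the module base), or None if unmatched."""
    for name, base, end in modules:
        if base <= address <= end:
            return name, address - base   # offset to look up in the listing file
    return None


name, offset = locate(0x360C)
print(name, format(offset, "X"))   # SMPROUT 360C
```

Because SMPROUT has a base of zero, the offset equals the reported address. For an address in SMPINITIAL, the same function subtracts the base 47C0 before returning the offset.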

Note that the tool reports potential violations in its output. Although SRM_CHECK can normally identify a code section in an image by the section's attributes, it is possible for OpenVMS images to contain data sections with those same attributes. As a result, SRM_CHECK may scan data as if it were code, and occasionally, a block of data may look like a noncompliant code sequence. This circumstance is rare and can be detected by examining the MAP and listing files.

6.4.3.3 Characteristics of Noncompliant Code

The areas of noncompliance detected by the SRM_CHECK tool can be grouped into the following four categories. Most of these can be fixed by recompiling with new compilers. In rare cases, the source code may need to be modified. See Section 6.4.3.5 for information about compiler versions.

Some versions of OpenVMS compilers introduce noncompliant code sequences during an optimization called "loop rotation." This problem can only be triggered in C or C++ programs that use LDx_L/STx_C instructions in assembly language code that is embedded in the C/C++ source using the ASM function, or in assembly language written in MACRO-32 or MACRO-64. In some cases, a branch was introduced between the LDx_L and STx_C instructions. This can be addressed by recompiling.
Some code compiled with very old BLISS, MACRO-32, or DEC Pascal compilers may contain noncompliant sequences. Early versions of these compilers contained a code scheduling bug where a load was incorrectly scheduled after a load_locked. This can be addressed by recompiling.
In rare cases, the MACRO-32 compiler may generate a noncompliant code sequence for a BBSSI or BBCCI instruction where there are too few free registers. This can be addressed by recompiling.
Errors may be generated by incorrectly coded MACRO-64 or MACRO-32 and incorrectly coded assembly language embedded in C or C++ source using the ASM function. This requires source code changes. The new MACRO-32 compiler flags noncompliant code at compile time.

If the SRM_CHECK tool finds a violation in an image, the image should be recompiled with the appropriate compiler (see Section 6.4.3.5). After recompiling, the image should be analyzed again. If violations remain after recompiling, the source code must be examined to determine why the code scheduling violation exists. Modifications should then be made to the source code.

6.4.3.4 Coding Requirements

The Alpha Architecture Reference Manual describes how an atomic update of data between processors must be formed. The Third Edition, in particular, has much more information on this topic.

Exceptions to the following two requirements are the source of all known noncompliant code:

There cannot be a memory operation (load or store) between the LDx_L (load locked) and STx_C (store conditional) instructions in an interlocked sequence.
There cannot be a branch taken between an LDx_L and an STx_C instruction. Rather, execution must "fall through" from the LDx_L to the STx_C without taking a branch.
Any branch whose target is between an LDx_L and matching STx_C creates a noncompliant sequence. For instance, any branch to "label" in the following example would result in noncompliant code, regardless of whether the branch instruction itself was within or outside of the sequence:
        LDx_L   Rx, n(Ry)
        ...
label:  ...
        STx_C   Rx, n(Ry)

Therefore, the SRM_CHECK tool looks for the following:

Any memory operation (LDx/STx) between an LDx_L and an STx_C
Any branch that has a destination between an LDx_L and an STx_C
STx_C instructions that do not have a preceding LDx_L instruction
This typically indicates that a backward branch is taken from an LDx_L to the STx_C. Note that hardware device drivers that do device mailbox writes are an exception. These drivers use the STx_C to write the mailbox. This condition is found only on early Alpha systems and not on PCI-based systems.
Excessive instructions between an LDx_L and an STx_C
The AARM recommends that no more than 40 instructions appear between an LDx_L and an STx_C. In theory, more than 40 instructions can cause hardware interrupts to keep the sequence from completing. However, there are no known occurrences of this.
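Three of the checks listed above can be sketched in Python. This is a simplified illustration, not the SRM_CHECK implementation: the function name and message wording are assumptions, only the plain longword and quadword integer loads and stores are treated as memory operations, and branch targets are not followed. The major opcode is taken from the top 6 bits of each 32-bit instruction word.

```python
# Major opcodes (bits 31:26 of an Alpha instruction word).
LOAD_LOCKED = {0x2A: "ldl_l", 0x2B: "ldq_l"}
STORE_CONDITIONAL = {0x2E: "stl_c", 0x2F: "stq_c"}
# Simplification: only these four plain loads/stores are recognized.
PLAIN_MEMORY = {0x28: "ldl", 0x29: "ldq", 0x2C: "stl", 0x2D: "stq"}
MAX_BETWEEN = 40   # recommended limit cited in the text


def check_sequence(code):
    """Scan (address, word) pairs in address order; return a list of findings."""
    findings = []
    locked_at = None                    # index of the open LDx_L, if any
    for index, (address, word) in enumerate(code):
        opcode = (word >> 26) & 0x3F
        if opcode in LOAD_LOCKED:
            locked_at = index
        elif opcode in STORE_CONDITIONAL:
            name = STORE_CONDITIONAL[opcode]
            if locked_at is None:
                findings.append(f"{name} without preceding LDx_L at {address:08X}")
            elif index - locked_at - 1 > MAX_BETWEEN:
                findings.append(f"more than {MAX_BETWEEN} instructions before {name} at {address:08X}")
            locked_at = None
        elif opcode in PLAIN_MEMORY and locked_at is not None:
            findings.append(f"unexpected {PLAIN_MEMORY[opcode]} at {address:08X}")
    return findings


# Instruction words from the SYSTEM_SYNCHRONIZATION.EXE output in Section 6.4.3.2.
sample = [
    (0x0000360C, 0xAD970130),   # ldq_l
    (0x00003610, 0x4596000A),   # and
    (0x00003614, 0xF5400006),   # bne
    (0x00003618, 0xA54B0000),   # ldq
]
print(check_sequence(sample))   # ['unexpected ldq at 00003618']
```

A scan in address order cannot tell a rotated loop from a missing LDx_L, which is consistent with the observation above that an STx_C with no preceding LDx_L usually points to a backward branch.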

To illustrate, the following are examples of code flagged by SRM_CHECK.

        ** Found an unexpected ldq at 0008291C
        00082914   AC300000     ldq_l          R1, (R16)
        00082918   2284FFEC     lda            R20, 0xFFEC(R4)
        0008291C   A6A20038     ldq            R21, 0x38(R2)

In the above example, an LDQ instruction was found after an LDQ_L and before the matching STQ_C. The LDQ must be moved out of the sequence, either by recompiling or by source code changes. (See Section 6.4.3.3.)

        ** Backward branch from 000405B0 to a STx_C sequence at 0004059C
        00040598   C3E00003     br             R31, 000405A8
        0004059C   47F20400     bis            R31, R18, R0
        000405A0   B8100000     stl_c          R0, (R16)
        000405A4   F4000003     bne            R0, 000405B4
        000405A8   A8300000     ldl_l          R1, (R16)
        000405AC   40310DA0     cmple          R1, R17, R0
        000405B0   F41FFFFA     bne            R0, 0004059C

In the above example, a branch was discovered between the LDL_L and STL_C. In this case, there is no "fall through" path between the LDx_L and STx_C, which the architecture requires.

Note

This branch backward from the LDx_L to the STx_C is characteristic of the noncompliant code introduced by the "loop rotation" optimization.
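The branch targets shown in the listing above can be reproduced from the instruction words. The following Python sketch assumes the Alpha branch format, in which the low 21 bits hold a signed displacement counted in longwords from the address of the next instruction; the function name is illustrative.

```python
def branch_target(address, word):
    """Return the target address of an Alpha branch-format instruction."""
    displacement = word & 0x1FFFFF           # low 21 bits of the word
    if displacement & 0x100000:              # sign bit of the 21-bit field
        displacement -= 0x200000
    return address + 4 + 4 * displacement    # relative to the updated PC


# The three branches from the listing above.
for address, word in [(0x00040598, 0xC3E00003),
                      (0x000405A4, 0xF4000003),
                      (0x000405B0, 0xF41FFFFA)]:
    print(f"{address:08X} -> {branch_target(address, word):08X}")
# 00040598 -> 000405A8
# 000405A4 -> 000405B4
# 000405B0 -> 0004059C
```

The last line shows the backward branch: the BNE at 000405B0, which executes after the LDL_L at 000405A8, transfers control to 0004059C, one instruction ahead of the STL_C at 000405A0.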

The following MACRO-32 source code demonstrates code where there is a "fall through" path, but this case is still noncompliant because of the potential branch and a memory reference in the lock sequence.

        getlck: evax_ldql  r0, lockdata(r8)  ; Get the lock data
                movl       index, r2         ; and the current index.
                tstl       r0                ; If the lock is zero,
                beql       is_clear          ; skip ahead to store.
                movl       r3, r2            ; Else, set special index.
        is_clear:
                incl       r0                ; Increment lock count
                evax_stqc  r0, lockdata(r8)  ; and store it.
                tstl       r0                ; Did store succeed?
                beql       getlck            ; Retry if not.

To correct this code, the memory access to read the value of INDEX must first be moved outside the LDQ_L/STQ_C sequence. Next, the branch between the LDQ_L and STQ_C, to the label IS_CLEAR, must be eliminated. In this case, it could be done using a conditional move instruction. Because the original code replaces the index only when the lock data is nonzero, the matching instruction is CMOVNE. The CMOVxx instructions are frequently useful for eliminating branches around simple value moves. The following example shows the corrected code:

                movl       index, r2         ; Get the current index
        getlck: evax_ldql  r0, lockdata(r8)  ; and then the lock data.
                evax_cmovne r0, r3, r2       ; If nonzero, use special index.
                incl       r0                ; Increment lock count
                evax_stqc  r0, lockdata(r8)  ; and store it.
                tstl       r0                ; Did write succeed?
                beql       getlck            ; Retry if not.

6.4.3.5 Compiler Versions

This section contains information about versions of compilers that may generate noncompliant code sequences and the recommended versions to use when recompiling.

Table 6-1 contains information for OpenVMS compilers.

**Table 6-1 OpenVMS Compilers**
Old Version	Recommended Minimum Version
BLISS V1.1	BLISS V1.3
DEC C V5.x	Compaq C V6.0
DEC C++ V5.x	Compaq C++ V6.0
DEC Pascal V5.0-2	Compaq Pascal V5.1-11
MACRO-32 V3.0	V3.1 for OpenVMS Version 7.1-2; V4.1 for OpenVMS Version 7.2
MACRO-64 V1.2	See below.

Current versions of the MACRO-64 assembler may still encounter the loop rotation issue. However, MACRO-64 does not perform code optimization by default, and this problem occurs only when optimization is enabled. If SRM_CHECK indicates a noncompliant sequence in the MACRO-64 code, it should first be recompiled without optimization. If the sequence is still flagged when retested, the source code itself contains a noncompliant sequence that must be corrected.
