OpenVMS on HP Integrity servers

The Oracle Rdb database engine generates platform-specific executable code subroutines at run time. On VAX systems, VAX executable instructions are generated. On Alpha systems, Alpha executable instructions are generated. When Oracle Rdb was ported to the HP Integrity platform, the ability to execute run-time created subroutines was required as well. This paper discusses the background of the original interpretation implementation and the later transition to native Itanium instruction generation.

Code Generation

When a user’s request (such as the SQL statement “SELECT * FROM CUSTOMERS WHERE CITY = ‘ESPOO’ OR CITY = ‘SALIDA’ ORDER BY LAST_NAME”) is passed to the database engine, a number of executable subroutines are created, at run-time, to perform various request-specific functions. These functions may include copying data fields, performing null-field handling, doing data field comparisons, and so on. This run-time request-specific code is an integral part of Oracle Rdb’s database engine and helps to provide high levels of performance.

On VAX systems, such subroutines would contain VAX instructions (such as MOVC3, RET, MOVL and so on). When Oracle Rdb was ported to the OpenVMS and Tru64 environments for the Alpha platform, the code generation capabilities were extended to create Alpha instructions (such as CMOV, LDA, STQ, and so on). The logic of the subroutines, for the most part, is the same between the platforms; just the executable instructions, the register usage, and the system calling standard are different.

The following example contains a sequence of run-time generated instructions on a VAX system. Note the CISC architecture of the VAX computer, with its use of complex instructions that contain multiple operands and several addressing modes, along with relatively high code density.

The next example contains a sample of run-time generated code on an Alpha system. Note that the Alpha is a more traditional “RISC”-style architecture where instructions are simpler and of fixed size, and where the memory reference instructions either read or write memory but do not atomically read and update memory in a single instruction.

Interpretation engine

Oracle Rdb was ported to run on the Microsoft Windows NT environment running on Intel x86 and Alpha processors (this product was, however, never released for production use). At that point in time, in order to rapidly complete the porting effort for the Intel x86 platform, an interpretation engine was created that could interpret those portions of the Alpha instruction set generated at run-time by Oracle Rdb. This approach allowed a single piece of code (the interpretation engine) to be written and, more importantly, debugged without having to change the instruction generation machinery within the Oracle Rdb database engine (which continued to generate subroutines using the Alpha instruction set).

Over time, an expanded set of “rich” instructions was added to the code generation capabilities on the Intel x86 platform. These instructions were intended to perform more complex actions as one “pseudo” instruction, replacing, in some cases, a large number of Alpha instructions in the code stream. Execution of these “rich” instructions could be better optimized than executing long sequences of individual instructions one at a time. Approximately 150 of these “rich” instructions were eventually implemented.

Though not used on Alpha and VAX systems (the supported platforms for Oracle Rdb), this interpretation engine remained part of the Oracle Rdb source code and lay “dormant” for many years.

Itanium Emerges

With the advent of OpenVMS for the Integrity Server platform, Oracle chose to port the Rdb database engine to the Integrity Server for the OpenVMS operating system. Though native language compilers were available (primarily, in the case of Oracle Rdb, BLISS, C++, C and MACRO32), there was no immediate capability for the Oracle Rdb engine to create executable instructions for the Itanium architecture.

At this point, the interpretation engine was pressed into service again. Most of the code had not even been compiled in over 10 years. But with a bit of effort (mostly correcting issues related to improved C compilers with enhanced detection of latent bugs), it was able to execute successfully once again. Debugging effort was required to get it working completely properly, but it did prove to be a valuable tool that allowed a significantly more rapid production delivery and deployment of Oracle Rdb Release 7.2 on the Itanium platform.

Performance

Overall, we anticipated that, while the performance of the interpreted code would never be as good as a native executable code subroutine, the Integrity system, as a whole, would perform at least comparably to “equal” Alpha systems. This was the case for the vast majority of applications and systems that we analyzed. CPU, memory and IO performance tended to provide a balanced system that performs very well when running customer applications. A few applications, however, spent a significant amount of time executing the run-time generated code and these applications were, in some cases, significantly slower than we, and our customers, would have preferred.

In particular, one major customer application was generally as good or better performing on Itanium systems than on Alpha systems. But several significant queries of the application were both frequently executed and much slower once migrated from Alpha to Itanium. Analysis revealed that most of the additional CPU time was spent in the interpretation engine while running particular parts of the application.

A major effort was spent in analysis and tuning of the interpretation engine itself. This tuning yielded performance improvements of over 20% in some cases. This was, however, not nearly enough (and, regrettably, not even in the same order of magnitude required). Further analysis indicated that there was likely no way to make the interpretation model execute fast enough to meet our customers’ needs in all cases.

A New Direction

It was felt that the investment required to enhance the Oracle Rdb database engine to add another set of code generation capabilities (in addition to VAX, Alpha and “rich” instructions) for native Itanium would consume significant resources for development and testing and likely could not be completed in time for this particular customer’s production deployment schedule. There were too many locations in the code that would be required to be changed to produce instructions for yet another architecture. Our experience with the port to Alpha indicated that there would be substantial human resources required to produce and debug the resultant code.

Based on this analysis, the concept of “compiling”, at run time, a complete subroutine from a mixture of Alpha and “rich” instructions into native Itanium executable code was born. The design that we arrived at is not dissimilar to the Java machinery’s “Just In Time” (also known as JIT) compiler available on many platforms: input a stream of generic and platform-independent instructions and create platform-specific executable code which is expected to perform much better than interpreting the “pseudo” instructions.

Initial prototypes were developed to create and call an executable stream of Itanium instructions. The success of these tests supported the idea that it was viable for Oracle Rdb to be able to create native subroutines and to call such code at run time while on the Itanium platform.

High Level Design

The basic operation of what we originally called the “interp compiler” (based on the idea that this was a compiler to replace the interpretation engine) was to pass a pointer to a complete subroutine of compiled Alpha and “rich” code and then attempt to compile it completely into a native Itanium instruction subroutine. If the compilation was successful (all instructions could be compiled), a pointer to a procedure descriptor for the generated routine was returned with the low bit set (i.e., an odd value). If, however, the compilation could not be completed (because instructions were found that could not be compiled), the original routine address was returned (with the low bit clear, as the routine had originally been allocated on a longword boundary).

Later, when the subroutine was to be called, the low bit of the routine’s address was first evaluated. If clear, the existing interpretation engine was called to execute the subroutine. If the routine address was odd (indicating that the low bit was set), the routine was called directly (after clearing the low bit) to be executed “native”.
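The low-bit convention described above can be sketched in C. This is only an illustration under stated assumptions: the type and function names (proc_desc, tag_native, call_routine and so on) are hypothetical and do not come from the Oracle Rdb source, and the "interpreter" and "native routine" are trivial stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* All names here are hypothetical; this is not the Oracle Rdb source. */
typedef int (*native_fn)(void *ctx);

/* Stand-in for a procedure descriptor of a generated native routine. */
struct proc_desc {
    native_fn entry;
};

/* Mark a compiled routine by setting the low bit of its descriptor address. */
uintptr_t tag_native(const struct proc_desc *pd)
{
    uintptr_t addr = (uintptr_t)pd;
    assert((addr & 1u) == 0);   /* descriptors are at least 2-byte aligned */
    return addr | 1u;
}

/* An odd routine address means "compiled to native code". */
int is_native(uintptr_t routine)
{
    return (routine & 1u) != 0;
}

/* Stand-in for the interpretation engine: returns the first longword. */
int interpret(const uint32_t *code, void *ctx)
{
    (void)ctx;
    return (int)code[0];
}

/* Stand-in for a generated native routine. */
int native_answer(void *ctx)
{
    (void)ctx;
    return 42;
}

/* Dispatch on the low bit: native call (after clearing the bit) or interpret. */
int call_routine(uintptr_t routine, void *ctx)
{
    if (is_native(routine)) {
        const struct proc_desc *pd =
            (const struct proc_desc *)(routine & ~(uintptr_t)1);
        return pd->entry(ctx);
    }
    return interpret((const uint32_t *)routine, ctx);
}
```

Because both kinds of routine are reached through the same tagged value, a caller does not need to know whether compilation succeeded, which is what allows the compiler to be extended one instruction type at a time.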

In this way, the “interp compiler” could start small (only able to compile a few instruction types) and then grow (by adding the ability to compile more and more instructions and addressing modes and so on) all while the database engine continued to operate correctly (presumably as more and more subroutines could be compiled, execution performance would continue to improve). This made it possible to continue to execute and test Oracle Rdb while the “interp compiler” was being actively developed. Without this model it would have been a much slower process, in that the “interp compiler” would have had to be entirely complete before we could even begin to test it.

Itanium Architecture

Significant attributes of the Itanium architecture that pertain to the “interp compiler” include:

Producing code for the Itanium architecture is a fair measure more complex than, for example, code generation on the Alpha architecture. A significant set of rules and requirements is imposed in terms of which instruction type may be used in which bundle slot depending on the specified template, the use of “stops” to indicate that the results of prior instructions are required by following instructions, and so on. A larger number of registers provides major benefits by making more scratch registers available for intermediate results. And the use of predicates can, in some cases, drastically reduce the number of branches taken which can, in turn, improve performance by reducing “wasted” processor cycles due to “bubbles”.

Additional steps were required after creating code. Because the Itanium instruction cache (I cache) and data cache (D cache) are not synchronized, after new executable code is created, the hardware must be notified by flushing the instruction cache for the memory addresses of the newly created code. This was accomplished most easily by calling the SYS$PAL_IMB system service specifying the starting address and the length of the generated code. The system service invalidates each I cache line as needed and ensures that the data and instruction caches are correctly synchronized prior to attempting to execute the new instructions.

Starting Simple

The Oracle Rdb “interp compiler” is implemented as a routine written in BLISS (the primary implementation language utilized by Oracle Rdb for both ease of development and resultant product performance). Generation of instructions is accomplished through a set of macros that implement primitive operations that are generally produced as one or more instructions within one or more bundles. The original project goal was to have macros that would create one instruction per bundle. Over time, more complex macros were created to perform different functions and to create bundles with more instruction sequences to help produce faster and denser code streams.

A simple macro might, for example, produce a single ld4 instruction (to fetch 4 bytes from memory) alone in a single bundle (nop instructions would occupy the remaining two slots). Another level of complexity might be a single macro to create a pair of memory load and store instructions in a bundle. A more complex macro may implement a call sequence where output parameters are created, registers are saved, a procedure descriptor read, the routine called, and then after the call registers are restored. This sequence would require a modest number of bundles to implement.
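To make the notion of a bundle concrete, the following C sketch packs a 5-bit template and three 41-bit instruction slots into the 128-bit bundle layout defined by the Itanium architecture. The structure and function names are hypothetical, and the slot values in any usage are opaque placeholders rather than real instruction encodings; the actual BLISS macros are not shown in this paper.

```c
#include <assert.h>
#include <stdint.h>

/* Each of the three instruction slots is 41 bits wide. */
#define SLOT_MASK ((UINT64_C(1) << 41) - 1)

/* A 128-bit bundle held as two quadwords; lo holds bits 0..63. */
struct ia64_bundle {
    uint64_t lo;
    uint64_t hi;
};

/*
 * Layout: bits 0-4 template, 5-45 slot 0, 46-86 slot 1, 87-127 slot 2.
 * Slot 1 straddles the two quadwords (18 bits in lo, 23 bits in hi).
 */
struct ia64_bundle pack_bundle(unsigned tmpl, uint64_t s0,
                               uint64_t s1, uint64_t s2)
{
    struct ia64_bundle b;

    assert(tmpl < 32);
    assert(s0 <= SLOT_MASK && s1 <= SLOT_MASK && s2 <= SLOT_MASK);

    b.lo = (uint64_t)tmpl | (s0 << 5) | (s1 << 46);
    b.hi = (s1 >> 18) | (s2 << 23);
    return b;
}

/* Extract slot n (0, 1 or 2) from a bundle; used to check the packing. */
uint64_t bundle_slot(struct ia64_bundle b, int n)
{
    switch (n) {
    case 0:  return (b.lo >> 5) & SLOT_MASK;
    case 1:  return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    default: return (b.hi >> 23) & SLOT_MASK;
    }
}
```

A "one instruction per bundle" macro would, in these terms, call pack_bundle with the real instruction in one slot and nop encodings in the other two.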

Branches

One area of complexity is forward and backward branches within the code. The input subroutine may contain both “rich” and Alpha instructions that change the flow of control via conditional and unconditional branches. Branches are self-relative within the context of the input routine. To preserve the correct branch destination, a table is constructed that contains the address of the bundle containing the branch along with the original branch offset. Another table is maintained to associate the original instruction location along with the location of the generated code. After code generation is complete, branches are “fixed up” to adjust the destination offset to the correct destination bundle.
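The two-table scheme described above can be sketched in C as follows. The structure layouts and names are assumptions made for illustration: the location map is taken to be sorted by original offset, the fixup entry is assumed to hold the already-computed original target offset, and displacements are expressed as a count of bundles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Maps an original instruction offset to the bundle generated for it. */
struct loc_map {
    uint32_t orig_offset;
    uint32_t bundle_index;
};

/* One pending branch: where it was emitted and where it originally pointed. */
struct fixup {
    uint32_t branch_bundle;  /* bundle that contains the branch      */
    uint32_t orig_target;    /* target offset in the input routine   */
    int32_t  bundle_disp;    /* filled in by resolve_fixups()        */
};

/* Binary search of a map sorted by orig_offset; returns -1 if not found. */
long find_bundle(const struct loc_map *map, size_t n, uint32_t orig_offset)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].orig_offset < orig_offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && map[lo].orig_offset == orig_offset)
        return (long)map[lo].bundle_index;
    return -1;
}

/* After code generation, convert every original target to a bundle delta. */
int resolve_fixups(const struct loc_map *map, size_t nmap,
                   struct fixup *fx, size_t nfx)
{
    assert(nmap == 0 || map != NULL);

    for (size_t i = 0; i < nfx; i++) {
        long target = find_bundle(map, nmap, fx[i].orig_target);
        if (target < 0)
            return -1;       /* branch into the middle of an instruction */
        fx[i].bundle_disp = (int32_t)(target - (long)fx[i].branch_bundle);
    }
    return 0;
}
```

Deferring the fixups until the whole routine has been generated is what makes forward branches possible, since the destination bundle does not exist yet when the branch is emitted.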

For performance, the Itanium architecture includes “hints” for most types of branch instructions. These hints allow a compiler to indicate additional information to the hardware in regards to how branches are expected to execute. The hardware, in turn, may use this information to predict how the flow of control is expected to operate and can allocate resources more efficiently and, ultimately, execute the whole of the code stream faster. For example, one such branch “hint” type would be “dynamic, predicted not taken”. This hint implies that the compiler expects that the conditional branch will generally not be taken but the hardware should allocate prediction resources (such as a history of branches taken or not taken at this location).

Based on both performance testing and research papers evaluated, the “interp compiler” utilizes these branch hints in the generated instruction stream. Unconditional branches are specified as “static, taken”, and most conditional branches are specified as “dynamic, predicted not taken”. The exceptions to these rules are backward branches (typically involved in a loop), which are likely to be taken and are specified as “dynamic, predicted taken”.
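The hint policy just described amounts to a small decision function. The C sketch below is an illustration only: the function names are hypothetical, and it assumes that the backward-branch exception applies to conditional branches and that a negative displacement identifies a backward branch. The completer strings are the standard Itanium assembler spellings of the three hints.

```c
#include <assert.h>

enum branch_hint {
    HINT_SPTK,   /* static, taken                 */
    HINT_DPNT,   /* dynamic, predicted not taken  */
    HINT_DPTK    /* dynamic, predicted taken      */
};

/* Choose a hint from the branch kind and the sign of its displacement. */
enum branch_hint choose_hint(int is_conditional, long displacement)
{
    if (!is_conditional)
        return HINT_SPTK;
    return displacement < 0 ? HINT_DPTK : HINT_DPNT;
}

/* Assembler completer corresponding to each hint. */
const char *hint_completer(enum branch_hint hint)
{
    switch (hint) {
    case HINT_SPTK: return ".sptk";
    case HINT_DPNT: return ".dpnt";
    case HINT_DPTK: return ".dptk";
    }
    assert(!"unknown branch hint");
    return "";
}
```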

Exception Handling

In order to allow OpenVMS exception handling mechanisms to function properly, the “interp compiler” must “register” each generated routine with the operating system. This registration includes identifying any “unwind” information specifically regarding the routine’s first and last instruction, the length of the routine’s prologue, and register and stack usage. Because the code will be both created and executed in the processor’s executive mode, a kernel mode image exit handler is utilized to un-register the generated code during image run down. Without such unwind information registered with the operating system, exception handling is not possible: an exception from the generated code, or from code that is called by the generated code, cannot be handled and results in, depending on the mode and context, either process or image termination.

The OpenVMS calling standard uses a variant of the common Itanium standard which includes a moderately complex set of rules for representing unwind information. This scheme includes compressed variable-length fields and a dense structure. Within the interp compiler, this “signature” information is produced at the end of executable code creation for each routine.

Simple Code Sequence Examples

The following code sequence shows the original “rich” instruction (indicated by longword address and content fields at the left part of the line) CLR_Q (clear quadword) along with its single operand followed by the generated Itanium instructions (indicated by quadword content and instruction addresses) created for the “rich” instruction.

The operation’s addressing mode is evaluated as an offset from the global register r2 (this register maps to Alpha register R12 within Oracle Rdb). The offset is created by adding 8000 to the value 0040 and then sign extending from 16 to 64 bits. Next, r0 (which is always read as the value zero) is written to the destination address, thus clearing it.

It would obviously be possible to combine these two instructions into a single bundle. However, the construction of the interp compiler is such that the addressing steps are evaluated first and then the operation steps are produced. While it would be viable to perform a second pass to combine the instructions into a single bundle, it has not yet been a high priority for execution optimization.

An additional concern for such optimization is that significant amounts of time could be spent in the interp compiler that could exceed the potential benefits for performance improvements of the generated code. In this case, for example, the stall caused by the memory reference will dramatically overshadow any other optimizations possible for the two instructions which still require a stop between them (as the first updates r14 which is used as input to the second).
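The offset arithmetic in the CLR_Q example (adding hexadecimal 8000 to 0040 and sign extending from 16 to 64 bits) can be worked through in C. The function names are hypothetical; the sketch only reproduces the arithmetic and says nothing about how the generated instructions encode it.

```c
#include <assert.h>
#include <stdint.h>

/* Sign extend a 16-bit value to 64 bits without relying on casts
 * whose result is implementation-defined. */
int64_t sign_extend16(uint16_t value)
{
    return (int64_t)(value ^ 0x8000) - 0x8000;
}

/* Effective address: base register plus the sign-extended 16-bit offset. */
uint64_t operand_address(uint64_t base, uint16_t offset16)
{
    return base + (uint64_t)sign_extend16(offset16);
}

/* The worked example from the text: 8000 + 0040, then sign extended. */
void check_offset_example(void)
{
    uint16_t offset16 = (uint16_t)(0x8000 + 0x0040);

    assert(offset16 == 0x8040);
    assert(sign_extend16(offset16) == -32704);
    assert((uint64_t)sign_extend16(offset16) == UINT64_C(0xFFFFFFFFFFFF8040));
}
```

The low 32 bits of the sign-extended result are FFFF8040, so the operand lies 32704 bytes below the address held in the base register.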

In the next example, the MOV_Q (move quadword) “rich” instruction has two operands (source and destination address information). The source location is indicated as an offset (00000050) from a register (Alpha register R16 which is translated as Itanium register r32; the first input parameter to the routine as specified in the OpenVMS calling standard). The destination is an offset (0040) from register r2 (strictly, the offset is FFFF8040 from register R30). The interp compiler detects that both source and destination addresses are likely to be at least quadword aligned and produces a single ld8 instruction to read the source quadword and a single st8 instruction to write to the destination.

More Complex Examples

In the following example of generated code, the “rich” instruction MOV_NB_BR_CLR is used to move a null bit (an indication of a database field within a row containing a value) to a byte and then branch if the bit was clear (indicating in this case that the field was not null). Note that there are 4 operands to the “rich” instruction. The interp compiler turns this “rich” instruction into 8 Itanium instructions stored in three bundles.

The first two instructions add the offset 00CA to r32 (the first input parameter to the subroutine) and then fetch a byte from the resultant location. The next two instructions move the offset 3F8030 to r14 and then add r30 to r14 to result in the output address of the null bit.

The fifth and sixth instructions first extract the bit specified in the first operand of the rich instruction and then test the bit to determine if it is set or clear. The extr.u instruction extracts one bit from the specified position (4 in this case) and stores the result starting at bit 0 in the register r15. In the next instruction (tbit.z), the predicate register p6 will be set if the null bit is equal to zero and will be cleared if the null bit is not equal to zero.

Finally the resultant null byte is stored. If the null bit is clear (indicating that the database field has a value), a branch is to be taken. The branch displacement is a sign-extended 21-bit value indicating a number of longwords. Here, it is a forward branch of 4 longwords. In the instruction stream, if predicate p6 is true (which indicates that the null bit was not set), a relative branch is taken. Otherwise, if predicate p6 is false, the branch is not taken and execution continues at the first instruction of the following bundle.
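The behavior of MOV_NB_BR_CLR as described above can be summarized in a few lines of C. This is a sketch of the semantics only, not of the generated code: the function name and parameter order are assumptions, and the return value stands in for predicate p6 (nonzero meaning that the branch is taken).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Move one null bit to a byte and report whether the branch is taken.
 * Returns 1 when the bit is clear (branch taken), 0 when it is set.
 */
int mov_nb_br_clr(const uint8_t *null_bytes, unsigned bit_pos, uint8_t *dest)
{
    uint8_t bit;

    assert(bit_pos < 8);

    /* Like extr.u: extract a one-bit unsigned field down to bit 0. */
    bit = (uint8_t)((*null_bytes >> bit_pos) & 1u);

    /* Store the resulting null byte at the destination. */
    *dest = bit;

    /* Like tbit.z: the predicate is true when the bit is zero. */
    return bit == 0;
}
```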

Within the database environment, string operations (moving, changing and comparing) are common. The following example demonstrates the compiled code for the CMP_S “rich” instruction which is used to compare two fixed-length strings. The first operand is the number of bytes to compare. The second operand is the address of the first string and the third operand is the address of the second string. CMP_S returns either -1, 0 or 1 to the return status register (r8 which maps to Alpha R0) depending on the relationship (less than, equal, greater than) of the two strings.

The loop count application register ar.lc is used in conjunction with the br.cloop (branch counted loop) instruction to implement the main loop construct. Within the body of the loop, two bytes are fetched with a post increment of the source registers. Then the cmp.eq instruction is used to compare the values of the bytes for equality. Predicate register p7 is set if the comparison detects inequality. Unequal values result in a branch out of the loop. Otherwise (in the case of the bytes being equal to each other), a backwards branch is taken by the br.cloop instruction to the prior bundle to fetch the next bytes.

When the bytes are known not equal, they are compared to each other with the cmp.lt instructions. If the strings are equal (when the loop executes to completion and no different bytes had been detected), r8 remains as zero. If the last bytes fetched are not equal (indicating that the loop did not complete and a difference was found), r8 is set to either -1 or 1. Note that within the final two bundles, the comparisons are done in parallel (the instructions can execute simultaneously because they do not depend on each other) and then the two moves are executed in parallel. The moves to r8 can be executed simultaneously because at most one of them will produce a result because predicates p6 and p7 are mutually exclusive – in no case will both be set. It is possible that neither is set (when the strings are equal) and r8 will remain 0.
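The logic of the CMP_S sequence can be mirrored in C, including the two mutually exclusive predicates that guard the final moves to the result. The function name is hypothetical, and the sketch assumes that bytes are compared as unsigned values; the paper does not state this explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Compare two fixed-length strings; returns -1, 0 or 1. */
int cmp_s(size_t count, const unsigned char *s1, const unsigned char *s2)
{
    int result = 0;                  /* the result register starts as zero */
    unsigned char a = 0, b = 0;
    int p6, p7;

    assert(count == 0 || (s1 != NULL && s2 != NULL));

    /* Counted loop: fetch one byte from each string with post-increment
     * and leave the loop early on the first difference. */
    while (count-- > 0) {
        a = *s1++;
        b = *s2++;
        if (a != b)
            break;
    }

    /* Both comparisons are independent of each other; at most one
     * predicate is true, and neither is true when the strings are equal. */
    p6 = a < b;
    p7 = a > b;
    if (p6) result = -1;
    if (p7) result = 1;
    return result;
}
```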

In the case of the SET_T (set text) instruction, one or more bytes of a constant value are written to memory starting at a specified location. The interp compiler attempts to optimize these memory writes by performing overlapped operations and performing as few writes as possible by promoting the size of the memory reference based on the minimum alignment of the read and write stream pointers. Two pointers are used, offset by 8 bytes, to allow multiple st8 instructions to be executed in parallel. Post-increment instruction modes are used to update the output pointers in order to avoid additional instructions that would otherwise be required in order to increment the pointers. “Tail” writes of one, two or four bytes are used to complete the sequence.

If the byte count for the fill was larger, a loop would have been generated to perform the fill. In addition, the interp compiler produces, as needed, code to perform one, two or four byte writes prior to the loop and then again after the loop in order to align the output pointer on an 8 byte boundary so that as few memory writes as possible are created.
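The alignment strategy for larger fills (small writes to reach an 8-byte boundary, 8-byte writes in a loop, then small "tail" writes) can be sketched in C. The function name and the returned store count are illustrative assumptions; memcpy with a fixed size stands in for the st1, st2, st4 and st8 stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill count bytes at dst with value; returns the number of stores used. */
size_t fill_aligned(unsigned char *dst, unsigned char value, size_t count)
{
    uint64_t pattern = UINT64_C(0x0101010101010101) * value;
    size_t stores = 0;
    size_t size;

    assert(count == 0 || dst != NULL);

    /* Head: 1, 2 and 4 byte stores until dst is on an 8-byte boundary. */
    for (size = 1; size <= 4; size <<= 1) {
        if (((uintptr_t)dst & size) != 0 && count >= size) {
            memcpy(dst, &pattern, size);
            dst += size;
            count -= size;
            stores++;
        }
    }

    /* Body: aligned 8-byte stores. */
    while (count >= 8) {
        memcpy(dst, &pattern, 8);
        dst += 8;
        count -= 8;
        stores++;
    }

    /* Tail: 4, 2 and 1 byte stores for whatever remains. */
    for (size = 4; size >= 1; size >>= 1) {
        if ((count & size) != 0) {
            memcpy(dst, &pattern, size);
            dst += size;
            count -= size;
            stores++;
        }
    }
    return stores;
}
```

For a 19-byte fill that starts 3 bytes past an 8-byte boundary, this produces five stores (1, 4, 8, 4 and 2 bytes) instead of nineteen single-byte stores.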

Accessing Unaligned Data

Both the Alpha and Itanium systems impose a severe performance penalty when the processor attempts to perform an unaligned memory reference. An unaligned reference, for example, would be to attempt to fetch a longword (4 bytes) from a virtual address where the two lowest bits are not clear (i.e. not aligned on a 4 byte boundary). And the penalty on OpenVMS Itanium systems is significantly higher than it is on Alpha systems. Thus, avoiding alignment faults has an even greater benefit (for all processes on the system) on Itanium systems.

The “interp compiler” attempts to detect memory references that are not naturally aligned and produces a longer code sequence to perform the memory read or write operation without the overhead of an alignment fault.

For “rich” instructions, the assumption is made that register addresses are naturally aligned on quadword (8 byte) boundaries. Offset values can then be evaluated to determine if the resultant memory address is aligned or not. When an unaligned reference is predicted, a sequence of instructions can be generated to avoid the fault. For example, a load of a quadword that is located on a longword boundary can be accomplished by fetching the two longwords and then merging them together with the “mix4.r” instruction.
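The two-longword technique can be expressed in C as a pair of aligned 32-bit loads followed by a merge. This is a sketch under stated assumptions: the function name is hypothetical, little-endian longword order is assumed (the lower address holds the low half), and the shift-and-OR stands in for the merge step rather than reproducing the exact behavior of any one Itanium instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Load a quadword from an address that is only longword aligned,
 * using two aligned longword loads so no alignment fault can occur. */
uint64_t load_quad_longword_aligned(const uint32_t *p)
{
    uint64_t low, high;

    assert(((uintptr_t)p & 3u) == 0);   /* must be longword aligned */

    low  = p[0];                        /* first aligned 4-byte load  */
    high = p[1];                        /* second aligned 4-byte load */
    return low | (high << 32);          /* merge the two halves       */
}
```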

The following sequence (adapted from analysis of code generated by the HP GEM compiler backend) is a longword store where the destination is predicted to not be naturally aligned. The least significant bit of the address (presented in r3) is tested. If it is set, the address is byte aligned and p7 is set; otherwise the address is word aligned and p6 is set. In the case of byte alignment, a single byte is stored and the address is incremented (thus aligned on a word boundary) and the output value is shifted 8 bits to the right. A word is then stored and the output is shifted right 16 bits. Finally, if the original address was word aligned, the final word is written, otherwise the final byte is written. This sequence results in either two (for word alignment) or three (for byte alignment) memory writes.

Optimizations

In some situations, the generated Itanium code sequences will execute faster than corresponding sequences on Alpha. For example, in cases of filling or comparing a relatively few bytes of memory, the code generated for Itanium includes a sequence of memory stores or fetches in-line while the Alpha code calls to the operating system routines OTS$FILL or OTS$CMP variants. The overhead of the call in some instances will be greater than the actual memory references.

In other cases, the Itanium instruction set provides instructions that perform operations that require a sequence of instructions on the Alpha platform. For example, the “mux1@rev” instruction can be used to reverse the order of bytes within a quadword. Within Oracle Rdb on the Alpha platform, this operation is accomplished in a series of independent shift and mask instructions. This byte reversal is, for example, used when constructing index keys so performance is an important consideration as this may be a commonly executed sequence.
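For comparison, a generic shift-and-mask byte reversal of a quadword is shown below in C. It is one common formulation and is not claimed to be the sequence Oracle Rdb generates on Alpha; the function names are hypothetical. It illustrates how much work is replaced by a single byte-reversing instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the order of the eight bytes in a quadword using only
 * shifts, masks and ORs. */
uint64_t reverse_bytes(uint64_t v)
{
    /* Swap adjacent bytes. */
    v = ((v & UINT64_C(0x00FF00FF00FF00FF)) << 8) |
        ((v >> 8) & UINT64_C(0x00FF00FF00FF00FF));
    /* Swap adjacent 16-bit words. */
    v = ((v & UINT64_C(0x0000FFFF0000FFFF)) << 16) |
        ((v >> 16) & UINT64_C(0x0000FFFF0000FFFF));
    /* Swap the two 32-bit halves. */
    return (v << 32) | (v >> 32);
}

/* Worked example: byte 01 moves from the high end to the low end. */
void check_reverse_example(void)
{
    assert(reverse_bytes(UINT64_C(0x0102030405060708)) ==
           UINT64_C(0x0807060504030201));
}
```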

Optimization does tend to be a repetitive and, at least based on our observations, a never-ending process. Over time, code sequences are compressed and improved with a goal of reducing latency in regards to the CPU clock rate and memory access latencies.

For example, performance analysis both “by eye” and by processor cycle sampling led to reductions in code streams, often by “combining” addressing operands into a single bundle as in the following example; initially the operands (moving the address values to r14 and r15) would have required two bundles.

Instruction Execution Frequency

As part of a performance analysis sub-project, we created an instrumented interpretation engine that sent, via an OpenVMS mailbox, instruction execution information from all processes on a system to a separate collector process that captured instruction execution counts during a portion of an Oracle Rdb regression test run. The following table includes the top 20 instructions and the number of times each instruction was executed. In the table, the indication “RICH” indicates a “rich” instruction and “EVAX” indicates an Alpha instruction.

Instruction Mnemonic	Execution Count
RICH_MOV_B	60,974,337
RICH_MOV_NB_BR_CLR	60,537,063
RICH_MOV_L	47,099,034
RICH_MOV_Q	42,723,231
RICH_B_BR_SET	32,023,634
RICH_BRANCH	23,386,664
RICH_MOV_S	21,233,064
EVAX_LDA	21,169,625
EVAX_BIS	17,947,775
RICH_MOV_W	16,446,120
RICH_CMP_L	16,401,014
EVAX_JSR	15,925,377
RICH_MOV_B_BR_SET	14,723,867
RICH_EXE_ACTION	14,336,902
RICH_MOV_NBIT_BR_SET	14,138,224
EVAX_BR	13,827,371
RICH_OR_B_BR_SET	10,937,403
RICH_CNV_SORT_N	8,293,658
RICH_STALL	7,429,742
EVAX_LDAH	6,818,905

This data was, in turn, used as a guide for which instructions should be first considered for increased optimization by the interp compiler. The idea is that an instruction executed several times an hour has a marginal impact on performance as compared with an instruction executed thousands of times per second.

Database Performance Improvements

The creation and optimization of the Oracle Rdb “interp compiler” has been an iterative affair. Initial performance improvements from the interp compiler allowed applications running on Itanium systems to run at least as fast as on Alpha systems. Further optimizations (including reducing memory references, eliminating unneeded “stops”, avoiding alignment faults, and so on) have dramatically improved code quality and yielded even better performance. In some cases, application performance has improved by a factor of 3 due to the interp compiler generating native instructions.

And Oracle continues to measure and analyze performance of the Oracle Rdb database product family on the HP OpenVMS operating system for the Integrity Server platform. An extensive set of regression tests is continuously run in our development environment to help ensure correctness of the generated code. We are also in constant contact with our customer base to help understand their performance challenges. This input helps us decide where to focus our optimization efforts to everyone’s benefit.

Models and Examples Followed

A number of different resources were referenced in regards to code generation. In addition to the (voluminous) Intel documentation of the Itanium architecture, we also utilized the compiler machine code listings from high level language compilers on OpenVMS (for example, BLISS, C and MACRO32, which all use the GEM code generator, and the C++ compiler, which uses an Intel code generator).

Both the OpenVMS debugger and system dump analyzer include the ability to format an instruction stream which helped significantly when we were learning the intricate details of the Itanium architecture.

The OpenVMS listings include the MACRO2000 facility which implements the MACRO32 compiler. This was used in many cases as a template for code generation for complex Alpha instructions (such as ZAP and MSK). The internet also proved to be an excellent resource for example instruction streams and discussions of Itanium performance in regards to the use of the architecture.

The Intel and OpenVMS documentation was referenced extensively while we were creating the unwind information tables for generated code. And the OpenVMS calling standard manual was invaluable with regard to register usage rules.

Credit and Thanks

A large number of people devoted a great many hours to this project of developing the Oracle Rdb “Just In Time” code generator for the Itanium systems. It is not possible to remember or credit everyone who was involved. But special thanks and recognition are due to engineering members of both HP and Oracle including: John Reagan, Jeanie Leab, Guenther Froehlin, Greg Jordan, Christian Moser, Burns Fisher, Ian Smith, Martin Ramshaw, and Richard Bishop.

For more information

The Oracle Rdb web site is accessible on the internet at www.oracle.com/rdb. For more information about the Intel Itanium architecture and instruction set, visit www.intel.com. For more information about the HP OpenVMS system, visit www.hp.com and www.hp.com/go/openvms

HP OpenVMS Systems

The Oracle Rdb Run-Time Code Generator for the OpenVMS Itanium Platform

Introduction