 |
Introduction
The Oracle Rdb database engine generates platform-specific
executable code subroutines at run time. On VAX systems, VAX executable
instructions are generated. On Alpha systems, Alpha executable instructions are
generated. When Oracle Rdb was ported to the HP Integrity platform, the ability
to execute run-time created subroutines was required as well. This paper
discusses background of the original interpretation implementation with a later
transaction to native Itanium instruction generation.
Code Generation
When a user’s request (such as the SQL statement “SELECT *
FROM CUSTOMERS WHERE CITY = ‘ESPOO’ OR CITY = ‘SALIDA’ ORDER BY LAST_NAME”) is
passed to the database engine, a number of executable subroutines are created,
at run-time, to perform various request-specific functions. These
functions may include copying data fields, performing null-field handling, doing
data field comparisons, and so on. This run-time request-specific code is
an integral part of Oracle Rdb’s database engine and helps to provide high
levels of performance.
On VAX systems, such subroutines would contain VAX
instructions (such as MOVC3, RET, MOVL and so on). When Oracle Rdb was
ported to the OpenVMS and Tru64 environments for the Alpha platform, the code
generation capabilities were extended to create Alpha instructions (such as CMOV,
LDA, STQ, and so on). The logic of the subroutines, for the most part, is
the same between the platforms; just the executable instructions, the register
usage, and the system calling standard are different.
The following example contains a sequence of run-time
generated instructions on a VAX system. Note the CISC architecture of the
VAX computer with use of complex instructions that contain multiple operands
along several addressing modes along with relatively high code density.
00FB1B18: .WORD ^M<R2,R3,R4,R5,R6,R7>
00FB1B1A: MOVL B^04(AP), R6
00FB1B1E: MOVAB @#00FB1B68, R7
00FB1B25: MOVC5 S^#00, (SP), (SP), S^#05, B^01(R7)
00FB1B2C: MOVL B^69(R6), R0
00FB1B30: ROTL S^#08, R0, R0
00FB1B34: BICL3 #FF00FF00, R0, R1
00FB1B3C: XORL2 R1, R0
00FB1B3F: ROTL S^#10, R0, R0
00FB1B43: XORB2 #80, R0
00FB1B47: XORL3 R1, R0, B^02(R7)
00FB1B4C: CLRB B^01(R7)
00FB1B4F: BBC S^#02, W^00EF(R6), 00FB1B5C
00FB1B55: XORB2 S^#01, B^01(R7)
00FB1B59: CLRL B^02(R7)
00FB1B5C: RET
The next example contains a sample of run-time generated
code on an Alpha system. Note that the Alpha is a more traditional “RISC”-style
architecture where instructions are simpler, fixed size, and that the memory
reference instructions either read or write memory, but do not atomically read
and update memory in a single instruction.
01A060AC A00C8138 LDL R0,#XF8138(R12)
01A060B0 A02C808C LDL R1,#XF808C(R12)
01A060B4 44010400 BIS R0,R1,R0
01A060B8 F4000004 BNE R0,#X0000004 -> 1A060CC
01A060BC A00C8064 LDL R0,#XF8064(R12)
01A060C0 A02C813C LDL R1,#XF813C(R12)
01A060C4 400105A0 CMPEQ R0,R1,R0
01A060C8 F400000D BNE R0,#X000000D -> 1A06100
01A060CC A00C8110 LDL R0,#XF8110(R12)
01A060D0 A02C8090 LDL R1,#XF8090(R12)
01A060D4 44010400 BIS R0,R1,R0
01A060D8 F400000F BNE R0,#X000000F -> 1A06118
01A060DC A36B0144 LDL R27,#X00144(R11)
01A060E0 220C8118 LDA R16,#XF8118(R12)
01A060E4 223F001F LDA R17,#X0001F(R31)
01A060E8 224C8068 LDA R18,#XF8068(R12)
01A060EC 233F0003 LDA R25,#X00003(R31)
01A060F0 A75B0050 LDQ R26,#X00050(R27)
01A060F4 A77B0058 LDQ R27,#X00058(R27)
01A060F8 6B5A4000 JSR R26,(R26) "ots$strcmp_eqls"
01A060FC E4000006 BEQ R0,#X0000006 -> 1A06118
01A06100 201F0001 LDA R0,#X00001(R31)
01A06104 A75E0008 LDQ R26,#X00008(SP)
01A06108 A7BE0010 LDQ FP,#X00010(SP)
01A0610C 23DE0020 LDA SP,#X00020(SP)
01A06110 6BFA8001 RET R31,(R26),1
01A06114 00000000 CALL_PAL HALT
01A06118 47FF0400 CLR R0
01A0611C C3FFFFF9 BR #XFFFFFF9 -> 1A06104
Interpretation engine
Oracle Rdb was ported to run on the Microsoft Windows NT environment running
on Intel x86 and Alpha processors (this product was, however, never released for
production use). At that point in time, in order to rapidly complete the porting
effort for the Intel x86 platform, an interpretation engine was created that
could interpret those portions of the Alpha instruction set generated at
run-time by Oracle Rdb. This approach allowed a single piece of code (the
interpretation engine) to be written and, more importantly, debugged without
having to change the instruction generation machinery within the Oracle Rdb
database engine (which continued to generate subroutines using the Alpha
instruction set).
Over time, an expanded set of “rich” instructions were added to the code
generation capabilities on the Intel x86 platform. These instructions were
intended to perform more complex actions as one “pseudo” instruction, replacing,
in some cases, a large number of Alpha instructions in the code stream.
Execution of these “rich” instructions could be more optimized as compared to
individually executing long sequences of individual instructions. Approximately
150 of these “rich” instructions were eventually implemented.
Though not used on Alpha and VAX systems (the supported platforms for Oracle
Rdb), this interpretation engine remained part of the Oracle Rdb source code and
lay “dormant” for many years.
Itanium Emerges
With the advent of OpenVMS for the Integrity Server platform, Oracle chose to
port the Rdb database engine to the Integrity Server for the OpenVMS operating
system. Though native language compilers were available (primarily, in the case
of Oracle Rdb, BLISS, C++, C and MACRO32), there was no immediate capability for
the Oracle Rdb engine to create executable instructions for the Itanium
architecture.
At this point, the interpretation engine was pressed in to service again. Most
of the code had not even been compiled in over 10 years. But with a bit of
effort (mostly correcting issues related to improved C compilers with enhanced
detection for latent bugs), it was able to successfully execute once again.
Debugging effort was required to get it working completely properly but it did
prove to be a valuable tool that allowed a significantly more rapid production
delivery and deployment of Oracle Rdb Release 7.2 on the Itanium platform.
Performance
Overall, we anticipated that, while the performance of the interpreted code
would never be as good as a native executable code subroutine, the Integrity
system, as a whole, would perform at least comparable to “equal” Alpha systems.
This was the case for the vast majority of applications and systems that we
analyzed. CPU, memory and IO performance tended to provide a balanced system
that performs very well when running customer applications. A few applications,
however, spent a significant amount of time executing the run-time generated
code and these applications were, in some cases, significantly slower than we,
and our customers, would have preferred.
In particular, one major customer application was generally as good or better
performing on Itanium systems than on Alpha systems. But several significant
queries of the application were both frequently executed and much slower once
migrated from Alpha to Itanium. Analysis revealed that most of the additional
CPU time was spent in the interpretation engine while running particular parts
of the application.
A major effort was spent in analysis and tuning of the interpretation engine
itself. This tuning yielded performance improvements of over 20% in some cases.
This was, however, not nearly enough (and, regrettably, not even in the same
order of magnitude required). Further analysis indicated that there was likely
no way to make the interpretation model execute fast enough to meet our
customers’ needs in all cases.
A New Direction
It was felt that the investment required to enhance the Oracle Rdb database
engine to add another set of code generation capabilities (in addition to VAX,
Alpha and “rich” instructions) for native Itanium would consume significant
resources for development and testing and likely could not be completed in time
for this particular customer’s production deployment schedule. There were too
many locations in the code that would be required to be changed to produce
instructions for yet another architecture. Our experience with the port to Alpha
indicated that there would be substantial human resources required to produce
and debug the resultant code.
Based on this analysis, the concept of “compiling”, at run time, a complete
subroutine from a mixture of Alpha and “rich” instructions in to native Itanium
executable code was born. The design that we arrived at is not dissimilar to the
JAVA machinery’s “Just In Time” (also known as JIT) compiler available on many
platforms: Input a stream of generic and platform-independent instructions and
create platform-specific executable code which is expected to perform much
better than interpreting the “pseudo” instructions.
Initial prototypes were developed to create and call an executable stream of
Itanium instructions. The success of these tests supported the idea that it was
viable for Oracle Rdb to be able to create native subroutines and to call such
code at run time while on the Itanium platform.
High Level Design
The basic operation for what we originally called the “interp compiler”
(based on the idea that this was a compiler to replace the interpretation
engine) was to pass a pointer to a complete subroutine of compiled Alpha and
“rich” code and then attempt to compile it completely in to a native Itanium
instruction subroutine. If the compilation was successful (all instructions were
able to be compiled) then a pointer to a procedure descriptor for the generated
routine was returned with the low bit set (i.e., an odd value). If, however, the
compilation could not be complete (if instructions were found that were not able
to be compiled), the original routine address was returned (with the low bit
clear as the routine had been originally allocated on a longword boundary).
Later, when the subroutine was to be called, the low bit of the routine’s
address was first evaluated. If clear, the existing interpretation engine was
called to execute the subroutine. If the routine address was odd (indicating
that the low bit was set), the routine was called directly (after clearing the
low bit) to be executed “native”.
In this way, the “interp compiler” could start small (only able to compile a
few instruction types) and then grow (by adding the ability to compile more and
more instructions and addressing modes and so on) all while the database engine
continued to operate correctly (presumably as more and more subroutines could be
compiled, execution performance would continue to improve). This made it
possible to continue to execute and test Oracle Rdb while the “interp compiler”
was being actively developed. Without this model it was have been a much slower
process in that the “interp compiler” would have had to be entirely complete
before we could even begin to test it.
Itanium Architecture
Significant attributes of the Itanium architecture that pertain to the
“interp compiler” include:
-
128 bit “bundle” containing 3 instructions that will
all be executed
-
Multiple execution unit type combination selected via a
“template” within a bundle
-
Predicate registers that control if an instruction will
have an effect or not
Producing code for the Itanium architecture is a fair
measure more complex than, for example, code generation on the Alpha
architecture. A significant set of rules and requirements are imposed in terms
of which instruction type may be used in which bundle slot depending on the
specified template, the use of “stops” to indicate that the results of prior
instructions are required by following instructions, and so on. A larger number
of registers provides major benefits in regards to having more scratch registers
available for intermediate results. And the use of predicates can, in some
cases, drastically reduce the number of branches taken which can, in turn,
improve performance by reducing “wasted” processor cycles due to “bubbles”.
Additional steps were required after creating code. Because
the Itanium instruction cache (I cache) and data cache (D cache) are not
synchronized, after new executable code is created, the hardware must be
notified by flushing the instruction cache for the memory addresses of the newly
created code. This was accomplished most easily by calling the SYS$PAL_IMB
system service specifying the starting address and the length of the generated
code. The system service invalidates each I cache line as needed and ensures
that the data and instruction caches are correctly synchronized prior to
attempting to execute the new instructions.
Starting Simple
The Oracle Rdb “interp compiler” is implemented as a
routine written in BLISS (the primary implementation language utilized by Oracle
Rdb for both ease of development and resultant product performance). Generation
of instructions is accomplished though a set of macros that implement primitive
operations that are generally produced as one or more instructions within one or
more bundles. The original project goal was to have macros that would create one
instruction per bundle. Over time, more and complex macros were created to
perform different functions and to create bundles with more instruction
sequences to help produce faster and denser code streams.
A simple macro might, for example, produce a single ld4
instruction (to fetch 4 bytes from memory) alone in a single bundle (nop
instructions would occupy the remaining two slots). Another level of complexity
might be a single macro to create a pair of memory load and store instructions
in a bundle. A more complex macro may implement a call sequence where output
parameters are created, registers are saved, a procedure descriptor read, the
routine called, and then after the call registers are restored. This sequence
would require a modest number of bundles to implement.
Branches
One area of complexity is forward and backward branches
within the code. The input subroutine may contain both “rich” and Alpha
instructions that change the flow of control via conditional and unconditional
branches. Branches are self-relative within the context of the input routine. To
preserve the correct branch destination, a table is constructed that contains
the address of the bundle containing the branch along with the original branch
offset. Another table is maintained to associate the original instruction
location along with the location of the generated code. After code generation is
complete, branches are “fixed up” to adjust the destination offset to the
correct destination bundle.
For performance, the Itanium architecture includes “hints”
for most types of branch instructions. These hints allow a compiler to indicate
additional information to the hardware in regards to how branches are expected
to execute. The hardware, in turn, may use this information to predict how the
flow of control is expected to operate and can allocate resources more
efficiently and, ultimately, execute the whole of the code stream faster. For
example, one such branch “hint” type would be “dynamic, predicted not taken”.
This hint implies that the compiler expects that the conditional branch will
generally not be taken but the hardware should allocate prediction resources
(such a history of branches taken or not taken at this location).
Based on both performance testing and research papers
evaluated, the “interp compiler” utilizes these branch hints in the generated
instruction stream. Unconditional branches are specified as “static, taken”,
most conditional branches are specified as “dynamic, predicted not taken”.
Exceptions to these rules are backward (typically involved in a loop) branches
which likely are specified as “dynamic predicted taken”.
Exception Handling
In order to allow OpenVMS exception handling mechanisms to
function properly, the “interp compiler” must “register” each generated routine
with the operating system. This registration includes identifying any “unwind”
information specifically regarding the routine’s first and last instruction, the
length of the routine’s prologue and registers and stack usage. Because the
created code will be both created and executed in the processor’s executive
mode, a kernel mode image exit handler is utilized to un-register the generated
code during image run down. Without having such unwind information registered
with the operating system, exception handling is not possible; otherwise an
exception from the generated code, or code that is called by the generated code,
cannot be handled and results in, depending on the mode and context, either
process or image termination.
The OpenVMS calling standard uses a variant of the common
Itanium standard which includes a moderately complex set of rules for
representing unwind information. This scheme includes a compressed
variable-length fields and a dense structure. Within the interp compiler, this
“signature” information is produced at the end of executable code creation for
each routine.
Simple Code Sequence Examples
The following code sequence shows the original “rich”
instruction (indicated by longword address and content fields at the left part
of the line) CLR_Q (clear quadword) along with its single operand followed by
the generated Itanium instructions (indicated by quadword content and
instruction addresses) created for the “rich” instruction.
The operation’s addressing mode is evaluated as an offset
from the global register r2 (this register maps to Alpha register R12 within
Oracle Rdb). The offset is created by adding 8000 to the value 0040 and then
sign extending from 16 to 64 bits. Next, r0 (which is always read as the value
zero) is written to the destination address, thus clearing it.
It would obviously be possible to combine these two instructions in to a single
bundle. However, the construction of the interp compiler is such that the
addressing steps are evaluated first and then the operation steps are produced.
While it would be viable to perform a second pass to combine the instructions in
to a single bundle, it has not yet been a high priority for execution
optimization.
An additional concern for such optimization is that
significant amounts of time could be spent in the interp compiler that could
exceed the potential benefits for performance improvements of the generated
code. In this case, for example, the stall caused by the memory reference will
dramatically overshadow any other optimizations possible for the two
instructions which still require a stop between them (as the first updates r14
which is used as input to the second).
s037E54B4 04000157 CLR_Q
037E54B8 D0000040 dst quad*
{ .mfi
013807E80380 0000000080366190 add r14 = 3F8040, r2
000008000000 0000000080366191 nop.f 000000
000008000000 0000000080366192 nop.i 000000 ;; }
{ .mfi
008CC0E00000 00000000803661A0 st8 [r14] = r0
000008000000 00000000803661A1 nop.f 000000
000008000000 00000000803661A2 nop.i 000000 ;; }
In the next example, the MOV_Q (move quadword) “rich”
instruction has two operands (source and destination address information). The
source location is indicated as an offset (00000050) from a register (Alpha
register R16 which is translated as Itanium register r32; the first input
parameter to the routine as specified in the OpenVMS calling standard). The
destination is an offset (0040) from register r2 (strictly, the offset is
FFFF8040 from register R30). The interp compiler detects that both source and
destination addresses are likely to be at least quadword aligned and produces a
single ld8 instruction to read the source quadword and a single st8 instruction
to write to the destination.
037E54C4 04000145 MOV_Q
037E54C8 00000010 src char* (R16)
037E54CC 00000050 offset long
037E54D0 D0000040 dst quad*
{ .mfi
0108020A0380 00000000803661C0 add r14 = 0050, r32
000008000000 00000000803661C1 nop.f 000000
000008000000 00000000803661C2 nop.i 000000 }
{ .mfi
013807E803C0 00000000803661D0 add r15 = 3F8040, r2
000008000000 00000000803661D1 nop.f 000000
000008000000 00000000803661D2 nop.i 000000 ;; }
{ .mmi
0080C0E00400 00000000803661E0 ld8 r16 = [r14] ;;
008CC0F20000 00000000803661E1 st8 [r15] = r16
000008000000 00000000803661E2 nop.i 000000 ;; }
More Complex Examples
In the following example generated code, the “rich” instruction MOV_NB_BR_CLR
is used to move a null bit (an indication of a database field within a row
containing a value) to a byte and then branch if the bit was clear (indicating
in this case that the field was not null). Note that there are 4 operands to the
“rich” instruction. The interp compiler turns this “rich” instruction in to 8
Itanium instructions stored in three bundles.
The first two instructions add the offset 00CA to r32 (the first input
parameter to the subroutine) and then fetch a byte from the resultant location.
The next two instructions move the offset 3F8030 to r14 and then add r30 to r14
to result in the output address of the null bit.
The fifth and sixth instructions first extract the bit specified in the first
operand of the rich instruction and then test the bit to determine if it is set
or clear. The extr.u instruction extracts one bit from the specified position (4
in this case) and stores the result starting at bit 0 in the register r15. In
the next instruction (tbit.z), the predicate register p6 will be set if the null
bit is equal to zero and will be cleared if the null bit is not equal to zero.
Finally the resultant null byte is stored. If the null bit is clear
(indicating that the database field has a value), a branch is to be taken. The
branch displacement is a sign-extended 21-bit value indicating a number of
longwords. Here, it is a forward branch of 4 longwords. In the instruction
stream, if predicate p6 is true (which indicates that the null bit was not set),
a relative branch is taken. Otherwise, if predicate p6 is false, the branch is
not taken and execution continues at the first instruction of the following
bundle.
037E546C 0400019E MOV_NB_BR_CLR
037E5470 00000004 bitNum ubyte
037E5474 000000CA nulByt ulong
037E5478 D0000030 res ubyte*
037E547C FFE00004 brOff ulong (037E5490)
{ .mmi
01080A0943C0 00000000803660D0 add r15 = 00CA, r32 ;;
008000F003C0 00000000803660D1 ld1 r15 = [r15]
013807C60380 00000000803660D2 mov r14 = 3F8030 ;; }
{ .mii
010001E1C380 00000000803660E0 add r14 = r14, r30
00A400F103C0 00000000803660E1 extr.u r15 = r15, 04, 01 ;;
00A038F00180 00000000803660E2 tbit.z p6, p7 = r15, 00 }
{ .mfb
008C00E1E000 00000000803660F0 st1 [r14] = r15
000008000000 00000000803660F1 nop.f 000000
008400009006 00000000803660F2 (p6) br.cond.dptk.many 0000040 ;; }
Within the database environment, string operations (moving, changing and
comparing) are common. The following example demonstrates the compiled code for
the CMP_S “rich” instruction which is used to compare two fixed the length
strings. The first operand is the number of bytes to compare. The second operand
is the address of the first string and the third operand is the address of the
second string. CMP_S returns either -1, 0 or 1 to the return status register (r8
which maps to Alpha R0) depending on the relationship (less than, equal, greater
than) of the two strings.
The loop count application register ac.lc is used in conjunction with the
br.cloop (branch counted loop) instruction to implement the main loop construct.
Within the body of the loop, two bytes are fetched with a post increment of the
source registers. Then the cmp.eq instruction is used to compare the values of
the bytes for equality. Predicate register p7 is set if the comparison detects
inequality. Un-equal values result in a branch out of the loop. Otherwise (in
the case of the bytes being equal to each other), a backwards branch is taken by
the br.cloop instruction to the prior bundle to fetch the next bytes.
When the bytes are known not equal, they are compared to each other with the
cmp.lt instructions. If the strings are equal (when the loop executes to
completion and no different bytes had been detected), r8 remains as zero. If the
last bytes fetched are not equal (indicating that the loop did not complete and
a difference was found), r8 is set to either -1 or 1. Note that within the final
two bundles, the comparisons are done in parallel (the instructions can execute
simultaneously because they do not depend on each other) and then the two moves
are executed in parallel. The moves to r8 can be executed simultaneously because
at most one of them will produce a result because predicates p6 and p7 are
mutually exclusive – in no case will both be set. It is possible that neither is
set (when the strings are equal) and r8 will remain 0.
037F22FC 0400013B CMP_S
037F2300 0000001F srcLen uword
037F2304 D0000288 src1 byte*
037F2308 025B5950 src2 byte*
{ .mfi
000008000000 0000000080372090 nop.m 000000
000008000000 0000000080372091 nop.f 000000
00005413C000 0000000080372092 mov.i ar.lc = 1E }
{ .mfi
01382FE10380 00000000803720A0 add r14 = 3F8288, r2
000008000000 00000000803720A1 nop.f 000000
000008000000 00000000803720A2 nop.i 000000 }
{ .mlx
000008000000 00000000803720B0 nop.m 000000
000000000009 00000000803720B1 movl r15 = 00000000025B5950 ;;
00C596CA03C0 }
{ .mmi
00A000E02400 00000000803720C0 ld1 r16 = [r14], 001
00A000F02440 00000000803720C1 ld1 r17 = [r15], 001
000008000000 00000000803720C2 nop.i 000000 ;; }
{ .mbb
01C039120180 00000000803720D0 cmp.eq p6, p7 = r16, r17
008600003007 00000000803720D1 (p7) br.cond.dpnt.many 0000010
0091FFFFE140 00000000803720D2 br.cloop.sptk.few 1FFFFF0 ;; }
{ .mii
010800000200 00000000803720E0 mov r8 = r0
018001120180 00000000803720E1 cmp.lt p6, p0 = r16, r17
0180010221C0 00000000803720E2 cmp.lt p7, p0 = r17, r16 ;; }
{ .mfi
013FFFCFE206 00000000803720F0 (p6) mov r8 = 3FFFFF
000008000000 00000000803720F1 nop.f 000000
012000002207 00000000803720F2 (p7) mov r8 = 000001 ;; }
In the case of the SET_T (set text) instruction, one or more bytes of a
constant value are written to memory starting at a specified location. The
interp compiler attempts to optimize these memory writes by performing
overlapped operations and performing as few writes as possible by promoting the
size of the memory reference based on the minimum alignment of the read and
write stream pointers. Two pointers are used, offset by 8 bytes, to allow
multiple st8 instructions to be executed in parallel. Post-increment instruction
modes are used to update the output pointers in order to avoid additional
instructions that would otherwise be required in order to increment the
pointers. “Tail” writes of one, two or four bytes are used to complete the
sequence.
If the byte count for the fill was larger, a loop would have been generated
to perform the fill. In addition, the interp compiler produces, as needed, code
to perform one, two or four byte writes prior to the loop and then again after
the loop in order to align the output pointer on an 8 byte boundary so that as
few memory writes as possible are created.
037E5550 04000151 SET_T
037E5554 00000020 src byte
037E5558 D0000060 dst char* #XFFFF8060(R12)
037E555C 0000001F dstLen word
{ .mfi
013807EC0380 0000000080366340 add r14 = 3F8060, r2
000008000000 0000000080366341 nop.f 000000
000008000000 0000000080366342 nop.i 000000 ;; }
{ .mlx
010800E10400 0000000080366350 add r16 = 0008, r14
008080808080 0000000080366351 movl r15 = 2020202020202020 ;;
00C2002403C0 }
{ .mmi
00ACC0E1E400 0000000080366360 st8 [r14] = r15, 010
00ACC101E400 0000000080366361 st8 [r16] = r15, 010
000008000000 0000000080366362 nop.i 000000 ;; }
{ .mmi
00ACC0E1E200 0000000080366370 st8 [r14] = r15, 008 ;;
00AC80E1E100 0000000080366371 st4 [r14] = r15, 004
000008000000 0000000080366372 nop.i 000000 ;; }
{ .mmi
00AC40E1E080 0000000080366380 st2 [r14] = r15, 002 ;;
00AC00E1E040 0000000080366381 st1 [r14] = r15, 001
000008000000 0000000080366382 nop.i 000000 ;;
Accessing Unaligned Data
Both the Alpha and Itanium systems impose a severe performance penalty when
the processor attempts to perform an unaligned memory reference. An unaligned
reference, for example, would be to attempt to fetch a longword (4 bytes) from a
virtual address where the two lowest bits are not clear (i.e. not aligned on a 4
byte boundary). And the penalty on OpenVMS Itanium systems is significantly
higher than it is on Alpha systems. Thus, avoiding alignment faults has an even
greater benefit (for all processes on the system) on Itanium systems.
The “interp compiler” attempts to detect memory references that are not
naturally aligned and produces a longer code sequence to perform the memory read
or write operation without the overhead of an alignment fault.
For “rich” instructions, the assumption is made that register addresses are
naturally aligned on quadword (8 byte) boundaries. Offset values can be then
evaluated to determine if the resultant memory address is aligned or not. When
an unaligned reference is predicted, a sequence of instructions can be generated
to avoid the fault. For example, a load of a quadword that is located on a
longword boundary an be accomplished by fetching the two longwords and then
merging them together with the “mix4.r” instruction:
{ .mmi
013807CF8E40 000000008033E180 mov r57 = 3F807C ;;
000008000000 000000008033E181 nop.m 000000
01000393CE40 000000008033E182 add r57 = r30, r57 ;; }
{ .mmi
00A083908900 000000008033E190 ld4 r36 = [r57], 004 ;;
008083900380 000000008033E191 ld4 r14 = [r57]
000008000000 000000008033E192 nop.i 000000 ;; }
{ .mfi
000008000000 000000008033E1A0 nop.m 000000
000008000000 000000008033E1A1 nop.f 000000
00F88241C380 000000008033E1A2 mix4.r r14 = r14, r36 ;; }
The following sequence (adapted from analysis of code generated by the HP GEM
compiler backend) is a longword store where the destination is predicted to not
be naturally aligned. The least significant bit of the address (presented in r3)
is tested. If it is set, the address is byte aligned and p7 is set; otherwise
the address is word aligned and p6 is set. In the case of byte alignment, a
single byte is stored and the address is incremented (thus aligned on a word
boundary) and the output value is shifted 8 bits to the right. A word is then
stored and the output is shifted right 16 bits. Finally, if the original address
was word aligned, the final word is written, otherwise the final byte is
written. This sequence results in either two (for word alignment) or three (for
byte alignment) memory writes.
00A072000180 0611 tbit.z pr6, pr7 = r3, 0
00AC0031004E 0621 (pr7) st1 [r3] = r8, 1
00A5B882020E 0622 (pr7) shr.u r8 = r8, 8
00AC40310080 0630 st2 [r3] = r8, 2
00A578840200 0631 shr.u r8 = r8, 16
008C40310006 0640 (pr6) st2 [r3] = r8
008C0031000E 0641 (pr7) st1 [r3] = r8
Optimizations
In some situations, the generated Itanium code sequences will execute faster
than corresponding sequences on Alpha. For example, in cases of filling or
comparing a relatively few bytes of memory, the code generated for Itanium
includes a sequence of memory stores or fetches in-line while the Alpha code
calls to the operating system routines OTS$FILL or OTS$CMP variants. The
overhead of the call in some instances will be greater than the actual memory
references.
In other cases, the Itanium instruction set provides instructions that
perform operations that require a sequence of instructions on the Alpha
platform. For example, the “mux1@rev” instruction can be used to reverse the
order of bytes within a quadword. Within Oracle Rdb on the Alpha platform, this
operation is accomplished in a series of independent shift and mask
instructions. This byte reversal is, for example, used when constructing index
keys so performance is an important consideration as this may be a commonly
executed sequence.
Optimization does tend to be a repetitive and, at least based on our
observations, a never-ending process. Over time, code sequences are compressed
and improved with a goal of reducing latency in regards to the CPU clock rate
and memory access latencies.
For example performance analysis both “by eye” and by processor cycle
sampling lead to reductions in code steams by often “combining” addressing
operands in to a single bundle as in the following example; initially the
operands (moving the address values to r14 and r15) would have required two
bundles.
06CE1B68 04000148 MOV_B
06CE1B6C D00002D0 src byte*
06CE1B70 D00010F1 dst byte*
{ .mfi
01382FEA0380 00000000805326F0 add r14 = 3F82D0, r2
000008000000 00000000805326F1 nop.f 000000
01390FEE23C0 00000000805326F2 add r15 = 3F90F1, r2 ;; }
{ .mmi
008000E00400 0000000080532700 ld1 r16 = [r14] ;;
008C00F20000 0000000080532701 st1 [r15] = r16
000008000000 0000000080532702 nop.i 000000 ;; }
Instruction Execution Frequency
As part of a performance analysis sub-project, we created an instrumented
interpretation engine that sent, via an OpenVMS mailbox, instruction execution
information from all processes on a system to a separate collector process that
captured instruction execution counts during a portion of an Oracle Rdb
regression test run. The following table includes the top 20 instructions and
the number of times each instruction was executed. In the table, the indication
“RICH” indicates a “rich” instruction and “EVAX” indicates an Alpha instruction.
Instruction Mnemonic |
Execution Count |
RICH_MOV_B |
60,974,337 |
RICH_MOV_NB_BR_CLR |
60,537,063 |
RICH_MOV_L |
47,099,034 |
RICH_MOV_Q |
42,723,231 |
RICH_B_BR_SET |
32,023,634 |
RICH_BRANCH |
23,386,664 |
RICH_MOV_S |
21,233,064 |
EVAX_LDA |
21,169,625 |
EVAX_BIS |
17,947,775 |
RICH_MOV_W |
16,446,120 |
RICH_CMP_L |
16,401,014 |
EVAX_JSR |
15,925,377 |
RICH_MOV_B_BR_SET |
14,723,867 |
RICH_EXE_ACTION |
14,336,902 |
RICH_MOV_NBIT_BR_SET |
14,138,224 |
EVAX_BR |
13,827,371 |
RICH_OR_B_BR_SET |
10,937,403 |
RICH_CNV_SORT_N |
8,293,658 |
RICH_STALL |
7,429,742 |
EVAX_LDAH |
6,818,905 |
This data was, in turn, used as a guide for which instructions should be
first considered for increased optimization by the interp compiler. The idea is
that an instruction executed several times an hour has a marginal impact on
performance as compared with an instruction executed thousands of times per
second.
Database Performance Improvements
The creation and optimization of the Oracle Rdb “interp compiler” has been an
iterative affair. Initial performance improvements from the interp compiler
allowed applications running on Itanium systems to run at least as fast as on
Alpha systems. Further optimizations (including reducing memory references,
eliminating unneeded “stops”, avoiding alignment faults, and so on) have
dramatically improved code quality and yielded even better performance. In some
cases, application performance has improved by a factor of 3 due to the interp
compiler generating native instructions.
And Oracle continues to measure and analyze performance of the Oracle Rdb
database product family on the HP OpenVMS operating system for the Integrity
Server platform. An extensive set of regression tests are continuously run in
our development environment to help ensure correctness of the generated code. We
are also in constant contact with our customer based to help understand their
performance challenges. This input helps us decide where to focus our
optimization efforts to everyone’s benefit.
Models and Examples Followed
A number of different resources were referenced in regards to code
generation. In addition to the (voluminous) Intel documentation of the Itanium
architecture, we also utilized the compiler machine code listings from high
level language compilers on OpenVMS (for example, BLISS, C and MACRO32 which all
use GEM code generator and the C++ compiler which uses an Intel code generator).
Both the OpenVMS debugger and system dump analyzer include the ability to
format an instruction stream which helped significantly when we were learning
the intricate details of the Itanium architecture.
The OpenVMS listings include the MACRO2000 facility which implements the
MACRO32 compiler. This was used in many cases as a template for code generation
for complex alpha instructions (such as ZAP and MSK). The internet also proved
to be an excellent resource for example instruction streams and discussions of
Itanium performance in regards to the use of the architecture.
The Intel and OpenVMS documentation was referenced extensively while we were
creating the unwind information tables for generated code. And the OpenVMS
calling standard manual was invaluable in regards of register usage rules.
Credit and Thanks
A large number of people devoted a great many hours to this project of
developing the Oracle Rdb “Just In Time” code generator for the Itanium systems.
It is not possible to remember or credit everyone who was involved. But special
thanks and recognition are due to engineering members of both HP and Oracle
including: John Reagan, Jeanie Leab, Guenther Froehlin, Greg Jordan, Christian
Moser, Burns Fisher, Ian Smith, Martin Ramshaw, and Richard Bishop.
For more information
The Oracle Rdb web site is accessible on the internet at
www.oracle.com/rdb.
For more information about the Intel Itanium architecture and instruction set,
visit www.intel.com. For more information about the HP OpenVMS system, visit
www.hp.com and www.hp.com/go/openvms |