James Hicks
Richard Weiss
Compaq Computer Corporation
February 1999
Key words: video, multimedia, high performance computing, prefetching, motion estimation, filtering, Alpha processor
In this paper you will find an introduction to the Alpha processor's Motion Video Instructions (MVI) and to techniques for writing high-performance multimedia code.
Alpha is a 64-bit processor architecture, defined by a living architectural specification and intended to have multiple implementations over many years. The current implementations are all superscalar, which means that multiple instructions can be issued in the same cycle. For the 21164 chip, up to four instructions can be issued in a cycle, and for the 21264, up to six. For each implementation there are rules that constrain the instructions that can be co-issued. For example, the 21264 has four ALUs on the integer side, but only one of them can execute an integer multiply, so at most one integer multiply instruction can be issued in a cycle.
Motion Video Instructions, or MVI, are the DIGITAL Alpha's multimedia instructions. They are a set of Alpha processor instructions that use a single instruction to operate on multiple data in parallel (SIMD). This is accomplished by partitioning a 64-bit quadword into a vector of eight separate bytes or four separate 16-bit words. Any code that can take advantage of this parallelism can achieve up to an 8X performance boost. The instructions are:
MINUB8, MAXUB8: unsigned byte minimum/maximum
MINSB8, MAXSB8: signed byte minimum/maximum
MINUW4, MAXUW4: unsigned word minimum/maximum
MINSW4, MAXSW4: signed word minimum/maximum
PKWB, UNPKBW: pack/unpack word to byte
PKLB, UNPKBL: pack/unpack longword to byte
PERR: pixel error
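To make the partitioning concrete, here is a behavioral model of MINUB8 in portable C. This is an illustration of the semantics only, not how the hardware or any compiler implements it, and the function name is ours:

```c
#include <stdint.h>

/* Behavioral model of MINUB8: treat each 64-bit quadword as a vector
   of eight unsigned bytes and take the byte-wise minimum. */
uint64_t minub8(uint64_t ra, uint64_t rb)
{
    uint64_t rc = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(ra >> (8 * i));
        uint8_t b = (uint8_t)(rb >> (8 * i));
        rc |= (uint64_t)(a < b ? a : b) << (8 * i);
    }
    return rc;
}
```

The hardware performs all eight comparisons in a single instruction; the loop here exists only to spell out the per-byte semantics.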
This guide is an introduction to some techniques in high performance computing, with a focus on MVI. There are a few key techniques for designing efficient code for an Alpha processor. The first is to make sure that your data is in the first-level cache when you want to use it; the sections on prefetching and cache blocking explain this. The second is to take advantage of instruction-level parallelism; this is explained in the sections on software pipelining and scheduling issues. The third is to take advantage of data parallelism, which is what MVI is about. The relevant section below explains how to detect whether a processor supports MVI. This allows the programmer to branch to alternate implementations, so that legacy code can be extended. Each of the MVI instructions is described with a sample trace to show how it works. There are many examples of how MVI instructions are used, including a side-by-side example of MVI code with MMX code, as well as examples of how to use MVI for motion estimation in an MPEG encoder and for image filtering. Since MVI is not currently supported by commercial "C" compilers, it is necessary to include some assembly code to use it. This is relatively easy and can be done with an inline assembler macro, which allows the programmer to use assembly language instructions as if they were C statements. This is very useful not only for MVI but for prefetching and other techniques as well.
The techniques described in this guide can be used with OpenVMS, Digital UNIX, or Windows NT. Some of the details may change with different operating systems.
While these new instructions were intended to implement high quality software video encoding such as MPEG-1, MPEG-2, H.261 (ISDN video conferencing), and H.263 (Internet video conferencing), only your imagination as a software engineer will limit their uses. Any time data can be operated on in parallel, you will see the benefit. Desktop video publishing, video conferencing, Internet commerce, and interactive training are some target trends in visual computing.
If the application processes a large amount of data as fast as possible (high memory bandwidth), the data are all 8-bit or 16-bit integers (byte/word integer data), and the same operations are performed on all the data (parallelism), then you definitely want to explore MVI. MVI can make a critical difference in achieving video-rate encoding.
The 21164PC was the first implementation of MVI. All Alpha processors since, including the 21264, have MVI, and all future processors will implement it.
Before taking advantage of MVI instructions, one must be running on hardware that supports these extensions. This can be determined at run time by looking at bit eight of the value the AMASK instruction returns. Future extensions may use other AMASK bits. The code to do this is written in assembly language rather than C or C++, since there is no statement that corresponds to AMASK. Assembly language code can be inserted into the object file using the macro __asm. A detailed description of the instruction can be found in the Alpha Architecture Handbook.
The AMASK instruction takes three register arguments. The first register (Ra) must be R31. The second register (Rb) holds the input mask, which represents the requested architectural features: there is a one in every bit position that is being queried. The third register (Rc) is the output. Bits that correspond to architectural extensions that are present are cleared; reserved bits and bits that correspond to absent extensions are copied unchanged. If the result is zero, all requested features are present. Software may specify an input mask of all 1's to determine the complete set of architectural extensions implemented by a processor. Assigned bit definitions are defined below.
AMASK Bit Assignments
Bit 0. Support for the byte/word extension (BWX). The instructions that comprise the BWX extension are LDBU, LDWU, SEXTB, SEXTW, STB, and STW.
Bit 1. Support for the floating-point extension (FIX). The instructions that comprise the FIX extension are FTOIS, FTOIT, ITOFF, ITOFS, ITOFT, SQRTF, SQRTG, SQRTS, and SQRTT.
Bit 2. Support for the count extension (CIX). The instructions that comprise the CIX extension are CTLZ, CTPOP, and CTTZ.
Bit 8. Support for the multimedia extension (MAX). The instructions that comprise the MAX extension are MAXSB8, MAXSW4, MAXUB8, MAXUW4, MINSB8, MINSW4, MINUB8, MINUW4, PERR, PKLB, PKWB, UNPKBL, and UNPKBW.
Bit 9. Support for precise arithmetic trap reporting in hardware.
Software Note:
Use this instruction to make instruction-set decisions; use IMPLVER to make code-tuning decisions.
Implementation Note:
Instruction encoding is implemented as follows: On the 21064/21064A/21066/21068/21066A (EV4/EV45/LCA/LCA45) chips and on the 21164 (EV5), AMASK simply copies Rbv to Rc. On the 21164A (EV56), 21164PC (PCA56), and 21264 (EV6), AMASK correctly indicates support for architecture extensions by copying Rbv to Rc and clearing the appropriate bits.
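The clearing behavior can be modeled in a few lines of C. Here IMPLEMENTED is a made-up mask for a hypothetical CPU that implements BWX (bit 0) and MAX/MVI (bit 8), and fake_amask stands in for the real instruction:

```c
#include <stdint.h>

/* Hypothetical feature set: BWX (bit 0) and MVI (bit 8) present. */
#define IMPLEMENTED ((uint64_t)0x101)

/* Behavioral model of AMASK: clear the requested bits that correspond
   to implemented extensions; all other bits are copied through. */
uint64_t fake_amask(uint64_t requested)
{
    return requested & ~IMPLEMENTED;
}
```

A result of zero means every requested feature is present; a request of all 1's returns the complement of the implemented set.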
AMASK Code Examples.
Assembler file or "S" file
This short piece of code is an assembler file that can be made into an .obj file and linked to a "C" or "C++" program.
Notice the #include <kxAlpha.h>. This header file comes with the VC RISC edition compiler and the Microsoft SDK, and it is full of interesting information about the DIGITAL Alpha. There you can find the expansion of the LEAF_ENTRY() MACRO, which defines an entry point for the compiler and initializes the stack pointer.
- __int64 amaskValue;
- amaskValue = getAMASK( SOME_64BIT_MASK );
Inline Assembler MACRO (asm's)
This section describes how to use the function "__asm" to include assembly language code in your object file. Here it is used to insert the AMASK instruction and test bits in the mask returned, but it can be used more generally when assembly code is more suitable than C or C++. First it is necessary to convince the compiler that there is a function called "__asm." We do that with the extern long __asm( char *, … ); declaration. This is saying that __asm is some function that returns a long – on an Alpha any integer value returned from a function will be in the v0 register and the compiler knows that.
Next we declare the __asm function to be intrinsic. That means it will be implemented in native assembler word for word. This can be seen in the line below that reads #pragma intrinsic __asm
Following this there are a couple of #define's for several of the possible bit masks used to determine the availability of a desired extension. The bit we are concerned with is bit 8, defined as #define SUPPORT_FOR_MVI ((__int64)(0x0100)).
Finally the MACRO definition is provided. This MACRO uses inline assembler to invoke the "amask" instruction. The inline code alone is as follows: __asm("amask $a0,$v0", (x)), where (x) represents some 64-bit mask with the bits we are interested in set or cleared as appropriate. One could just as easily write someInt64 = __asm("amask $a0,$v0", 0x0100); and then test the variable: someInt64 == 0 indicates that MVI is supported.
Let’s take a moment to review this inline assembler code. The string "amask $a0,$v0" indicates that register a0 contains the bitmask required by the amask instruction and that register v0 will receive the result, just as if we had written amask a0,v0 in Alpha assembler. The last parameter in the __asm() "function" call is the bitmask value that will be placed in the a0 register before the amask instruction is executed. For UNIX, the "function" is asm() without the "__".
The MACRO "thisAlphaHas()" contains a logical NOT (!) so that the code reads and functions with positive logic, allowing a programmer to write:
- if ( thisAlphaHas(SUPPORT_FOR_MVI ) )
- printf( "MVI Supported \n");
- else
- printf( "MVI Not Supported \n");
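The same pattern can be sketched self-contained in C; amask_sim below is a hypothetical stand-in for the __asm intrinsic, modeling a CPU that implements MVI only:

```c
#include <stdint.h>

#define SUPPORT_FOR_MVI ((int64_t)0x0100)
#define SUPPORT_FOR_BWX ((int64_t)0x0001)

/* Hypothetical stand-in for __asm("amask $a0,$v0", x): pretend the
   CPU implements MVI only, so just bit 8 of the request is cleared. */
static int64_t amask_sim(int64_t x)
{
    return x & ~SUPPORT_FOR_MVI;
}

/* Positive logic: amask returns zero when every requested feature is
   present, so the logical NOT turns that into a yes/no answer. */
#define thisAlphaHas(x) (!amask_sim(x))
```

On real hardware, amask_sim would be replaced by the __asm (or UNIX asm) call shown above.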
- __int64 amask, implver;
- int i;
- char opt;
- char *implname[]={"21064 (EV4)", "21164 (EV5)", "21264 (EV6)"};
- #define MAXAMASK 16
- char *amaskname[MAXAMASK] = {
- "BWX (Byte word instruction extensions)",
- "FIX (Floating point instruction extensions)",
- "CIX (Count instruction extensions)",
- "unused3",
- "unused4",
- "unused5",
- "unused6",
- "unused7",
- "MVI (Motion video instruction extensions)",
- "Precise arithmetic trap reporting in hardware",
- "unused10",
- "unused11",
- "unused12",
- "unused13",
- "unused14",
- "unused15" };
- while((opt = getopt(Argc, Argv, "vqd?")) != -1) {
- switch (opt) {
- case 'v': Verbose = 1; break;
- case 'q': Quiet = 1; break;
- case 'd': Debug = 1; break;
- default:
- printf("cputype [options]\n-v Verbose\n-q Quiet\n-d Debug\n");
- exit(1);
- }
- }
- amask = ~__asm("amask $a0, $v0", -1);
- implver = __asm("implver $v0");
- if (!Quiet) {
- if (Debug)
- printf("Implver = %d\n", implver);
- printf("Implementation version: %s\n\n", implname[implver]);
- if (Debug)
- printf("Amask = 0x%x\n", amask);
- printf("Architecture mask\n");
- }
- for (i=0;i<MAXAMASK;i++) {
- if (Verbose && (strncmp(amaskname[i], "unused", 6)))
- printf(" %s: %s\n", amaskname[i], (amask & (1<<i))?"YES":"NO");
- else if (!Quiet && (amask & (1<<i)))
- printf(" %s\n", amaskname[i]);
- }
MVI Instruction set description
Alpha MVI code and x86 MMX code are used in the following examples. No judgments or comments about the relative performance of similar code are made or implied; these two architectures are very different in their respective implementations. Keep in mind that RISC architectures in general will use more assembler instructions, each representing fewer clock cycles. It is likewise important to remember that, if used improperly, these instructions can stall both chips. These examples are by no means the best way; refer to the appropriate documentation for the best information on optimizing your code and preventing stalls. These are brute-force examples of functionality only.
Unpack Instructions
UNPKBW – Unpack Bytes to Words expands the low four bytes in a quadword to four words in a quadword. This implies two unpacks to re-acquire all eight pixels.
Packed Byte: 8 bytes packed into 64 bits
Alpha REGISTER r1 at start
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Start with the data shown in this table: a 64-bit quadword holding eight pixels represented as byte values. Assume this data is in 64-bit register r1.
Alpha MVI code
x86 MMX code
- unpkbw r1, r5 // unpack the low four bytes into reg r5
- srl r1, 32, r1 // shift reg r1 right 32 bits, result in r1
- unpkbw r1, r6 // unpack the high four bytes into reg r6
- pxor mm6, mm6 // set mm6 = 0
- movq mm2, mm1 // mm2 = mm1
- punpcklbw mm1, mm6 // unpack low four bytes into mm1
- punpckhbw mm2, mm6 // unpack high four bytes into mm2
Alpha REGISTER r5 after unpkbw r1,r5
Bit63 Bit0
Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Alpha REGISTER r1 after srl r1, 32, r1
Bit63 Bit0
0 | 0 | 0 | 0 | Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
Alpha REGISTER r6 after unpkbw r1,r6
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
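The trace above can be reproduced with a small behavioral model of UNPKBW in C (again, a sketch of the semantics only):

```c
#include <stdint.h>

/* Behavioral model of UNPKBW: zero-extend the low four bytes of a
   quadword into four 16-bit words. */
uint64_t unpkbw(uint64_t ra)
{
    uint64_t rc = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t byte = (ra >> (8 * i)) & 0xFF;
        rc |= byte << (16 * i);
    }
    return rc;
}
```

Unpacking all eight bytes takes two of these plus the 32-bit shift, exactly as in the three-instruction sequence above.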
UNPKBL – Unpack Bytes to Longwords expands the low two bytes in a quadword to two longwords in a quadword. This implies four unpacks to re-acquire all eight pixels. While this form is needed much less often, here is how it looks.
Packed Byte: 8 bytes packed into 64 bits
Alpha REGISTER r1 at start
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Start with the data shown in this table: a 64-bit quadword holding eight pixels represented as byte values. Assume this data is in 64-bit register r1.
Alpha MVI code
x86 MMX code
- unpkbl r1, r5 // unpack the two low bytes into register r5
- srl r1, 16, r1 // shift register r1 right 16 bits
- unpkbl r1, r6 // unpack the next two bytes into register r6
- srl r1, 16, r1 // shift register r1 right 16 bits
- unpkbl r1, r7 // unpack the next two bytes into register r7
- srl r1, 16, r1 // shift register r1 right 16 bits
- unpkbl r1, r8 // unpack the next two bytes into register r8
- pxor mm6, mm6 // clear mm6
- movq mm2, mm1 // mm2 = mm1
- punpcklbw mm1, mm6 // unpack low four bytes into mm1
- punpckhbw mm2, mm6 // unpack high four bytes into mm2
- movq mm3, mm1 // mm3 = mm1
- punpcklwd mm1, mm6 // unpack low two words into mm1
- punpckhwd mm3, mm6 // unpack high two words into mm3
- movq mm4, mm2 // mm4 = mm2
- punpcklwd mm2, mm6 // unpack low two words into mm2
- punpckhwd mm4, mm6 // unpack high two words into mm4
Alpha REGISTER r5
Bit63 Bit0
Pixel 1 | Pixel 0 |
Alpha REGISTER r6
Bit63 Bit0
Pixel 3 | Pixel 2 |
Alpha REGISTER r7
Bit63 Bit0
Pixel 5 | Pixel 4 |
Alpha REGISTER r8
Bit63 Bit0
Pixel 7 | Pixel 6 |
Pack Instructions
PACKWB – Truncates the four (4) component words of the input register and writes them to the low four (4) bytes of the output register.
Alpha REGISTER r5 at start of packwb
Bit63 Bit0
Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Alpha REGISTER r6 at start of packwb
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
Alpha MVI code
SPECIAL NOTE: you cannot simply pack both halves into the same destination, because the upper four (4) bytes of the destination register are written with zeros by packwb; a second packwb would write zeros over the first four packed pixels. Instead, pack each register separately, shift, and merge:
- packwb r5, r5 // trunc and pack 4 words into 4 bytes
- packwb r6, r6 // trunc and pack 4 words into 4 bytes
- sll r6, 32, r6 // shift pixels 7-4 into place
- bis r5, r6, v0 // logical OR the two halves into v0
Alpha REGISTER r5 after packwb r5,r5
Bit63 Bit0
0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Alpha REGISTER r6 after packwb r6,r6
Bit63 Bit0
0 | 0 | 0 | 0 | Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
Alpha REGISTER r6 after sll r6, 32, r6
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | 0 | 0 | 0 | 0 |
Alpha REGISTER v0 after bis r5,r6,v0
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
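A behavioral model of packwb in C makes the truncate-and-pack step explicit (a sketch of the semantics, not the hardware):

```c
#include <stdint.h>

/* Behavioral model of packwb: truncate each of the four 16-bit words
   of a quadword to a byte and pack them into the low four bytes;
   the upper four bytes of the result are written with zeros. */
uint64_t packwb(uint64_t ra)
{
    uint64_t rc = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t b = (ra >> (16 * i)) & 0xFF;  /* word truncated to byte */
        rc |= b << (8 * i);
    }
    return rc;
}
```

Because the upper half of the result is zeroed, rebuilding eight bytes takes two packs, a shift, and an OR, as the SPECIAL NOTE above explains.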
PACKLB – Truncates the two (2) component longwords in the input register to byte values and writes them to the low two (2) bytes of the output register.
Alpha REGISTER r5 at start
Bit63 Bit0
Pixel 1 | Pixel 0 |
Alpha REGISTER r6 at start
Bit63 Bit0
Pixel 3 | Pixel 2 |
Alpha REGISTER r7 at start
Bit63 Bit0
Pixel 5 | Pixel 4 |
Alpha REGISTER r8 at start
Bit63 Bit0
Pixel 7 | Pixel 6 |
Alpha MVI code
- packlb r5, r5 // trunc and pack 2 dwords into 2 bytes
- packlb r6, r6 // trunc and pack 2 dwords into 2 bytes
- sll r6, 16, r6 // shift pixels 3 and 2 into place
- bis r5, r6, r5 // logical OR pixels 3 and 2 into r5
- packlb r7, r7 // trunc and pack 2 dwords into 2 bytes
- sll r7, 32, r7 // shift pixels 5 and 4 into place
- bis r5, r7, r5 // logical OR pixels 5 and 4 into r5
- packlb r8, r8 // trunc and pack 2 dwords into 2 bytes
- sll r8, 48, r8 // shift pixels 7 and 6 into place
- bis r5, r8, r5 // logical OR pixels 7 and 6 into r5
Alpha REGISTER r5 after packlb r5,r5
Bit63 Bit0
0 | 0 | 0 | 0 | 0 | 0 | Pixel 1 | Pixel 0 |
Alpha REGISTER r6 after packlb r6,r6
Bit63 Bit0
0 | 0 | 0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 |
Alpha REGISTER r6 after sll r6,16,r6
Bit63 Bit0
0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 | 0 | 0 |
Alpha REGISTER r5 after bis r5,r6,r5
Bit63 Bit0
0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Registers r7 and r8 just repeat this sequence, shifted into place by the correct number of bits.
Byte and Word Minimum and Maximum (MINxxx) (MAXxxx)
MINUB8 Vector Unsigned Byte Minimum
MINUW4 Vector Unsigned Word Minimum
MINSB8 Vector Signed Byte Minimum
MINSW4 Vector Signed Word Minimum
MAXUB8 Vector Unsigned Byte Maximum
MAXUW4 Vector Unsigned Word Maximum
MAXSB8 Vector Signed Byte Maximum
MAXSW4 Vector Signed Word Maximum
These instructions take the form MINxxx Ra,Rb,Rc or MAXxxx Ra,Rb,Rc, where the values in register Ra are compared to the values in register Rb and the result is placed in register Rc. These are vector-wise comparisons: each byte or each word is compared when using the xxxxB8 or xxxxW4 instructions, respectively.
Alpha MVI code
- minub8 r5, r6, r6 // get the minimums in each BYTE position
Alpha REGISTER r5 at start
1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
Alpha REGISTER r6 at start
0 | 1 | 2 | 2 | 0 | 0 | 1 | 1 |
Alpha REGISTER r6 after calling minub8 r5,r6,r6
0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
Alpha MVI code
- minuw4 r5, r6, r6 // get the minimums in each WORD position
Alpha REGISTER r5 at start
0x0000 | 0x00FF | 0x0000 | 0x0001 |
Alpha REGISTER r6 at start
0x0000 | 0x0001 | 0x0000 | 0x00F3 |
Alpha REGISTER r6 after call to minuw4 r5,r6,r6
0x0000 | 0x0001 | 0x0000 | 0x0001 |
PERR – Pixel Error
This instruction takes the eight bytes packed into two quadword registers and computes the absolute differences between them, then adds the eight intermediate results and right-aligns the sum in the destination register. The net result is that motion estimation calculations on eight pixels can be done in a single clock tick on an MVI-capable Alpha.
Paul Rubinfeld, Bob Rose and Michael McCallig present several very good examples of applications that use PERR in their paper entitled "Motion Video Extensions for Alpha." Look in Helpful URL’s for more information.
Alpha MVI code
- perr r5,r6,v0
Alpha REGISTER r5 at start
1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
Alpha REGISTER r6 at start
0 | 1 | 2 | 2 | 0 | 0 | 1 | 1 |
Intermediate Absolute Differences
1 | 1 | 1 | 2 | 1 | 0 | 0 | 1 |
Sum of Absolute Differences placed in Alpha REGISTER v0 (64-bit)
0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
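PERR can be modeled in C as follows (a behavioral sketch; the hardware does all of this in one instruction):

```c
#include <stdint.h>

/* Behavioral model of PERR: sum of the absolute differences of the
   eight byte pairs in two quadwords, right-aligned in the result. */
uint64_t perr(uint64_t ra, uint64_t rb)
{
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++) {
        int a = (int)((ra >> (8 * i)) & 0xFF);
        int b = (int)((rb >> (8 * i)) & 0xFF);
        sum += (uint64_t)(a > b ? a - b : b - a);
    }
    return sum;
}
```

Feeding it the register values from the trace above gives the same sum of absolute differences, 7.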
How MVI instructions are used in software
In this section, we give some guidelines and examples for using the instructions above. The first issue is whether you are writing code for just one type of machine or not. The section on Detecting MVI was useful for determining at runtime whether the current Alpha machine has MVI. You might be writing the same code for multiple architectures, so you may need to test first to determine whether the current machine is an Alpha at all. The following section on portable coding explains that. The other subsections describe how to do a bytewise add with saturation as well as a simple convolution filter.
Portable Coding Techniques
- //+
- // setup the function pointer by testing for MVI support
- //-
- #if defined(_M_ALPHA) // building for Alpha (the compiler's target macro)
- if ( thisProcessorHas( SUPPORT_FOR_MVI ) )
- //+
- // we can use MVI
- // point the function pointer at the MVI function
- //-
- videoProcessingFunctionPointer = functionUsingMVI;
- else
- //+
- // we can not use MVI
- // point at the NON MVI version of the function
- //-
- videoProcessingFunctionPointer = functionWithoutMVI;
- #else
- //+
- // we are building for x86 MMX; point at that code
- //-
- videoProcessingFunctionPointer = FunctionUsingMMX;
- #endif
Unsigned Saturating Arithmetic
The current versions of MVI do not have explicit byte- or word-partitioned arithmetic operations, but it is possible to produce the same results with multiple instructions. First we explain why one would want a saturating add. Suppose we want to blend two pixels by adding their red, green, and blue values together and dividing by two (a right shift of one bit). If a simple add is applied to the 16-bit values shown below and the sum is truncated to 16 bits, the answer is actually less than both of the addends, and the shifted value will not be the average; the rest of the answer is in the carry out of the top bit, in this case the 17th bit. Applying this to the green component of a pixel's color value (assuming that 16 bits are used to represent each color), blending these two green pixels together would result in a pixel that is far less green than it should be. Additionally, since pixel intensities or colors can only range from all of a color (0xFFFF) to none of a color (0x0000), the 17th bit is meaningless to us in terms of "greenness."
  1111 0000 0000 0000 (0xF000)
+ 0011 0000 0000 0000 (0x3000)
Result truncated to 16 bits: 0010 0000 0000 0000 (0x2000)
Right-shifted value: 0x1000 (not the result we desire)
Unsigned saturating arithmetic causes values that would overflow to be "clamped" to the maximum possible value (0xFFFF) and values that would underflow to be "clamped" to the minimum (0x0000). A saturating add of these two green pixels overflows 16 bits and therefore gets "clamped" to a maximum value of 0xFFFF which is a much better representation of the color expected after blending two medium green pixels.
  0xF000
+ 0x3000
result "clamped" to maximum: 0xFFFF
right shift with rounding: 0x8000 (much better)
In practice, a better solution might be to first do the division by two and then add the results, in which case overflow is not a real problem. However, the conversion from YUV video representation to RGB representation is a problem that requires saturated arithmetic, where the data are all unsigned bytes. In its common (ITU-R BT.601) form, the conversion is given by
R = Y + 1.402 (Cr - 128)
G = Y - 0.344 (Cb - 128) - 0.714 (Cr - 128)
B = Y + 1.772 (Cb - 128)
where the results must be clamped to the range 0-255.
Alpha MVI "unsigned saturating add for packed words"
Alpha REGISTER r5 at start
Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
0x0000 | 0xFFFF | 0x0000 | 0x0001 |
Alpha REGISTER r6 at start
0x0000 | 0x0001 | 0x0000 | 0xFFFF |
What happens?
First we take the 1's complement of register r6. The "eqv" instruction does a more general operation than implied by the comment: "eqv" alone is Rc <- Ra XOR (NOT Rb), so when Rb is zero, (NOT Rb) is all 1's, and XOR with all 1's flips every bit, producing the 1's complement. Why do we want the 1's complement of r6? First of all, it would work with either r5 or r6, as long as we call minuw4 using whichever register we complemented. The 1's complement is the largest number we can add to the original number without overflowing. Knowing that, we simply pick the smaller of this pixel's 1's complement and the corresponding pixel's original value and do the addition. Since we picked the smaller of a pair that contains the largest value that would not overflow, we are guaranteed not to overflow.
Alpha REGISTER t0 after call to eqv r6,zero,t0
0xFFFF | 0xFFFE | 0xFFFF | 0x0000 |
Alpha REGISTER r5 after call to minuw4 r5,t0,r5
0x0000 | 0xFFFE | 0x0000 | 0x0000 |
Alpha REGISTER v0 after call to addq r5,r6,v0
0x0000 | 0xFFFF | 0x0000 | 0xFFFF |
Alpha MVI "unsigned saturating subtract of packed words"
Alpha REGISTER r5 at start
Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
0x0000 | 0x00FF | 0x0000 | 0x0001 |
Alpha REGISTER r6 at start
0x0000 | 0x0001 | 0x0000 | 0x00F3 |
Alpha REGISTER r6 after call to minuw4 r5,r6,r6
0x0000 | 0x0001 | 0x0000 | 0x0001 |
What happens?
When we started, Pixel 0 (the low word) in r5 = 0x0001 and in r6 = 0x00F3; subtracting r6 from r5 would produce a value that exceeds the limits of an unsigned 16-bit word, i.e. something negative (-242). Recall the behavior of "clamping" to the minimum value: that is the effect achieved when the per-word minimums are placed in the r6 register and then subtracted from the original r5 values. If the minimum pixel value was already in r6, as is the case with Pixel 2, then the simple unsigned subtract yields a value that is less than the original but no less than the minimum. If, however, the minimum value was in r5, then that value is moved into r6 by the minuw4 instruction and ultimately subtracted from itself, resulting in zero, "clamped" to the minimum unsigned value.
Alpha REGISTER v0 after call to subq r5,r6,v0
Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
0x0000 | 0x00FE | 0x0000 | 0x0000 |
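Both tricks can be captured in a short behavioral model in C, built on a word-wise minimum. This is a sketch of the eqv/minuw4/addq and minuw4/subq sequences above; the function names are ours:

```c
#include <stdint.h>

/* Behavioral model of MINUW4: word-wise unsigned minimum. */
static uint64_t minuw4(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t wa = (uint16_t)(a >> (16 * i));
        uint16_t wb = (uint16_t)(b >> (16 * i));
        r |= (uint64_t)(wa < wb ? wa : wb) << (16 * i);
    }
    return r;
}

/* Saturating add: min(r5, ~r6) is the largest value that can be added
   to each word of r6 without overflow, so a plain 64-bit add never
   carries from one word into the next. */
uint64_t satadd_uw4(uint64_t r5, uint64_t r6)
{
    return minuw4(r5, ~r6) + r6;   /* eqv, minuw4, addq */
}

/* Saturating subtract: min(r5, r6) never exceeds r5, so a plain
   64-bit subtract never borrows across word boundaries. */
uint64_t satsub_uw4(uint64_t r5, uint64_t r6)
{
    return r5 - minuw4(r5, r6);    /* minuw4, subq */
}
```

The register values from the two walkthroughs above reproduce the clamped results shown in the tables.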
Simple Pixel Filtering Example
Consider a two-dimensional array, or bitmap, of 32-bit pixel values arranged as RGBA, where the Red, Green, Blue, and Alpha values are represented in the four 8-bit components of a 32-bit unsigned long. A simple convolution filter will replace each pixel by the weighted sum of the values of the pixels surrounding it. This is done independently for each of R, G, B, and A. A typical filter algorithm might employ a loop as in the following "C" language example. The algorithm has high memory bandwidth, operates on byte/word integer data, and performs the same operations on all the data: these are the qualifiers for using MVI.
Loop on row
Loop on column
- Red = 0; // clear accumulators
- Green = 0;
- Blue = 0;
- Alpha = 0;
- // loop on some filter length and pull out the RGBA components
- for ( x = 0; x < filterLength; x++ ) {
- temp = ((inputArray[ row ][ col ] >> 24 ) & 0xFF);
- Red += temp * filterValue[x];
- temp = ((inputArray[ row ][ col ] >> 16 ) & 0xFF);
- Green += temp * filterValue[x];
- temp = ((inputArray[ row ][ col ] >> 8 ) & 0xFF);
- Blue += temp * filterValue[x];
- temp = ((inputArray[ row ][ col ] ) & 0xFF);
- Alpha += temp * filterValue[x];
- }
Alpha MVI code stub (the result is left in a 16-bit value, which would need to be shifted and packed into 8 bits again):
- //+
- // work on 2 pixels at a time ( 64bits)
- //-
- addl a1, t8, t11 // t11=addr of data t8=offs to current pixels (2)
- ldq t4, 0(t11) // get 2 pixels from array
- unpkbw t4, t0 // unpack low four bytes RGBA
- srl t4, 32, t4 // shift the high four bytes
- unpkbw t4, t1 // unpack the high four bytes RGBA
- //+
- // t5 holds the filter value
- //-
- mulq t5, t0, t0 // multiply RGBA of pixel 1 by filter value
- mulq t5, t1, t1 // multiply RGBA of pixel 2 by filter value
- //+
- // unsigned saturating adds for RGBA accumulators
- // r6 pixel 1 RGBA accumulator
- //-
- eqv t0, zero, r6 // 1's complement of t0
- minuw4 t0, r6, t0 // get the smaller values
- addq r6, t0, r6 // accumulate RGBA’s pixel 1
- //+
- // r5 pixel 2 RGBA accumulator
- //-
- eqv t1, zero, r5 // 1's complement of t1
- minuw4 t1, r5, t1 // get the smaller values
- addq r5, t1, r5 // accumulate RGBA’s pixel 2
A Side-by-Side Example of Alpha MVI and x86 MMX
This example is a loop used to blend pixel values.
First – A "C" code example
- unsigned char *frontImage;
- unsigned char *backImage;
- unsigned char *output;
- long ImgSizeX;
- long SizeY;
- long y;
- unsigned long pixelInLine;
- unsigned long x;
- unsigned short usFront;
- unsigned short usBack;
- unsigned short usTemp;
- ImgSizeX = 720;
- ImgSizeX >>= 2; // the loop below processes pixels in groups, as in the MMX version
- SizeY = 486;
- frontImage = pInputB;
- backImage = pInputA;
- output = pOutput;
- y = 0;
- do {
- pixelInLine = ImgSizeX; // pixel groups remaining in this line
- do {
- //+
- // replace MVI code with loop
- //-
- for ( x=0;x<8;x++ ) {
- usFront = ((unsigned short)(frontImage[x] & 0x00ff));
- usBack = ((unsigned short)(backImage[x] & 0x00ff));
- usFront -= usBack;
- usFront *= s_ubAlpha;
- usFront += 0x0080;
- usTemp = usFront;
- usTemp >>= 8;
- usFront += usTemp;
- usFront >>= 8;
- usFront += usBack;
- output[x] = (unsigned char)( usFront & 0x00ff );
- }
- pixelInLine--;
- //+
- // move the pointers up to the new offset
- //-
- frontImage += 8;
- backImage += 8;
- output += 8;
- } while ( pixelInLine > 0 );
- y++;
- } while ( y < SizeY );
- s_ubAlpha--;
Next – x86 MMX Code as inline assembler in a "C" file.
- LONG ImgSizeX = 720;
- LONG SizeY = 486;
- LONG y;
- static __int64 ROUND = 0x0080008000800080;
- static __int64 mmAlphaValue = 0x00FF00FF00FF00FF;
- // mmAlphaValue should hold A | A | A | A
- __asm
- {
- pxor mm6, mm6 // Clear mm6...
- movq mm5, mmAlphaValue // mm5 = A | A | A | A
- movq mm7, ROUND
- }
- for ( y = 0; y < SizeY; y++) {
- __asm {
- mov esi, pInputB // esi is the front image
- mov edi, pInputA // edi is the back image
- mov edx, pOutput // edx is the output image
- xor ebx, ebx // ebx = 0, pixel offset
- mov ecx, ImgSizeX // ecx = for counter for pixel in line
- shr ecx, 2 // Processing 4 pixels at a time
- jz finisha // Skip if no pixels to process
loopa:
- movq mm0, [esi + ebx] // mm0 = Front[0:7]
- movq mm2, [edi + ebx] // mm2 = Back [0:7]
- movq mm1, mm0 // mm1 = mm0
- movq mm3, mm2
- punpcklbw mm0, mm6 // mm0 = Front[0:3]
- punpckhbw mm1, mm6 // mm1 = Front[4:7]
- punpcklbw mm2, mm6 // mm2 = Back [0:3]
- punpckhbw mm3, mm6 // mm3 = Back [4:7]
- psubw mm0, mm2 // mm0 = Front - Back [0:3]
- psubw mm1, mm3 // mm1 = Front - Back [4:7]
- pmullw mm0, mm5 // mm0 = FB0 * Alpha
- pmullw mm1, mm5 // mm1 = FB1 * Alpha
- paddw mm0, mm7 // mm0 = FB0 * Alpha + ROUND = C0
- paddw mm1, mm7 // mm1 = FB1 * Alpha + ROUND = C1
- movq mm2, mm0 // mm2 = C0
- movq mm3, mm1 // mm3 = C1
- psrlw mm0, 8 // mm0 = C0 >> 8
- psrlw mm1, 8 // mm1 = C1 >> 8
- paddw mm0, mm2 // mm0 = C0 + (C0 >> 8)
- paddw mm1, mm3 // mm1 = C1 + (C1 >> 8)
- psrlw mm0, 8 // mm0 = Result Pixel 0
- psrlw mm1, 8 // mm1 = Result Pixel 1
- packuswb mm0, mm1 // mm0 = Result [0:7]
- paddb mm0, [edi + ebx] // Add the back (cached)
- movq [edx + ebx], mm0 // Store the result
- add ebx, 8 // Go to next pixel group
- dec ecx // decrement counter
- jg loopa
finisha:
- add esi, ebx // Increment the pointers
- add edi, ebx
- add edx, ebx
- add eax, ebx
- mov pOutput, edx // Store back the pointers
- mov pInputB, esi
- mov pInputA, edi
}
}
- // code adapted from a performance test example provided by Richard Fuoco
- __asm emms // Clear the MMX Status
Now - Alpha MVI Code as an Alpha Assembler implementation
- mov 486, t10 // top of for ( y=0; y<486; y++ )
- mov _ROUND_, t6
loopy:
- // we are just doing it as a do loop
- // and counting down from 486 to 0
- mov 720, t9 // pixels in line is 720
- srl t9, 2, t9 // processing 4 pixels at a time
- ble t9, finisha // test to see if we have some pixels to
- // process; I know we will, it's hard coded
- // in this example to (720 >> 2) or (180)
loopa:
- ldl t4, 0(a0) // front
- ldl t5, 0(a1) // back
- unpkbw t4, t0 // unpack four bytes
- unpkbw t5, t2 // unpack four bytes
- // (the blend arithmetic is omitted in this stub)
- lda a2, 4(a2)
- lda t9, -1(t9) // count down the pixel groups
- bgt t9, loopa
- lda t10, -1(t10) // bottom of for ( y=0; y<486; y++ )
- bgt t10, loopy // in this case it is a while loop
Additional techniques to optimize code
To get maximum performance, it is necessary to take advantage of parallelism and to optimize cache performance. Of course, one of the basic features of the Alpha processors is instruction level parallelism (ILP) or superscalar implementation. This means that multiple instructions can be issued in a single cycle. In addition, there is SIMD parallelism that is supported by MVI in the form of byte- and word-segmented instructions. This feature can be used in two ways: the CPU can do more operations per cycle or it can do the same number of operations with fewer registers. A third type of parallelism is software pipelining, which makes better use of the superscalar design.
One of the problems with using ILP is filling all of the issue slots with instructions that are ready and can be co-issued. It is often the case that different iterations of a loop will be independent or weakly dependent (the beginning of an iteration only depends on the beginning of an earlier iteration). The idea of software pipelining is to have more than one iteration of a loop executing at the same time. The iterations are initiated at constant intervals of cycles. For example, suppose the body of a loop can be decomposed into two parts A and B. The i-th iteration of the loop would be A(i)|B(i). Take the simplest case, where different iterations of the loop are independent. If we rearrange the loop so that in one iteration we do B(i-1)|A(i), then we can further move code from B(i-1) so that it is executing in parallel with A(i). This has the potential for accommodating latency in both of these pieces. Starting and stopping the loop is handled by a prologue, which does A(0), and an epilogue, which does B(N). An example of software pipelining can be found in the filtering code.
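As a sketch in C (the function is hypothetical; compilers and hand schedulers apply the same idea at the instruction level), consider a loop whose body splits into A(i), a load-and-scale step, and B(i), a store. The loop can be rotated so that B(i-1) executes alongside A(i):

```c
#include <stddef.h>

/* Software-pipelined copy-and-scale: the prologue performs A(0), each
   loop iteration runs A(i) and B(i-1) together, and the epilogue
   performs the final store B(N-1). */
void scale_pipelined(const int *src, int *dst, size_t n, int k)
{
    if (n == 0)
        return;
    int pending = src[0] * k;          /* prologue: A(0) */
    for (size_t i = 1; i < n; i++) {
        int next = src[i] * k;         /* A(i): load and scale */
        dst[i - 1] = pending;          /* B(i-1): store previous result */
        pending = next;
    }
    dst[n - 1] = pending;              /* epilogue: B(N-1) */
}
```

The store of iteration i-1 no longer has to wait for the load of iteration i, which gives the scheduler latitude to hide load latency.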
It is always important to know where your data is coming from: for example, it could be in the first-level cache, the board-level cache, or memory. The single most important technique for improving memory performance is prefetching data. Loading data from memory can take 50-150 times longer than loading it from the first-level cache. This means that a program that accesses a lot of data can spend most of its time waiting for the data to be loaded, so that speeding up the numerical operations will have no effect.
There is another way in which accurate prefetching improves program performance. When a load is issued, one of the first things that happens is that a physical register is allocated to hold the resulting data. If the data is not in the D-cache (first level cache), then this register cannot be used for any other instruction until that value is filled. In some cases, the machine will stall on other instructions because there are no physical registers available. Prefetches do not use up any physical registers. In addition, loads or prefetches that miss in the D-cache require another resource called the missed address file (MAF). There are only eight of these in the 21264 and six in the 21164. If you issue more than eight, the extra ones will wait. This can add to the latency problem.
The following is a detailed description of how to use prefetching in your code. The prefetch instructions which will bring a cache block into the first-level cache are loads to either R31 or F31. The registers R31 for integer instructions and F31 for floating point instructions always contain the value zero regardless of what the program does. A load to one of these registers will cause the data to be fetched from memory or a second- or third-level cache if it is not in the first-level cache. The value is not actually put in a register, but it will be put in the first-level cache if it wasn’t there already. This is particularly important for the 21264. A prefetch looks like an ordinary load except the target register is either r31 or f31. There are four types of prefetch for the 21264. They are:
Instruction type    Description
ldl                 data will be evicted last
ldq                 data will be evicted next
lds                 data will be modified, so mark it dirty
wh64                the entire cache block will be modified, so don't fetch it from memory
There is also a general rule about how far ahead to prefetch. If you have N streams of data that are being loaded or stored, then the number of cache blocks that you can prefetch in advance of a load for each stream is Min(4, 8/N). This roughly corresponds to the limitation of eight MAF's. For example, if there are two streams of data, then you should prefetch 4 blocks ahead, or 256 bytes. Thus, if your code has a load:
ldq r2, 80(r1)
then you would add 80 to 256:
ldl r31, 336(r1)
Generally, the first three types of prefetch in the table above will not get you into any trouble even if you issue more than you really need, and they will not cause an exception if the address is not valid. However, wh64 can have a bad side effect if you run off the end of your data, because it may destroy the data at the address you give it if the block is not in the L1 cache. The way to put prefetches into routines in "C" or FORTRAN is with asm's. For example, if you have a pointer which is the appropriate number of blocks ahead of the values you are fetching, then you can pass it to the assembly macro:
__asm("ldl $r31, %0", ptr);
There are some implementation differences between the 21164 and the 21264. For the 21164, the only difference between the ldl and ldt prefetches is that "ldl r31" will always load the value into the first-level cache, but "ldt f31" will load into the second-level cache, and will load into the first-level cache only if it does not interfere with other loads. This distinction is present in the 21164 but not the 21264. The lds is a prefetch with intent to modify, so it should be used if the program is going to read, modify, and store a new value at the same address. Ldq is a prefetch with the "evict next" property. The first-level cache of the 21264 is two-way associative, so there are two blocks associated with each index (cache line). The one that is normally evicted is the older of the two. This can be overridden with the "evict next" property, which is used when the data will fit in the second-level cache but not in the first, so that other useful data will remain in the first-level cache. The compiler will often insert these instructions into your code; however, it is always a good idea to check this by searching the disassembly for them. Here is an example which illustrates the use of prefetching.
Consider the following loop, which multiplies each element of one array by a constant (held in f0) and adds the corresponding element of a second array:
loop:
    ldt f1, 0(a0)
    ldt f2, 0(a1)
    lda a0, 8(a0)
    lda a1, 8(a1)
    mult f1, f0, f1
    addt f1, f2, f1
    stt f1, -8(a0)      # store to the element loaded above (a0 already advanced)
    lda a2, -1(a2)
    bgt a2, loop
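In C terms, the loop above computes roughly the following (names are illustrative; a0 and a1 hold the array pointers, f0 the constant, and a2 the count):

```c
/* C equivalent of the assembly loop: a[i] = a[i]*c + b[i], in place. */
void scale_accum(double *a, const double *b, double c, long n)
{
    for (long i = 0; i < n; i++)
        a[i] = a[i] * c + b[i];
}
```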
The loop can be improved by unrolling it four times and inserting prefetches for each of the arrays.
loop:
    ldt f1, 0(a0)
    ldt f2, 8(a0)
    ldt f3, 16(a0)
    ldt f4, 24(a0)
    ldt f5, 0(a1)
    ldt f6, 8(a1)
    ldt f7, 16(a1)
    ldt f8, 24(a1)
    mult f1, f0, f1
    mult f2, f0, f2
    mult f3, f0, f3
    mult f4, f0, f4
    addt f1, f5, f1
    addt f2, f6, f2
    addt f3, f7, f3
    addt f4, f8, f4
    stt f1, 0(a0)
    stt f2, 8(a0)
    stt f3, 16(a0)
    stt f4, 24(a0)
    ldl r31, 256(a1)    # prefetch
    lds f31, 256(a0)    # prefetch
    lda a2, -4(a2)
    lda a0, 32(a0)
    lda a1, 32(a1)
    bgt a2, loop
This program illustrates the use of prefetches, but the scheduling of instructions is far from optimal. Note that the prefetch is four cache blocks ahead of the actual loads.
Cache Locality
Programs that take advantage of cache locality make more efficient use of the cache hierarchy. There are two types of cache locality: temporal and spatial. Temporal locality is an issue if the same data is accessed more than once. Loads and stores are often the source of large delays when the data is not in the cache, so the fewer of them the better; and the greater the temporal locality the better, since data that is in the cache at one instant is likely to still be there a little later. Ideally, when performing multiple operations on data, it is best to get one block at a time and perform all of the operations on it, as opposed to accessing the data multiple times and performing a few operations each time. While the former is often more efficient in terms of use of the machine, it is generally more difficult to code, and the code is harder to read and debug. Nevertheless, temporal locality may make a big difference for the sections of the code where most of the time is spent.
Spatial locality is also related to the order of accessing data. The order in which you access data should be compatible with the order in which it is stored. The worst case occurs when two blocks are accessed sequentially so that data in the second block is not in the cache when the first block is, and when it is brought into the cache it replaces the first block before its processing is finished.
An example which illustrates temporal locality issues is the following:
for (I = 0; I < 32000; I++) {
    b[I] = f(a[I], c1, c2, c3);  // later values of b[I] will
                                 // evict earlier ones from the cache
    ...
}
for (I = 0; I < 32000; I++) {
    c[I] = g(b[I], d1, d2, d3);  // b[I] is no longer cached
    ...
}
The problem in this example is that a long time passes between when each b[I] is produced in the first loop and when it is used in the second. Suppose b[I] is a floating-point value, which occupies eight bytes, and assume that the cache holds 64K bytes. When the first loop exits, it will have stored 256K bytes of data, of which only the last 64K will still be in the cache. When the second loop starts, none of the data that it loads first is in the cache. In addition, the first 64K bytes that are loaded will evict the last 64K bytes that were left by the first loop, so none of the data will be coming from the cache. One way to solve this problem is loop fusion, which combines the two loops:
for (I = 0; I < 32000; I++) {
    b[I] = f(a[I], c1, c2, c3);  // b[I] is stored into the cache
    ...
    c[I] = g(b[I], d1, d2, d3);  // b[I] is loaded from the cache
    ...
}
However, this solution may cause another problem. If the functions f and g require a lot of code, then the I-cache may not be large enough to hold both of them, so each loop iteration would require fetching them into the I-cache. It is also possible that there would not be enough registers to hold the constants and other intermediate results required for both function calls. In this case, these values would have to be stored. A better solution to the problem is cache blocking, which means that one block of data is processed at a time. An example of that is:
for (I = 0; I < 32000/64; I++) {
    for (J = I*64; J < I*64 + 64; J++) {
        b[J] = f(a[J], c1, c2, c3);  // these b[J]'s stay in the cache
        ...
    }
    for (J = I*64; J < I*64 + 64; J++) {
        c[J] = g(b[J], d1, d2, d3);
        ...
    }
}
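A self-contained version of the blocking pattern can be written as follows. The functions f and g below stand in for arbitrary per-element work, and the block size of 64 elements is illustrative; in practice the block is sized so the working set fits in the first-level cache.

```c
#include <stddef.h>

#define N   32000
#define BLK 64          /* elements per block; illustrative size */

/* Stand-ins for the per-element work in the example above. */
static double f(double x) { return 2.0 * x + 1.0; }
static double g(double x) { return x - 1.0; }

void process_blocked(const double *a, double *b, double *c)
{
    for (size_t i = 0; i < N; i += BLK) {
        for (size_t j = i; j < i + BLK; j++)
            b[j] = f(a[j]);          /* b[j] written, still cached ... */
        for (size_t j = i; j < i + BLK; j++)
            c[j] = g(b[j]);          /* ... when it is read back here */
    }
}
```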
Most caches are organized in 32 or 64 byte blocks. When a byte of data is requested from a cache farther down the hierarchy or memory, a block of data is fetched which contains that byte. If another byte is requested and it is in the same block as the first one, then it is already in the cache and doesn’t have to be fetched from farther away. This is spatial locality. It is most important to know the organization of the cache in order to take best advantage of its speed of access.
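The access-order point can be made concrete in C. For a row-major two-dimensional array, keeping the column index in the inner loop walks consecutive addresses, so each fetched cache block serves several successive elements; the dimensions below are illustrative.

```c
#define ROWS 256
#define COLS 256

/* Row-major traversal: the inner loop touches consecutive addresses,
   so one 64-byte cache block serves eight successive doubles.
   Swapping the two loops would stride by COLS*8 bytes per access
   and touch a new cache block almost every time. */
double sum_row_major(double m[ROWS][COLS])
{
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}
```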
Alignment of Data
In this section, we give an example of how to align a block of data and how to load unaligned data without trapping. Alignment of data is a key issue when working with byte or word data. The most efficient way to fetch data is eight bytes at a time with an ldq. However, unless the data are aligned on a quadword boundary, the ldq will trap, and the trap causes a significant slowdown. The following code excerpt aligns your data on a quadword boundary (the address is divisible by eight), so that ldq can be used without trapping:
char *base = (char *) malloc(n + 63);
char *p = base;
while (((unsigned long) p & 63) != 0) p++;
/* keep base so that the buffer can be freed later */
Note that the above code aligns the data so that the address is divisible by 64. Thus, it is aligned with cache block boundaries as well.
Even in the basic ALPHA instruction set, it is possible to process data in parallel (more than one byte or word at a time). The MVI instructions expand these capabilities.
If the data are inherently misaligned (e.g. they are in blocks whose size is not divisible by eight), they can still be fetched in groups of eight bytes or four words. The following code is taken from the ALPHA architecture manual:
Note: this assumes that the effective address of R11 is such that its remainder mod 8 is 5. The value of the aligned quadword containing (R11) is CBAx xxxx, and the value of the aligned quadword containing 7(R11) is yyyH GFED.
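The extract-and-merge sequence the manual describes (ldq_u, extql, extqh, bis, which also appears in the motion-estimation code later in this section) can be sketched in portable C. The function and argument names below are illustrative.

```c
#include <stdint.h>

/* Portable sketch of the Alpha unaligned-load idiom
   (ldq_u / extql / extqh / bis).  low is the aligned quadword
   containing byte 0 of the unaligned address, high is the aligned
   quadword containing byte 7, and off = address mod 8 (little-endian
   byte order, as on Alpha). */
uint64_t merge_unaligned(uint64_t low, uint64_t high, unsigned off)
{
    if (off == 0)
        return low;                       /* already aligned */
    return (low >> (8 * off)) | (high << (8 * (8 - off)));
}
```

With the manual's example (remainder 5), the low quadword supplies the top three bytes CBA and the high quadword supplies HGFED, which merge into the unaligned quadword HGFEDCBA.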
Applications: image filtering and motion estimation
Image filtering is convolution of an image with a two-dimensional mask of constants. Often, this convolution can be computed by two convolutions with one-dimensional masks. In this case, the two-dimensional mask is said to be separable. Since most masks that one encounters are separable, we will show how to use MVI in the case of a one-dimensional mask. Of course, this code can also be applied to one-dimensional signal filtering. It is assumed here that both the image and mask data are represented as bytes. This example shows how to use 16 bits to store the intermediate results, so we can only operate on 4 image pixels at a time. In addition, for some masks it is possible to avoid multiplies by doing shifts and adds. This is generally preferable since integer multiplies can have a latency of 15 cycles.
We give an example which uses a one-dimensional mask of length three consisting of the elements [M0 M1 M2]. In most cases the mask values are normalized so that the sum of their absolute values is unity. Based on this assumption, we represent the mask values as 8-bit binary fractions. The mask values are loaded into registers and kept there, since they are re-used over the entire image. The general strategy is:
loop:
    sll t0, 48, t8      # get low byte from i-1
    sll v0, 32, t9      # get low 2 bytes from i-1
    ldl v0, (a0)        # read 4 bytes
    unpkbw v0, v0       # unpack bytes into words
    addq v0, v0, t0     # double the middle term (multiply by M1)
    srl t0, 16, t10     # get high 3 bytes
    srl v0, 32, t11     # get high 2 bytes
    bis t8, t10, t8     # combine shifted data
    bis t9, t11, t9
    addq v0, t8, t8
    addq t8, t9, t8
    addq t8, rnd, t8    # round off
    srl t8, 2, t8       # divide by 4
    pkwb t8, t8         # pack words back to bytes
    stl t8, (a1)
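For reference, here is the same filter in plain C. The mask [1 2 1]/4 is assumed here; it is consistent with the doubled middle term and the final round-and-shift-by-2 in the code above, but any length-three mask fits the same pattern.

```c
#include <stdint.h>
#include <stddef.h>

/* Plain-C reference for a length-three filter.  The mask [1 2 1]/4
   is an assumption matching the shift-and-add MVI code: double the
   middle sample, add the neighbors, round, divide by 4. */
void filter121(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)
        dst[i] = (uint8_t)((src[i - 1] + 2 * src[i] + src[i + 1] + 2) >> 2);
}
```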
Motion estimation is used for video encoding of macroblocks of an image in a video stream. MPEG1, MPEG2, and H.263 all have block-based motion estimation as a computationally demanding task. The assumption is that the images in a stream change smoothly, and most of an image will still be present in succeeding images, although it may have moved a little and it may have changed a little. Based on this assumption, given a macroblock, one can search the previous image to find the most similar macroblock. If a similar one is found, then it is only necessary to encode the motion of the macroblock and the difference in the values. MVI allows the program to measure the similarity of blocks very quickly. In the following code fragment, we take an 8x8 block of pixels and look for the best match in a reference image. This 8x8 block can be thought of as eight quadwords, each quadword containing eight pixels. Conceptually, the inner loop takes the difference between the 8 rows of one block and the 8 rows of a subimage. The basic steps are: load the eight quadwords of the reference block; load and align the eight quadwords of the previous-frame block with ldq_u, extql, and extqh; compute a perr for each pair of rows; and accumulate the eight results.
A global search would nest this inside a double loop indexed on the rows and columns of the search area of the reference or target image. One way to improve the efficiency is to search each vertical slice by simply loading one new quadword for each new iteration and re-use the seven that overlap. It is necessary to align the data from the previous image that is being matched, since most of the time, the starting point for the search will not be aligned on a quadword boundary (see section 4.4.4). The following code is scheduled in a simple way and does not incorporate the context in which it would be used. It is not the most efficient way to schedule the code. However, some care was used to issue the unaligned loads in different cycles to avoid trapping in the case that they reference the same quadword (a1 is divisible by eight), and the 21164PC (PCA-56) latency of two was used. Also, a PCA-56 can only issue one extract or shift in a cycle.
ldq_u t8, (a1)      # row 1 from previous frame
ldq_u s5, 7(a1)     # rest of row 1
extql t8, AT, t8    # row 1 alignment process
extqh s5, AT, s5    # row 1 cont'd
ldq t0, (a0)        # row 1, reference frame, aligned
bis t8, s5, t8      # row 1 prev
ldq_u t9, 8(a1)     # row 2 prev
ldq_u s5, 15(a1)    # don't co-issue loads
extql t9, AT, t9    # alignment process
extqh s5, AT, s5
ldq t1, 8(a0)       # row 2 ref
bis t9, s5, t9
ldq_u t10, 16(a1)   # row 3 prev
ldq_u s5, 23(a1)    # row 3 prev
extql t10, AT, t10
extqh s5, AT, s5
ldq t2, 16(a0)      # row 3 ref
bis t10, s5, t10
ldq_u t11, 24(a1)   # row 4 prev
ldq_u s5, 31(a1)    # row 4 prev
extql t11, AT, t11
extqh s5, AT, s5
ldq t3, 24(a0)      # row 4 ref
bis t11, s5, t11
ldq_u t12, 32(a1)   # row 5 prev
ldq_u s5, 39(a1)    # row 5 prev
extql t12, AT, t12
extqh s5, AT, s5
ldq t4, 32(a0)      # row 5 ref
bis t12, s5, t12
ldq_u s0, 40(a1)    # row 6 prev
ldq_u s5, 47(a1)    # row 6 prev
extql s0, AT, s0
extqh s5, AT, s5
ldq t5, 40(a0)      # row 6 ref
bis s0, s5, s0
ldq_u s1, 48(a1)    # row 7 prev
ldq_u s5, 55(a1)    # row 7 prev
extql s1, AT, s1
extqh s5, AT, s5
bis s1, s5, s1
ldq t6, 48(a0)      # row 7 ref
ldq_u s2, 56(a1)    # row 8 prev
ldq_u s5, 63(a1)    # row 8 prev
ldq t7, 56(a0)      # row 8 ref
extql s2, AT, s2
perr t0, t8, v0     # first of 8 perr
extqh s5, AT, s5
perr t1, t9, s3
bis s2, s5, s2
perr t2, t10, s4
addq v0, s3, v0     # latency 2 for MVI
perr t3, t11, s3
addq v0, s4, v0
perr t4, t12, s4
addq v0, s3, v0
perr t5, s0, s3
addq v0, s4, v0
perr t6, s1, s4
addq v0, s3, v0
perr t7, s2, s3
addq v0, s4, v0
addq v0, s3, v0
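What the sequence of perr instructions accumulates is the sum of absolute differences (SAD) between the two 8x8 blocks. A plain-C reference, useful for checking an MVI implementation, is sketched below; the function name and the stride parameter are illustrative (the assembly above uses consecutive quadwords, i.e. a stride of 8 bytes).

```c
#include <stdint.h>
#include <stdlib.h>

/* Reference for the perr-based loop: sum of absolute differences
   (SAD) between two 8x8 blocks of bytes.  stride is the row pitch
   of each block in bytes (8 for the packed layout in the assembly). */
unsigned sad8x8(const uint8_t *ref, const uint8_t *prev, size_t stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += (unsigned)abs(ref[y * stride + x] - prev[y * stride + x]);
    return sum;
}
```

The encoder keeps the minimum SAD over all candidate positions; each perr computes one row's worth of this double loop in a single instruction.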
Scheduling issues
The order in which you issue instructions can have an effect on the speed at which your code executes, even on the 21264, which has out-of-order (OoO) execution of instructions. The main factor to consider is the latency of each instruction. On an in-order machine such as the 21164PC, if an instruction is issued before its input arguments are ready, the machine will have to stall until they are ready. In the case of OoO execution, the machine may continue to execute instructions, but physical registers will be allocated for the pending instructions, and the machine may stall if it runs out of physical registers.
MVI instructions have a latency of 2 for the 21164PC and 3 for the 21264, and they can be issued every cycle. For the 21264, there is a return slot two cycles after an MVI instruction is issued, and only another MVI can be issued in this slot. In the following code fragments the main point is to pay attention to the MVI instructions and when the results are used. The other instructions are filling the remaining issue slots. Here is a code fragment written for the 21164PC:
perr t0, t8, s0 // MVI instruction
addq v0, AT, v0 // non-MVI instruction
perr t1, t9, s1 // MVI
addq a5, AT, a5 // non-MVI
perr t2, t10, s2 // MVI
addq v0, s0, v0 // use result of first perr
Here is a code fragment written for the 21264:
perr t0, t8, s0 // MVI instruction
addq v0, AT, v0 // non-MVI instruction
ldq AT, (a0) // non-MVI
sll t1, 16, t1 // non-MVI
perr t1, t9, s1 // MVI
addq a0, AT, a0 // non-MVI
ldq AT, 8(a1) // non-MVI
sll t2, 16, t2 // non-MVI
perr t2, t10, s2 // either an MVI instruction or no instruction
addq a0, AT, a0 // non-MVI
ldq AT, 16(a1) // non-MVI
sll t3, 16, t3 // non-MVI
perr t3, t11, s3 // either an MVI instruction or no instruction
addq v0, s0, v0 // use result of first perr
ldq AT, 24(a1) // non-MVI
sll t4, 16, t4 // non-MVI
MVI Code Examples
http://www.digital.com/semiconductor/alpha/mvi-code-ex.htm
http://www.europe.digital.com/semiconductor/alpha/mvi-code-ex.htm
Digital Semiconductor Reference Library
http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
Issues in Multimedia Authoring
http://www.cs.sfu.ca/cs/CC/365/mark/material/notes/Chap2/Chap2.html
Motion Video Instruction Extensions for Alpha. (White Paper)
http://www.digital.com/semiconductor/papers/pmvi-abstract.htm
Advanced Technology for Visual Computing: Alpha Architecture with MVI (White Paper)
http://www.digital.com/semiconductor/mvi-backgrounder.htm
Parallelized Algorithms ( class notes )
http://ozone.crle.uoguelph.ca/chris/reportC.html
CCITT H.261 - Understanding H.261 Image Compression
http://www.igd.fhg.de/icib/telecom/ccitt/rec_h.261-1990/pvrg-descript/chapter2.5.html
Intro to YUV <> RGB
http://www.webartz.com/fourcc/
Basics of Device Independent Color
http://www.cs.sfu.ca/cs/CC/365/mark/material/notes/Chap3/Chap3.3/sRGB/sRGB.html
Software developers’ homepage with lots of kits. Look for Spike.
http://www.digital.com/semiconductor/alpha/developers.htm
Digital continuous profiling infrastructure
http://www.research.digital.com/SRC/dcpi/
Alpha Architecture Handbook
http://ftp.digital.com/pub/Digital/info/semiconductor/literature/alphahb2.pdf