James Hicks
Richard Weiss
Compaq Computer Corporation
February 1999
Key words: video, multimedia, high performance computing, prefetching, motion estimation, filtering, Alpha processor
In this paper you will find an introduction to the Alpha processor's Motion Video Instructions (MVI) and to techniques for writing high-performance multimedia code.
Alpha is a 64-bit processor architecture, defined by a living architectural specification and intended to have multiple implementations over many years. The current implementations are all superscalar, which means that multiple instructions can be issued in the same cycle. For the 21164 chip, up to four instructions can be issued in a cycle, and for the 21264, up to six. For each implementation there are rules that constrain the instructions that can be co-issued. For example, the 21264 has four ALUs on the integer side, but only one of them can execute an integer multiply, so at most one integer multiply instruction can be issued in a cycle.
Motion Video Instructions, or MVI, are the DIGITAL Alpha's multimedia instructions. They are a set of Alpha processor instructions that use a single instruction to operate on multiple data in parallel (SIMD). This is accomplished by partitioning a 64-bit quadword into a vector of eight separate bytes or four separate 16-bit words. Any code that can take advantage of this parallelism can achieve up to an 8X performance boost. The instructions are:
MINUB8, MAXUB8: unsigned byte minimum/maximum
MINSB8, MAXSB8: signed byte minimum/maximum
MINUW4, MAXUW4: unsigned word minimum/maximum
MINSW4, MAXSW4: signed word minimum/maximum
PKWB, UNPKBW: pack/unpack word to byte
PKLB, UNPKBL: pack/unpack longword to byte
PERR: pixel error
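To make the partitioning concrete, here is a behavioral model of MINUB8 in portable C. This is an illustration of the semantics only, not how the hardware or any compiler implements it, and the function name is ours:

```c
#include <stdint.h>

/* Behavioral model of MINUB8: treat each 64-bit quadword as a vector
   of eight unsigned bytes and take the byte-wise minimum. */
uint64_t minub8(uint64_t ra, uint64_t rb)
{
    uint64_t rc = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(ra >> (8 * i));
        uint8_t b = (uint8_t)(rb >> (8 * i));
        rc |= (uint64_t)(a < b ? a : b) << (8 * i);
    }
    return rc;
}
```

The hardware performs all eight comparisons in a single instruction; the loop here exists only to spell out the per-byte semantics.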
This guide is an introduction to some techniques in high performance computing, with a focus on MVI. There are a few key techniques for designing efficient code for an Alpha processor. The first is to make sure that your data is in the first-level cache when you want to use it; the sections on prefetching and cache blocking explain this. The second is to take advantage of instruction-level parallelism; this is explained in the sections on software pipelining and scheduling issues. The third is to take advantage of data parallelism, which is what MVI is about. The relevant section below explains how to detect whether a processor supports MVI. This allows the programmer to branch to alternate implementations, so that legacy code can be extended. Each of the MVI instructions is described with a sample trace to show how it works. There are many examples of how MVI instructions are used, including a side-by-side example of MVI code with MMX code, as well as examples of how to use MVI for motion estimation in an MPEG encoder and for image filtering. Since MVI is not currently supported by commercial "C" compilers, it is necessary to include some assembly code to use it. This is relatively easy and can be done with an inline assembler macro, which allows the programmer to use assembly language instructions as if they were C statements. This is very useful not only for MVI but for prefetching and other techniques as well.
The techniques described in this guide can be used with OpenVMS, Digital UNIX, or Windows NT. Some of the details may change with different operating systems.
While these new instructions were intended to implement high quality software video encoding such as MPEG-1, MPEG-2, H.261 (ISDN video conferencing), and H.263 (Internet video conferencing), only your imagination as a software engineer will limit their uses. Any time data can be operated on in parallel, you will see the benefit. Desktop video publishing, video conferencing, Internet commerce, and interactive training are some target trends in visual computing.
If the application processes a large amount of data as fast as possible (high memory bandwidth), the data are all 8-bit or 16-bit integers (byte/word integer data), and the same operations are performed on all the data (parallelism), then you definitely want to explore MVI. MVI can make a critical difference in achieving video-rate encoding.
The 21164PC was the first implementation of MVI. All Alpha processors since, including the 21264, have MVI, and all future processors will implement it.
Before taking advantage of MVI instructions, one must be running on hardware that supports these extensions. This can be determined at run time by looking at bit eight of the value the AMASK instruction returns. Future extensions may use other AMASK bits. The code to do this is written in assembly language rather than C or C++, since there is no statement that corresponds to AMASK. Assembly language code can be inserted into the object file using the macro __asm. A detailed description of the instruction can be found in the Alpha Architecture Handbook.
The AMASK instruction takes three register arguments. The first register (Ra) must be R31. The second register (Rb) holds the input mask, which represents the requested architectural features: there is a one in every bit position that is being queried. The third register (Rc) is the output. Bits that correspond to architectural extensions that are present are cleared; reserved bits and bits that correspond to absent extensions are copied unchanged. If the result is zero, all requested features are present. Software may specify an input mask of all 1's to determine the complete set of architectural extensions implemented by a processor. Assigned bit definitions are defined below.
AMASK Bit Assignments
Bit 0. Support for the byte/word extension (BWX). The instructions that comprise the BWX extension are LDBU, LDWU, SEXTB, SEXTW, STB, and STW.
Bit 1. Support for the floating-point extension (FIX). The instructions that comprise the FIX extension are FTOIS, FTOIT, ITOFF, ITOFS, ITOFT, SQRTF, SQRTG, SQRTS, and SQRTT.
Bit 2. Support for the count extension (CIX). The instructions that comprise the CIX extension are CTLZ, CTPOP, and CTTZ.
Bit 8. Support for the multimedia extension (MAX). The instructions that comprise the MAX extension are MAXSB8, MAXSW4, MAXUB8, MAXUW4, MINSB8, MINSW4, MINUB8, MINUW4, PERR, PKLB, PKWB, UNPKBL, and UNPKBW.
Bit 9. Support for precise arithmetic trap reporting in hardware.
Software Note:
Use this instruction to make instruction-set decisions; use IMPLVER to make code-tuning decisions.
Implementation Note:
Instruction encoding is implemented as follows: On the 21064/21064A/21066/21068/21066A (EV4/EV45/LCA/LCA45) chips and on the 21164 (EV5), AMASK simply copies Rbv to Rc. On the 21164A (EV56), 21164PC (PCA56), and 21264 (EV6), AMASK correctly indicates support for architecture extensions by copying Rbv to Rc and clearing the appropriate bits.
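The clearing behavior can be modeled in a few lines of C. Here IMPLEMENTED is a made-up mask for a hypothetical CPU that implements BWX (bit 0) and MAX/MVI (bit 8), and fake_amask stands in for the real instruction:

```c
#include <stdint.h>

/* Hypothetical feature set: BWX (bit 0) and MVI (bit 8) present. */
#define IMPLEMENTED ((uint64_t)0x101)

/* Behavioral model of AMASK: clear the requested bits that correspond
   to implemented extensions; all other bits are copied through. */
uint64_t fake_amask(uint64_t requested)
{
    return requested & ~IMPLEMENTED;
}
```

A result of zero means every requested feature is present; a request of all 1's returns the complement of the implemented set.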
AMASK Code Examples.
Assembler file or "S" file
This short piece of code is an assembler file that can be made into an .obj file and linked to a "C" or "C++" program.
Notice the #include <kxAlpha.h>. This header file comes with the VC RISC edition compiler and the Microsoft SDK, and it is full of interesting information about the DIGITAL Alpha. There you can find the expansion of the LEAF_ENTRY() MACRO, which defines an entry point for the compiler and initializes the stack pointer.
- __int64 amaskValue;
- amaskValue = getAMASK( SOME_64BIT_MASK );
Inline Assembler MACRO (asm's)
This section describes how to use the function "__asm" to include assembly language code in your object file. Here it is used to insert the AMASK instruction and test bits in the mask returned, but it can be used more generally when assembly code is more suitable than C or C++. First it is necessary to convince the compiler that there is a function called "__asm." We do that with the extern long __asm( char *, … ); declaration. This is saying that __asm is some function that returns a long – on an Alpha any integer value returned from a function will be in the v0 register and the compiler knows that.
Next we declare the __asm function to be intrinsic. That means it will be implemented in native assembler word for word. This can be seen in the line below that reads #pragma intrinsic __asm
Following this there are a couple of #define's for several of the possible bit masks used to determine the availability of a desired extension. The bit we are concerned with is bit 8, defined as #define SUPPORT_FOR_MVI ((__int64)(0x0100)).
Finally the MACRO definition is provided. This MACRO uses inline assembler to invoke the "amask" instruction. The inline code alone is as follows: __asm("amask $a0,$v0", (x)), where (x) represents some 64-bit mask with the bits we are interested in set or cleared as appropriate. One could just as easily write someInt64 = __asm("amask $a0,$v0", 0x0100); and then test the variable: someInt64 == 0 indicates that MVI is supported.
Let’s take a moment to review this inline assembler code. The string "amask $a0,$v0" indicates that register a0 contains the bitmask required by the amask instruction and that register v0 will receive the result, just as if we had written amask a0,v0 in Alpha assembler. The last parameter in the __asm() "function" call is the bitmask value that will be placed in the a0 register before the amask instruction is executed. For UNIX, the "function" is asm() without the "__".
The MACRO "thisAlphaHas()" contains a logical NOT (!) so that the code reads and functions with positive logic, allowing a programmer to write:
- if ( thisAlphaHas(SUPPORT_FOR_MVI ) )
- printf( "MVI Supported \n");
- else
- printf( "MVI Not Supported \n");
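The same pattern can be sketched self-contained in C; amask_sim below is a hypothetical stand-in for the __asm intrinsic, modeling a CPU that implements MVI only:

```c
#include <stdint.h>

#define SUPPORT_FOR_MVI ((int64_t)0x0100)
#define SUPPORT_FOR_BWX ((int64_t)0x0001)

/* Hypothetical stand-in for __asm("amask $a0,$v0", x): pretend the
   CPU implements MVI only, so just bit 8 of the request is cleared. */
static int64_t amask_sim(int64_t x)
{
    return x & ~SUPPORT_FOR_MVI;
}

/* Positive logic: amask returns zero when every requested feature is
   present, so the logical NOT turns that into a yes/no answer. */
#define thisAlphaHas(x) (!amask_sim(x))
```

On real hardware, amask_sim would be replaced by the __asm (or UNIX asm) call shown above.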
- __int64 amask, implver;
- int i;
- char opt;
- char *implname[]={"21064 (EV4)", "21164 (EV5)", "21264 (EV6)"};
- #define MAXAMASK 16
- char *amaskname[MAXAMASK] = {
- "BWX (Byte word instruction extensions)",
- "FIX (Floating point instruction extensions)",
- "CIX (Count instruction extensions)",
- "unused3",
- "unused4",
- "unused5",
- "unused6",
- "unused7",
- "MVI (Motion video instruction extensions)",
- "Precise arithmetic trap reporting in hardware",
- "unused10",
- "unused11",
- "unused12",
- "unused13",
- "unused14",
- "unused15" };
- while((opt = getopt(Argc, Argv, "vqd?")) != -1) {
- switch (opt) {
- case 'v': Verbose = 1; break;
- case 'q': Quiet = 1; break;
- case 'd': Debug = 1; break;
- default:
- printf("cputype [options]\n-v Verbose\n-q Quiet\n-d Debug\n");
- exit(1);
- }
- }
- amask = ~__asm("amask $a0, $v0", -1);
- implver = __asm("implver $v0");
- if (!Quiet) {
- if (Debug)
- printf("Implver = %d\n", implver);
- printf("Implementation version: %s\n\n", implname[implver]);
- if (Debug)
- printf("Amask = 0x%x\n", amask);
- printf("Architecture mask\n");
- }
- for (i=0;i<MAXAMASK;i++) {
- if (Verbose && (strncmp(amaskname[i], "unused", 6)))
- printf(" %s: %s\n", amaskname[i], (amask & (1<<i))?"YES":"NO");
- else if (!Quiet && (amask & (1<<i)))
- printf(" %s\n", amaskname[i]);
- }
MVI Instruction set description
Alpha MVI code and x86 MMX code are used in the following examples. No judgments or comments about the relative performance of similar code are made or implied; these two architectures are very different in their respective implementations. Keep in mind that RISC architectures in general will use more assembler instructions, each representing fewer clock cycles. It is likewise important to remember that, if used improperly, these instructions can stall both chips. These examples are by no means the best way; refer to the appropriate documentation for the best information on optimizing your code and preventing stalls. These are brute-force examples of functionality only.
Unpack Instructions
UNPKBW – Unpack Bytes to Words expands the low four bytes in a quadword to four words in a quadword. This implies two unpacks to re-acquire all eight pixels.
Packed Byte: 8 bytes packed into 64 bits
Alpha REGISTER r1 at start
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Start with the data shown in this table: a 64-bit quadword holding eight pixels represented as byte values. Assume this data is in 64-bit register r1.
Alpha MVI code
x86 MMX code
- unpkbw r1, r5 // unpack the low four bytes into reg r5
- srl r1, 32, r1 // shift reg r1 right 32 bits, result in r1
- unpkbw r1, r6 // unpack the high four bytes into reg r6
- pxor mm6, mm6 // set mm6 = 0
- movq mm2, mm1 // mm2 = mm1
- punpcklbw mm1, mm6 // unpack low four bytes into mm1
- punpckhbw mm2, mm6 // unpack high four bytes into mm2
Alpha REGISTER r5 after unpkbw r1,r5
Bit63 Bit0
Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Alpha REGISTER r1 after srl r1, 32, r1
Bit63 Bit0
0 | 0 | 0 | 0 | Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
Alpha REGISTER r6 after unpkbw r1,r6
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
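The trace above can be reproduced with a small behavioral model of UNPKBW in C (again, a sketch of the semantics only):

```c
#include <stdint.h>

/* Behavioral model of UNPKBW: zero-extend the low four bytes of a
   quadword into four 16-bit words. */
uint64_t unpkbw(uint64_t ra)
{
    uint64_t rc = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t byte = (ra >> (8 * i)) & 0xFF;
        rc |= byte << (16 * i);
    }
    return rc;
}
```

Unpacking all eight bytes takes two of these plus the 32-bit shift, exactly as in the three-instruction sequence above.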
UNPKBL – Unpack Bytes to Longwords expands the low two bytes in a quadword to two longwords in a quadword. This implies four unpacks to re-acquire all eight pixels. While this form is needed much less often, here is how it looks.
Packed Byte: 8 bytes packed into 64 bits
Alpha REGISTER r1 at start
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Start with the data shown in this table: a 64-bit quadword holding eight pixels represented as byte values. Assume this data is in 64-bit register r1.
Alpha MVI code
x86 MMX code
- unpkbl r1, r5 // unpack the two low bytes into register r5
- srl r1, 16, r1 // shift register r1 right 16 bits
- unpkbl r1, r6 // unpack the next two bytes into register r6
- srl r1, 16, r1 // shift register r1 right 16 bits
- unpkbl r1, r7 // unpack the next two bytes into register r7
- srl r1, 16, r1 // shift register r1 right 16 bits
- unpkbl r1, r8 // unpack the next two bytes into register r8
- pxor mm6, mm6 // clear mm6
- movq mm2, mm1 // mm2 = mm1
- punpcklbw mm1, mm6 // unpack low four bytes into mm1
- punpckhbw mm2, mm6 // unpack high four bytes into mm2
- movq mm3, mm1 // mm3 = mm1
- punpcklwd mm1, mm6 // unpack low two words into mm1
- punpckhwd mm3, mm6 // unpack high two words into mm3
- movq mm4, mm2 // mm4 = mm2
- punpcklwd mm2, mm6 // unpack low two words into mm2
- punpckhwd mm4, mm6 // unpack high two words into mm4
Alpha REGISTER r5
Bit63 Bit0
Pixel 1 | Pixel 0 |
Alpha REGISTER r6
Bit63 Bit0
Pixel 3 | Pixel 2 |
Alpha REGISTER r7
Bit63 Bit0
Pixel 5 | Pixel 4 |
Alpha REGISTER r8
Bit63 Bit0
Pixel 7 | Pixel 6 |
Pack Instructions
PACKWB – Truncates the four (4) component words of the input register and writes them to the low four (4) bytes of the output register.
Alpha REGISTER r5 at start of packwb
Bit63 Bit0
Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Alpha REGISTER r6 at start of packwb
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
Alpha MVI code
SPECIAL NOTE: you cannot simply pack both halves into the same destination, because the upper four (4) bytes of the destination register are written with zeros by packwb; a second packwb would write zeros over the first four packed pixels. Instead, pack each register separately, shift, and merge:
- packwb r5, r5 // trunc and pack 4 words into 4 bytes
- packwb r6, r6 // trunc and pack 4 words into 4 bytes
- sll r6, 32, r6 // shift pixels 7-4 into place
- bis r5, r6, v0 // logical OR the two halves into v0
Alpha REGISTER r5 after packwb r5,r5
Bit63 Bit0
0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Alpha REGISTER r6 after packwb r6,r6
Bit63 Bit0
0 | 0 | 0 | 0 | Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
Alpha REGISTER r6 after sll r6, 32, r6
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | 0 | 0 | 0 | 0 |
Alpha REGISTER v0 after bis r5,r6,v0
Bit63 Bit0
Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
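A behavioral model of packwb in C makes the truncate-and-pack step explicit (a sketch of the semantics, not the hardware):

```c
#include <stdint.h>

/* Behavioral model of packwb: truncate each of the four 16-bit words
   of a quadword to a byte and pack them into the low four bytes;
   the upper four bytes of the result are written with zeros. */
uint64_t packwb(uint64_t ra)
{
    uint64_t rc = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t b = (ra >> (16 * i)) & 0xFF;  /* word truncated to byte */
        rc |= b << (8 * i);
    }
    return rc;
}
```

Because the upper half of the result is zeroed, rebuilding eight bytes takes two packs, a shift, and an OR, as the SPECIAL NOTE above explains.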
PACKLB – Truncates the two (2) component longwords in the input register to byte values and writes them to the low two (2) bytes of the output register.
Alpha REGISTER r5 at start
Bit63 Bit0
Pixel 1 | Pixel 0 |
Alpha REGISTER r6 at start
Bit63 Bit0
Pixel 3 | Pixel 2 |
Alpha REGISTER r7 at start
Bit63 Bit0
Pixel 5 | Pixel 4 |
Alpha REGISTER r8 at start
Bit63 Bit0
Pixel 7 | Pixel 6 |
Alpha MVI code
- packlb r5, r5 // trunc and pack 2 dwords into 2 bytes
- packlb r6, r6 // trunc and pack 2 dwords into 2 bytes
- sll r6, 16, r6 // shift pixels 3 and 2 into place
- bis r5, r6, r5 // logical OR pixels 3 and 2 into r5
- packlb r7, r7 // trunc and pack 2 dwords into 2 bytes
- sll r7, 32, r7 // shift pixels 5 and 4 into place
- bis r5, r7, r5 // logical OR pixels 5 and 4 into r5
- packlb r8, r8 // trunc and pack 2 dwords into 2 bytes
- sll r8, 48, r8 // shift pixels 7 and 6 into place
- bis r5, r8, r5 // logical OR pixels 7 and 6 into r5
Alpha REGISTER r5 after packlb r5,r5
Bit63 Bit0
0 | 0 | 0 | 0 | 0 | 0 | Pixel 1 | Pixel 0 |
Alpha REGISTER r6 after packlb r6,r6
Bit63 Bit0
0 | 0 | 0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 |
Alpha REGISTER r6 after sll r6,16,r6
Bit63 Bit0
0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 | 0 | 0 |
Alpha REGISTER r5 after bis r5,r6,r5
Bit63 Bit0
0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Registers r7 and r8 just repeat this sequence, shifted into place by the correct number of bits.
Byte and Word Minimum and Maximum (MINxxx) (MAXxxx)
MINUB8 Vector Unsigned Byte Minimum
MINUW4 Vector Unsigned Word Minimum
MINSB8 Vector Signed Byte Minimum
MINSW4 Vector Signed Word Minimum
MAXUB8 Vector Unsigned Byte Maximum
MAXUW4 Vector Unsigned Word Maximum
MAXSB8 Vector Signed Byte Maximum
MAXSW4 Vector Signed Word Maximum
These instructions take the form MINxxx Ra,Rb,Rc or MAXxxx Ra,Rb,Rc, where the values in register Ra are compared to the values in register Rb and the result is placed in register Rc. These are vector-wise comparisons: each byte or each word is compared when using the xxxxB8 or xxxxW4 instructions, respectively.
Alpha MVI code
- minub8 r5, r6, r6 // get the minimums in each BYTE position
Alpha REGISTER r5 at start
1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
Alpha REGISTER r6 at start
0 | 1 | 2 | 2 | 0 | 0 | 1 | 1 |
Alpha REGISTER r6 after calling minub8 r5,r6,r6
0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
Alpha MVI code
- minuw4 r5, r6, r6 // get the minimums in each WORD position
Alpha REGISTER r5 at start
0x0000 | 0x00FF | 0x0000 | 0x0001 |
Alpha REGISTER r6 at start
0x0000 | 0x0001 | 0x0000 | 0x00F3 |
Alpha REGISTER r6 after call to minuw4 r5,r6,r6
0x0000 | 0x0001 | 0x0000 | 0x0001 |
PERR – Pixel Error
This instruction takes the eight bytes packed into two quadword registers and computes the absolute differences between them, then adds the eight intermediate results and right-aligns the sum in the destination register. The net result is that motion estimation calculations on eight pixels can be done in a single clock tick on an MVI-capable Alpha.
Paul Rubinfeld, Bob Rose and Michael McCallig present several very good examples of applications that use PERR in their paper entitled "Motion Video Extensions for Alpha." Look in Helpful URL’s for more information.
Alpha MVI code
- perr r5,r6,v0
Alpha REGISTER r5 at start
1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
Alpha REGISTER r6 at start
0 | 1 | 2 | 2 | 0 | 0 | 1 | 1 |
Intermediate Absolute Differences
1 | 1 | 1 | 2 | 1 | 0 | 0 | 1 |
Sum of Absolute Differences placed in Alpha REGISTER v0 (64-bit)
0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
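PERR can be modeled in C as follows (a behavioral sketch; the hardware does all of this in one instruction):

```c
#include <stdint.h>

/* Behavioral model of PERR: sum of the absolute differences of the
   eight byte pairs in two quadwords, right-aligned in the result. */
uint64_t perr(uint64_t ra, uint64_t rb)
{
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++) {
        int a = (int)((ra >> (8 * i)) & 0xFF);
        int b = (int)((rb >> (8 * i)) & 0xFF);
        sum += (uint64_t)(a > b ? a - b : b - a);
    }
    return sum;
}
```

Feeding it the register values from the trace above gives the same sum of absolute differences, 7.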
How MVI instructions are used in software
In this section, we give some guidelines and examples for using the instructions above. The first issue is whether you are writing code for just one type of machine or not. The section on Detecting MVI was useful for determining at runtime whether the current Alpha machine has MVI. You might be writing the same code for multiple architectures, so you may need to test first to determine whether the current machine is an Alpha at all. The following section on portable coding explains that. The other subsections describe how to do a bytewise add with saturation as well as a simple convolution filter.
Portable Coding Techniques
- //+
- // setup the function pointer by testing for MVI support
- //-
- #if defined(_M_ALPHA) // building for Alpha (the compiler's target macro)
- if ( thisProcessorHas( SUPPORT_FOR_MVI ) )
- //+
- // we can use MVI
- // point the function pointer at the MVI function
- //-
- videoProcessingFunctionPointer = functionUsingMVI;
- else
- //+
- // we can not use MVI
- // point at the NON MVI version of the function
- //-
- videoProcessingFunctionPointer = functionWithoutMVI;
- #else
- //+
- // we are building for x86 MMX; point at that code
- //-
- videoProcessingFunctionPointer = FunctionUsingMMX;
- #endif
Unsigned Saturating Arithmetic
The current versions of MVI do not have explicit byte- or word-partitioned arithmetic operations, but it is possible to produce the same results with multiple instructions. First we explain why one would want a saturating add. Suppose we want to blend two pixels by adding their red, green, and blue values together and dividing by two (a right shift of one bit). If a simple add is applied to the 16-bit values shown below and the sum is truncated to 16 bits, the answer is actually less than both of the addends, and the shifted value will not be the average; the rest of the answer is in the carry out of the top bit, in this case the 17th bit. Applying this to the green component of a pixel's color value (assuming that 16 bits are used to represent each color), blending these two green pixels together would result in a pixel that is far less green than it should be. Additionally, since pixel intensities or colors can only range from all of a color (0xFFFF) to none of a color (0x0000), the 17th bit is meaningless to us in terms of "greenness."
  1111 0000 0000 0000 (0xF000)
+ 0011 0000 0000 0000 (0x3000)
Result truncated to 16 bits: 0010 0000 0000 0000 (0x2000)
Right-shifted value: 0x1000 (not the result we desire)
Unsigned saturating arithmetic causes values that would overflow to be "clamped" to the maximum possible value (0xFFFF) and values that would underflow to be "clamped" to the minimum (0x0000). A saturating add of these two green pixels overflows 16 bits and therefore gets "clamped" to a maximum value of 0xFFFF which is a much better representation of the color expected after blending two medium green pixels.
  0xF000
+ 0x3000
result "clamped" to maximum: 0xFFFF
right shift with rounding: 0x8000 (much better)
In practice, a better solution might be to first do the division by two and then add the results, in which case overflow is not a real problem. However, the conversion from YUV video representation to RGB representation is a problem that requires saturated arithmetic, where the data are all unsigned bytes. In its common (ITU-R BT.601) form, the conversion is given by
R = Y + 1.402 (Cr - 128)
G = Y - 0.344 (Cb - 128) - 0.714 (Cr - 128)
B = Y + 1.772 (Cb - 128)
where the results must be clamped to the range 0-255.
Alpha MVI "unsigned saturating add for packed words"
Alpha REGISTER r5 at start
Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
0x0000 | 0xFFFF | 0x0000 | 0x0001 |
Alpha REGISTER r6 at start
0x0000 | 0x0001 | 0x0000 | 0xFFFF |
What happens?
First we take the 1's complement of register r6. The "eqv" instruction does a more general operation than implied by the comment: "eqv" alone is Rc <- Ra XOR (NOT Rb), so when Rb is zero, (NOT Rb) is all 1's, and XOR with all 1's flips every bit, producing the 1's complement. Why do we want the 1's complement of r6? First of all, it would work with either r5 or r6, as long as we call minuw4 using whichever register we complemented. The 1's complement is the largest number we can add to the original number without overflowing. Knowing that, we simply pick the smaller of this pixel's 1's complement and the corresponding pixel's original value and do the addition. Since we picked the smaller of a pair that contains the largest value that would not overflow, we are guaranteed not to overflow.
Alpha REGISTER t0 after call to eqv r6,zero,t0
0xFFFF | 0xFFFE | 0xFFFF | 0x0000 |
Alpha REGISTER r5 after call to minuw4 r5,t0,r5
0x0000 | 0xFFFE | 0x0000 | 0x0000 |
Alpha REGISTER v0 after call to addq r5,r6,v0
0x0000 | 0xFFFF | 0x0000 | 0xFFFF |
Alpha MVI "unsigned saturating subtract of packed words"
Alpha REGISTER r5 at start
Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
0x0000 | 0x00FF | 0x0000 | 0x0001 |
Alpha REGISTER r6 at start
0x0000 | 0x0001 | 0x0000 | 0x00F3 |
Alpha REGISTER r6 after call to minuw4 r5,r6,r6
0x0000 | 0x0001 | 0x0000 | 0x0001 |
What happens?
When we started, Pixel 0 (the low word) in r5 = 0x0001 and in r6 = 0x00F3; subtracting r6 from r5 would produce a value that exceeds the limits of an unsigned 16-bit word, i.e. something negative (-242). Recall the behavior of "clamping" to the minimum value: that is the effect achieved when the per-word minimums are placed in the r6 register and then subtracted from the original r5 values. If the minimum pixel value was already in r6, as is the case with Pixel 2, then the simple unsigned subtract yields a value that is less than the original but no less than the minimum. If, however, the minimum value was in r5, then that value is moved into r6 by the minuw4 instruction and ultimately subtracted from itself, resulting in zero, "clamped" to the minimum unsigned value.
Alpha REGISTER v0 after call to subq r5,r6,v0
Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
0x0000 | 0x00FE | 0x0000 | 0x0000 |
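Both tricks can be captured in a short behavioral model in C, built on a word-wise minimum. This is a sketch of the eqv/minuw4/addq and minuw4/subq sequences above; the function names are ours:

```c
#include <stdint.h>

/* Behavioral model of MINUW4: word-wise unsigned minimum. */
static uint64_t minuw4(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t wa = (uint16_t)(a >> (16 * i));
        uint16_t wb = (uint16_t)(b >> (16 * i));
        r |= (uint64_t)(wa < wb ? wa : wb) << (16 * i);
    }
    return r;
}

/* Saturating add: min(r5, ~r6) is the largest value that can be added
   to each word of r6 without overflow, so a plain 64-bit add never
   carries from one word into the next. */
uint64_t satadd_uw4(uint64_t r5, uint64_t r6)
{
    return minuw4(r5, ~r6) + r6;   /* eqv, minuw4, addq */
}

/* Saturating subtract: min(r5, r6) never exceeds r5, so a plain
   64-bit subtract never borrows across word boundaries. */
uint64_t satsub_uw4(uint64_t r5, uint64_t r6)
{
    return r5 - minuw4(r5, r6);    /* minuw4, subq */
}
```

The register values from the two walkthroughs above reproduce the clamped results shown in the tables.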
Simple Pixel Filtering Example
Consider a two-dimensional array, or bitmap, of 32-bit pixel values arranged as RGBA, where the Red, Green, Blue, and Alpha values are represented in the four 8-bit components of a 32-bit unsigned long. A simple convolution filter will replace each pixel by the weighted sum of the values of the pixels surrounding it. This is done independently for each of R, G, B, and A. A typical filter algorithm might employ a loop as in the following "C" language example. The algorithm has high memory bandwidth, operates on byte/word integer data, and performs the same operations on all the data: these are the qualifiers for using MVI.
Loop on row
Loop on column
- Red = 0; // clear accumulators
- Green = 0;
- Blue = 0;
- Alpha = 0;
- // loop on some filter length and pull out the RGBA components
- for ( x = 0; x < filterLength; x++ ) {
- temp = ((inputArray[ row ][ col ] >> 24 ) & 0xFF);
- Red += temp * filterValue[x];
- temp = ((inputArray[ row ][ col ] >> 16 ) & 0xFF);
- Green += temp * filterValue[x];
- temp = ((inputArray[ row ][ col ] >> 8 ) & 0xFF);
- Blue += temp * filterValue[x];
- temp = ((inputArray[ row ][ col ] ) & 0xFF);
- Alpha += temp * filterValue[x];
- }
Alpha MVI code stub (the result is left in a 16-bit value, which would need to be shifted and packed into 8 bits again):
- //+
- // work on 2 pixels at a time ( 64bits)
- //-
- addl a1, t8, t11 // t11=addr of data t8=offs to current pixels (2)
- ldq t4, 0(t11) // get 2 pixels from array
- unpkbw t4, t0 // unpack low four bytes RGBA
- srl t4, 32, t4 // shift the high four bytes
- unpkbw t4, t1 // unpack the high four bytes RGBA
- //+
- // t5 holds the filter value
- //-
- mulq t5, t0, t0 // multiply RGBA of pixel 1 by filter value
- mulq t5, t1, t1 // multiply RGBA of pixel 2 by filter value
- //+
- // unsigned saturating adds for RGBA accumulators
- // r6 pixel 1 RGBA accumulator
- //-
- eqv t0, zero, r6 // 1's complement of t0
- minuw4 t0, r6, t0 // get the smaller values
- addq r6, t0, r6 // accumulate RGBA’s pixel 1
- //+
- // r5 pixel 2 RGBA accumulator
- //-
- eqv t1, zero, r5 // 1's complement of t1
- minuw4 t1, r5, t1 // get the smaller values
- addq r5, t1, r5 // accumulate RGBA’s pixel 2
A Side-by-Side Example of Alpha MVI and x86 MMX
This example is a loop used to blend pixel values.
First – A "C" code example
- unsigned char *frontImage;
- unsigned char *backImage;
- unsigned char *output;
- long ImgSizeX;
- long SizeY;
- long y;
- unsigned long pixelInLine;
- unsigned long x;
- unsigned short usFront;
- unsigned short usBack;
- unsigned short usTemp;
- ImgSizeX = 720;
- ImgSizeX >>= 2; // the loop below processes pixels in groups, as in the MMX version
- SizeY = 486;
- frontImage = pInputB;
- backImage = pInputA;
- output = pOutput;
- y = 0;
- do {
- pixelInLine = ImgSizeX; // pixel groups remaining in this line
- do {
- //+
- // replace MVI code with loop
- //-
- for ( x=0;x<8;x++ ) {
- usFront = ((unsigned short)(frontImage[x] & 0x00ff));
- usBack = ((unsigned short)(backImage[x] & 0x00ff));
- usFront -= usBack;
- usFront *= s_ubAlpha;
- usFront += 0x0080;
- usTemp = usFront;
- usTemp >>= 8;
- usFront += usTemp;
- usFront >>= 8;
- usFront += usBack;
- output[x] = (unsigned char)( usFront & 0x00ff );
- }
- pixelInLine--;
- //+
- // move the pointers up to the new offset
- //-
- frontImage += 8;
- backImage += 8;
- output += 8;
- } while ( pixelInLine > 0 );
- y++;
- } while ( y < SizeY );
- s_ubAlpha--;
Next – x86 MMX Code as inline assembler in a "C" file.
- LONG ImgSizeX = 720;
- LONG SizeY = 486;
- LONG y;
- static __int64 ROUND = 0x0080008000800080;
- static __int64 mmAlphaValue = 0x00FF00FF00FF00FF;
- // mmAlphaValue should hold A | A | A | A
- __asm
- {
- pxor mm6, mm6 // Clear mm6...
- movq mm5, mmAlphaValue // mm5 = A | A | A | A
- movq mm7, ROUND
- }
- for ( y = 0; y < SizeY; y++) {
- __asm {
- mov esi, pInputB // esi is the front image
- mov edi, pInputA // edi is the back image
- mov edx, pOutput // edx is the output image
- xor ebx, ebx // ebx = 0, pixel offset
- mov ecx, ImgSizeX // ecx = for counter for pixel in line
- shr ecx, 2 // Processing 4 pixels at a time
- jz finisha // Skip if no pixels to process
loopa:
- movq mm0, [esi + ebx] // mm0 = Front[0:7]
- movq mm2, [edi + ebx] // mm2 = Back [0:7]
- movq mm1, mm0 // mm1 = mm0
- movq mm3, mm2
- punpcklbw mm0, mm6 // mm0 = Front[0:3]
- punpckhbw mm1, mm6 // mm1 = Front[4:7]
- punpcklbw mm2, mm6 // mm2 = Back [0:3]
- punpckhbw mm3, mm6 // mm3 = Back [4:7]
- psubw mm0, mm2 // mm0 = Front - Back [0:3]
- psubw mm1, mm3 // mm1 = Front - Back [4:7]
- pmullw mm0, mm5 // mm0 = FB0 * Alpha
- pmullw mm1, mm5 // mm1 = FB1 * Alpha
- paddw mm0, mm7 // mm0 = FB0 * Alpha + ROUND = C0
- paddw mm1, mm7 // mm1 = FB1 * Alpha + ROUND = C1
- movq mm2, mm0 // mm2 = C0
- movq mm3, mm1 // mm3 = C1
- psrlw mm0, 8 // mm0 = C0 >> 8
- psrlw mm1, 8 // mm1 = C1 >> 8
- paddw mm0, mm2 // mm0 = C0 + (C0 >> 8)
- paddw mm1, mm3 // mm1 = C1 + (C1 >> 8)
- psrlw mm0, 8 // mm0 = Result Pixel 0
- psrlw mm1, 8 // mm1 = Result Pixel 1
- packuswb mm0, mm1 // mm0 = Result [0:7]
- paddb mm0, [edi + ebx] // Add the back (cached)
- movq [edx + ebx], mm0 // Store the result
- add ebx, 8 // Go to next pixel group
- dec ecx // decrement counter
- jg loopa
finisha:
- add esi, ebx // Increment the pointers
- add edi, ebx
- add edx, ebx
- add eax, ebx
- mov pOutput, edx // Store back the pointers
- mov pInputB, esi
- mov pInputA, edi
}
}
- // code adapted from a performance test example provided by Richard Fuoco
- __asm emms // Clear the MMX Status
Now - Alpha MVI Code as an Alpha Assembler implementation
- mov 486, t10 // top of for ( y=0; y<486; y++ )
- mov _ROUND_, t6
loopy:
- // we are just doing it as a do loop
- // and counting down from 486 to 0
- mov 720, t9 // pixels in line is 720
- srl t9, 2, t9 // processing 4 pixels at a time
- ble t9, finisha // test to see if we have some pixels to
- // process; I know we will, it's hard coded
- // in this example to (720 >> 2) or (180)
loopa:
- ldl t4, 0(a0) // front
- ldl t5, 0(a1) // back
- unpkbw t4, t0 // unpack four bytes
- unpkbw t5, t2 // unpack four bytes
- // (the blend arithmetic is omitted in this stub)
- lda a2, 4(a2)
- lda t9, -1(t9) // count down the pixel groups
- bgt t9, loopa
- lda t10, -1(t10) // bottom of for ( y=0; y<486; y++ )
- bgt t10, loopy // in this case it is a while loop
Additional techniques to optimize code
To get maximum performance, it is necessary to take advantage of parallelism and to optimize cache performance. Of course, one of the basic features of the Alpha processors is instruction level parallelism (ILP) or superscalar implementation. This means that multiple instructions can be issued in a single cycle. In addition, there is SIMD parallelism that is supported by MVI in the form of byte- and word-segmented instructions. This feature can be used in two ways: the CPU can do more operations per cycle or it can do the same number of operations with fewer registers. A third type of parallelism is software pipelining, which makes better use of the superscalar design.
One of the problems with using ILP is filling all of the issue slots with instructions that are ready and can be co-issued. It is often the case that different iterations of a loop will be independent or weakly dependent (the beginning of an iteration only depends on the beginning of an earlier iteration). The idea of software pipelining is to have more than one iteration of a loop executing at the same time. The iterations are initiated at constant intervals of cycles. For example, suppose the body of a loop can be decomposed into two parts A and B. The i-th iteration of the loop would be A(i)|B(i). Take the simplest case, where different iterations of the loop are independent. If we rearrange the loop so that in one iteration we do B(i-1)|A(i), then we can further move code from B(i-1) so that it is executing in parallel with A(i). This has the potential for accommodating latency in both of these pieces. Starting and stopping the loop is handled by a prologue, which does A(0), and an epilogue, which does B(N). An example of software pipelining can be found in the filtering code.
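As a sketch in C (the function is hypothetical; compilers and hand schedulers apply the same idea at the instruction level), consider a loop whose body splits into A(i), a load-and-scale step, and B(i), a store. The loop can be rotated so that B(i-1) executes alongside A(i):

```c
#include <stddef.h>

/* Software-pipelined copy-and-scale: the prologue performs A(0), each
   loop iteration runs A(i) and B(i-1) together, and the epilogue
   performs the final store B(N-1). */
void scale_pipelined(const int *src, int *dst, size_t n, int k)
{
    if (n == 0)
        return;
    int pending = src[0] * k;          /* prologue: A(0) */
    for (size_t i = 1; i < n; i++) {
        int next = src[i] * k;         /* A(i): load and scale */
        dst[i - 1] = pending;          /* B(i-1): store previous result */
        pending = next;
    }
    dst[n - 1] = pending;              /* epilogue: B(N-1) */
}
```

The store of iteration i-1 no longer has to wait for the load of iteration i, which gives the scheduler latitude to hide load latency.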
It is always important to know where your data is coming from: for example, it could be in the first-level cache, the board-level cache, or memory. The single most important technique for improving memory performance is prefetching data. Loading data from memory can take 50-150 times longer than loading it from the first-level cache. This means that a program that accesses a lot of data can spend most of its time waiting for the data to be loaded, so that speeding up the numerical operations will have no effect.
There is another way in which accurate prefetching improves program performance. When a load is issued, one of the first things that happens is that a physical register is allocated to hold the resulting data. If the data is not in the D-cache (first level cache), then this register cannot be used for any other instruction until that value is filled. In some cases, the machine will stall on other instructions because there are no physical registers available. Prefetches do not use up any physical registers. In addition, loads or prefetches that miss in the D-cache require another resource called the missed address file (MAF). There are only eight of these in the 21264 and six in the 21164. If you issue more than eight, the extra ones will wait. This can add to the latency problem.
The following is a detailed description of how to use prefetching in your code. The prefetch instructions which will bring a cache block into the first-level cache are loads to either R31 or F31. The registers R31 for integer instructions and F31 for floating point instructions always contain the value zero regardless of what the program does. A load to one of these registers will cause the data to be fetched from memory or a second- or third-level cache if it is not in the first-level cache. The value is not actually put in a register, but it will be put in the first-level cache if it wasn’t there already. This is particularly important for the 21264. A prefetch looks like an ordinary load except the target register is either r31 or f31. There are four types of prefetch for the 21264. They are:
Instruction type    Description
ldl                 data will be evicted last
ldq                 data will be evicted next
lds                 data will be modified, so mark it dirty
wh64                the entire cache block will be modified, so don't fetch it from memory
There is also a general rule about how far ahead to prefetch. If you have N streams of data that are being loaded or stored, then the number of cache blocks that you can prefetch in advance of a load for each stream is Min(4, 8/N). This roughly corresponds to the limitation of eight MAF's. For example, if there are two streams of data, then you should prefetch 4 blocks ahead, or 256 bytes. Thus, if your code has a load:
ldq r2, 80(r1)
then you would add 80 to 256:
ldl r31, 336(r1)
Generally, the first three types of prefetch in the table above will not get you into any trouble even if you issue more than you really need, and they will not cause an exception if the address is not valid. However, wh64 can have a bad side effect if you run off the end of your data, because it may destroy the data at the address you give it if the block is not in the L1 cache. The way to put prefetches into routines in "C" or FORTRAN is with asm's. For example, if you have a pointer which is the appropriate number of blocks ahead of the values you are fetching, then you can pass it to the assembly macro:
__asm("ldl $r31, %0", ptr);
There are some implementation differences between the 21164 and the 21264. For the 21164, the only difference between the ldl and ldt prefetches is that "ldl r31" will always load the value into the first-level cache, but "ldt f31" will load into the second-level cache, and will load into the first-level cache only if it does not interfere with other loads. This distinction is present in the 21164 but not the 21264. The lds is a prefetch with intent to modify, so it should be used if the program is going to read, modify, and store a new value at the same address. Ldq is a prefetch with the "evict next" property. The first-level cache of the 21264 is two-way associative, so there are two blocks associated with each index (cache line). The one that is normally evicted is the older of the two. This can be overridden with the "evict next" property, which is used when the data will fit in the second-level cache but not in the first, so that other useful data will remain in the first-level cache. The compiler will often insert these instructions into your code; however, it is always a good idea to check this by searching the disassembly for them. Here is an example which illustrates the use of prefetching.
Consider the following loop, which multiplies each element of one array by a constant (held in f0) and adds the corresponding element of a second array:
loop:
    ldt f1, 0(a0)
    ldt f2, 0(a1)
    lda a0, 8(a0)
    lda a1, 8(a1)
    mult f1, f0, f1
    addt f1, f2, f1
    stt f1, -8(a0)      # store to the element loaded above (a0 already advanced)
    lda a2, -1(a2)
    bgt a2, loop
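In C terms, the loop above computes roughly the following (names are illustrative; a0 and a1 hold the array pointers, f0 the constant, and a2 the count):

```c
/* C equivalent of the assembly loop: a[i] = a[i]*c + b[i], in place. */
void scale_accum(double *a, const double *b, double c, long n)
{
    for (long i = 0; i < n; i++)
        a[i] = a[i] * c + b[i];
}
```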
The loop can be improved by unrolling it four times and inserting prefetches for each of the arrays.
loop:
    ldt f1, 0(a0)
    ldt f2, 8(a0)
    ldt f3, 16(a0)
    ldt f4, 24(a0)
    ldt f5, 0(a1)
    ldt f6, 8(a1)
    ldt f7, 16(a1)
    ldt f8, 24(a1)
    mult f1, f0, f1
    mult f2, f0, f2
    mult f3, f0, f3
    mult f4, f0, f4
    addt f1, f5, f1
    addt f2, f6, f2
    addt f3, f7, f3
    addt f4, f8, f4
    stt f1, 0(a0)
    stt f2, 8(a0)
    stt f3, 16(a0)
    stt f4, 24(a0)
    ldl r31, 256(a1)    # prefetch
    lds f31, 256(a0)    # prefetch
    lda a2, -4(a2)
    lda a0, 32(a0)
    lda a1, 32(a1)
    bgt a2, loop
This program illustrates the use of prefetches, but the scheduling of instructions is far from optimal. Note that the prefetch is four cache blocks ahead of the actual loads.
Cache Locality
Programs that take advantage of cache locality make more efficient use of the cache hierarchy. There are two types of cache locality: temporal and spatial. Temporal locality is an issue if the same data is accessed more than once. Loads and stores are often the source of large delays when the data is not in the cache, so the fewer of them the better; and the greater the temporal locality the better, since data that is in the cache at one instant is likely to still be there a little later. Ideally, when performing multiple operations on data, it is best to get one block at a time and perform all of the operations on it, as opposed to accessing the data multiple times and performing a few operations each time. While the former is often more efficient in terms of use of the machine, it is generally more difficult to code, and the code is harder to read and debug. Nevertheless, temporal locality may make a big difference for the sections of the code where most of the time is spent.
Spatial locality is also related to the order of accessing data. The order in which you access data should be compatible with the order in which it is stored. The worst case occurs when two blocks are accessed sequentially so that data in the second block is not in the cache when the first block is, and when it is brought into the cache it replaces the first block before its processing is finished.
An example which illustrates temporal locality issues is the following:
for (I = 0; I < 32000; I++) {
    b[I] = f(a[I], c1, c2, c3);  // later values of b[I] will
                                 // evict earlier ones from the cache
    ...
}
for (I = 0; I < 32000; I++) {
    c[I] = g(b[I], d1, d2, d3);  // b[I] is no longer cached
    ...
}
The problem in this example is that a long time passes between when each b[I] is produced in the first loop and when it is used in the second. Suppose b[I] is a floating-point value, which occupies eight bytes, and assume that the cache holds 64K bytes. When the first loop exits, it will have stored 256K bytes of data, of which only the last 64K will still be in the cache. When the second loop starts, none of the data that it loads first is in the cache. In addition, the first 64K bytes that are loaded will evict the last 64K bytes that were left by the first loop, so none of the data will be coming from the cache. One way to solve this problem is loop fusion, which combines the two loops:
for (I = 0; I < 32000; I++) {
    b[I] = f(a[I], c1, c2, c3);  // b[I] is stored into the cache
    ...
    c[I] = g(b[I], d1, d2, d3);  // b[I] is loaded from the cache
    ...
}
However, this solution may cause another problem. If the functions f and g require a lot of code, then the I-cache may not be large enough to hold both of them, so each loop iteration would require fetching them into the I-cache. It is also possible that there would not be enough registers to hold the constants and other intermediate results required for both function calls. In this case, these values would have to be stored. A better solution to the problem is cache blocking, which means that one block of data is processed at a time. An example of that is:
for (I = 0; I < 32000/64; I++) {
    for (J = I*64; J < I*64 + 64; J++) {
        b[J] = f(a[J], c1, c2, c3);  // these b[J]'s stay in the cache
        ...
    }
    for (J = I*64; J < I*64 + 64; J++) {
        c[J] = g(b[J], d1, d2, d3);
        ...
    }
}
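A self-contained version of the blocking pattern can be written as follows. The functions f and g below stand in for arbitrary per-element work, and the block size of 64 elements is illustrative; in practice the block is sized so the working set fits in the first-level cache.

```c
#include <stddef.h>

#define N   32000
#define BLK 64          /* elements per block; illustrative size */

/* Stand-ins for the per-element work in the example above. */
static double f(double x) { return 2.0 * x + 1.0; }
static double g(double x) { return x - 1.0; }

void process_blocked(const double *a, double *b, double *c)
{
    for (size_t i = 0; i < N; i += BLK) {
        for (size_t j = i; j < i + BLK; j++)
            b[j] = f(a[j]);          /* b[j] written, still cached ... */
        for (size_t j = i; j < i + BLK; j++)
            c[j] = g(b[j]);          /* ... when it is read back here */
    }
}
```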
Most caches are organized in 32 or 64 byte blocks. When a byte of data is requested from a cache farther down the hierarchy or memory, a block of data is fetched which contains that byte. If another byte is requested and it is in the same block as the first one, then it is already in the cache and doesn’t have to be fetched from farther away. This is spatial locality. It is most important to know the organization of the cache in order to take best advantage of its speed of access.
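The access-order point can be made concrete in C. For a row-major two-dimensional array, keeping the column index in the inner loop walks consecutive addresses, so each fetched cache block serves several successive elements; the dimensions below are illustrative.

```c
#define ROWS 256
#define COLS 256

/* Row-major traversal: the inner loop touches consecutive addresses,
   so one 64-byte cache block serves eight successive doubles.
   Swapping the two loops would stride by COLS*8 bytes per access
   and touch a new cache block almost every time. */
double sum_row_major(double m[ROWS][COLS])
{
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}
```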
Alignment of Data
In this section, we give an example of how to align a block of data and how to load unaligned data without trapping. Alignment of data is a key issue when working with byte or word data. The most efficient way to fetch data is eight bytes at a time with an ldq. However, unless the data are aligned on a quadword boundary, the ldq will trap, and the trap causes a significant slowdown. The following code excerpt aligns your data on a quadword boundary (the address is divisible by eight), so that ldq can be used without trapping:
char *base = (char *) malloc(n + 63);
char *p = base;
while (((unsigned long) p & 63) != 0) p++;
/* keep base so that the buffer can be freed later */
Note that the above code aligns the data so that the address is divisible by 64. Thus, it is aligned with cache block boundaries as well.
Even in the basic ALPHA instruction set, it is possible to process data in parallel (more than one byte or word at a time). The MVI instructions expand these capabilities.
If the data are inherently misaligned (e.g. they are in blocks whose size is not divisible by eight), they can still be fetched in groups of eight bytes or four words. The following code is taken from the ALPHA architecture manual:
Note: this assumes that the effective address of R11 is such that its remainder mod 8 is 5. The value of the aligned quadword containing (R11) is CBAx xxxx, and the value of the aligned quadword containing 7(R11) is yyyH GFED.
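The extract-and-merge sequence the manual describes (ldq_u, extql, extqh, bis, which also appears in the motion-estimation code later in this section) can be sketched in portable C. The function and argument names below are illustrative.

```c
#include <stdint.h>

/* Portable sketch of the Alpha unaligned-load idiom
   (ldq_u / extql / extqh / bis).  low is the aligned quadword
   containing byte 0 of the unaligned address, high is the aligned
   quadword containing byte 7, and off = address mod 8 (little-endian
   byte order, as on Alpha). */
uint64_t merge_unaligned(uint64_t low, uint64_t high, unsigned off)
{
    if (off == 0)
        return low;                       /* already aligned */
    return (low >> (8 * off)) | (high << (8 * (8 - off)));
}
```

With the manual's example (remainder 5), the low quadword supplies the top three bytes CBA and the high quadword supplies HGFED, which merge into the unaligned quadword HGFEDCBA.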
Applications: image filtering and motion estimation
Image filtering is convolution of an image with a two-dimensional mask of constants. Often, this convolution can be computed by two convolutions with one-dimensional masks. In this case, the two-dimensional mask is said to be separable. Since most masks that one encounters are separable, we will show how to use MVI in the case of a one-dimensional mask. Of course, this code can also be applied to one-dimensional signal filtering. It is assumed here that both the image and mask data are represented as bytes. This example shows how to use 16 bits to store the intermediate results, so we can only operate on 4 image pixels at a time. In addition, for some masks it is possible to avoid multiplies by doing shifts and adds. This is generally preferable since integer multiplies can have a latency of 15 cycles.
We give an example which uses a one-dimensional mask of length three consisting of the elements [M0 M1 M2]. In most cases the mask values are normalized so that the sum of their absolute values is unity. Based on this assumption, we represent the mask values as 8-bit binary fractions. The mask values are loaded into registers and kept there, since they are re-used over the entire image. The general strategy is:
loop:
    sll t0, 48, t8      # get low byte from i-1
    sll v0, 32, t9      # get low 2 bytes from i-1
    ldl v0, (a0)        # read 4 bytes
    unpkbw v0, v0       # unpack bytes into words
    addq v0, v0, t0     # double the middle term (multiply by M1)
    srl t0, 16, t10     # get high 3 bytes
    srl v0, 32, t11     # get high 2 bytes
    bis t8, t10, t8     # combine shifted data
    bis t9, t11, t9
    addq v0, t8, t8
    addq t8, t9, t8
    addq t8, rnd, t8    # round off
    srl t8, 2, t8       # divide by 4
    pkwb t8, t8         # pack words back to bytes
    stl t8, (a1)
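For reference, here is the same filter in plain C. The mask [1 2 1]/4 is assumed here; it is consistent with the doubled middle term and the final round-and-shift-by-2 in the code above, but any length-three mask fits the same pattern.

```c
#include <stdint.h>
#include <stddef.h>

/* Plain-C reference for a length-three filter.  The mask [1 2 1]/4
   is an assumption matching the shift-and-add MVI code: double the
   middle sample, add the neighbors, round, divide by 4. */
void filter121(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)
        dst[i] = (uint8_t)((src[i - 1] + 2 * src[i] + src[i + 1] + 2) >> 2);
}
```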
Motion estimation is used for video encoding of macroblocks of an image in a video stream. MPEG1, MPEG2, and H.263 all have block-based motion estimation as a computationally demanding task. The assumption is that the images in a stream change smoothly, and most of an image will still be present in succeeding images, although it may have moved a little and it may have changed a little. Based on this assumption, given a macroblock, one can search the previous image to find the most similar macroblock. If a similar one is found, then it is only necessary to encode the motion of the macroblock and the difference in the values. MVI allows the program to measure the similarity of blocks very quickly. In the following code fragment, we take an 8x8 block of pixels and look for the best match in a reference image. This 8x8 block can be thought of as eight quadwords, each quadword containing eight pixels. Conceptually, the inner loop takes the difference between the 8 rows of one block and the 8 rows of a subimage. The basic steps are: load the eight quadwords of the reference block; load and align the eight quadwords of the previous-frame block with ldq_u, extql, and extqh; compute a perr for each pair of rows; and accumulate the eight results.
A global search would nest this inside a double loop indexed on the rows and columns of the search area of the reference or target image. One way to improve the efficiency is to search each vertical slice by simply loading one new quadword for each new iteration and re-use the seven that overlap. It is necessary to align the data from the previous image that is being matched, since most of the time, the starting point for the search will not be aligned on a quadword boundary (see section 4.4.4). The following code is scheduled in a simple way and does not incorporate the context in which it would be used. It is not the most efficient way to schedule the code. However, some care was used to issue the unaligned loads in different cycles to avoid trapping in the case that they reference the same quadword (a1 is divisible by eight), and the 21164PC (PCA-56) latency of two was used. Also, a PCA-56 can only issue one extract or shift in a cycle.
ldq_u t8, (a1)      # row 1 from previous frame
ldq_u s5, 7(a1)     # rest of row 1
extql t8, AT, t8    # row 1 alignment process
extqh s5, AT, s5    # row 1 cont'd
ldq t0, (a0)        # row 1, reference frame, aligned
bis t8, s5, t8      # row 1 prev
ldq_u t9, 8(a1)     # row 2 prev
ldq_u s5, 15(a1)    # don't co-issue loads
extql t9, AT, t9    # alignment process
extqh s5, AT, s5
ldq t1, 8(a0)       # row 2 ref
bis t9, s5, t9
ldq_u t10, 16(a1)   # row 3 prev
ldq_u s5, 23(a1)    # row 3 prev
extql t10, AT, t10
extqh s5, AT, s5
ldq t2, 16(a0)      # row 3 ref
bis t10, s5, t10
ldq_u t11, 24(a1)   # row 4 prev
ldq_u s5, 31(a1)    # row 4 prev
extql t11, AT, t11
extqh s5, AT, s5
ldq t3, 24(a0)      # row 4 ref
bis t11, s5, t11
ldq_u t12, 32(a1)   # row 5 prev
ldq_u s5, 39(a1)    # row 5 prev
extql t12, AT, t12
extqh s5, AT, s5
ldq t4, 32(a0)      # row 5 ref
bis t12, s5, t12
ldq_u s0, 40(a1)    # row 6 prev
ldq_u s5, 47(a1)    # row 6 prev
extql s0, AT, s0
extqh s5, AT, s5
ldq t5, 40(a0)      # row 6 ref
bis s0, s5, s0
ldq_u s1, 48(a1)    # row 7 prev
ldq_u s5, 55(a1)    # row 7 prev
extql s1, AT, s1
extqh s5, AT, s5
bis s1, s5, s1
ldq t6, 48(a0)      # row 7 ref
ldq_u s2, 56(a1)    # row 8 prev
ldq_u s5, 63(a1)    # row 8 prev
ldq t7, 56(a0)      # row 8 ref
extql s2, AT, s2
perr t0, t8, v0     # first of 8 perr
extqh s5, AT, s5
perr t1, t9, s3
bis s2, s5, s2
perr t2, t10, s4
addq v0, s3, v0     # latency 2 for MVI
perr t3, t11, s3
addq v0, s4, v0
perr t4, t12, s4
addq v0, s3, v0
perr t5, s0, s3
addq v0, s4, v0
perr t6, s1, s4
addq v0, s3, v0
perr t7, s2, s3
addq v0, s4, v0
addq v0, s3, v0
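What the sequence of perr instructions accumulates is the sum of absolute differences (SAD) between the two 8x8 blocks. A plain-C reference, useful for checking an MVI implementation, is sketched below; the function name and the stride parameter are illustrative (the assembly above uses consecutive quadwords, i.e. a stride of 8 bytes).

```c
#include <stdint.h>
#include <stdlib.h>

/* Reference for the perr-based loop: sum of absolute differences
   (SAD) between two 8x8 blocks of bytes.  stride is the row pitch
   of each block in bytes (8 for the packed layout in the assembly). */
unsigned sad8x8(const uint8_t *ref, const uint8_t *prev, size_t stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += (unsigned)abs(ref[y * stride + x] - prev[y * stride + x]);
    return sum;
}
```

The encoder keeps the minimum SAD over all candidate positions; each perr computes one row's worth of this double loop in a single instruction.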
Scheduling issues
The order in which you issue instructions can have an effect on the speed at which your code executes, even on the 21264, which has out-of-order (OoO) execution of instructions. The main factor to consider is the latency of each instruction. On an in-order machine such as the 21164PC, if an instruction is issued before its input arguments are ready, the machine will have to stall until they are ready. In the case of OoO execution, the machine may continue to execute instructions, but physical registers will be allocated for the pending instructions, and the machine may stall if it runs out of physical registers.
MVI instructions have a latency of 2 for the 21164PC and 3 for the 21264, and they can be issued every cycle. For the 21264, there is a return slot two cycles after an MVI instruction is issued, and only another MVI can be issued in this slot. In the following code fragments the main point is to pay attention to the MVI instructions and when the results are used. The other instructions are filling the remaining issue slots. Here is a code fragment written for the 21164PC:
perr t0, t8, s0 // MVI instruction
addq v0, AT, v0 // non-MVI instruction
perr t1, t9, s1 // MVI
addq a5, AT, a5 // non-MVI
perr t2, t10, s2 // MVI
addq v0, s0, v0 // use result of first perr
Here is a code fragment written for the 21264:
perr t0, t8, s0 // MVI instruction
addq v0, AT, v0 // non-MVI instruction
ldq AT, (a0) // non-MVI
sll t1, 16, t1 // non-MVI
perr t1, t9, s1 // MVI
addq a0, AT, a0 // non-MVI
ldq AT, 8(a1) // non-MVI
sll t2, 16, t2 // non-MVI
perr t2, t10, s2 // either an MVI instruction or no instruction
addq a0, AT, a0 // non-MVI
ldq AT, 16(a1) // non-MVI
sll t3, 16, t3 // non-MVI
perr t3, t11, s3 // either an MVI instruction or no instruction
addq v0, s0, v0 // use result of first perr
ldq AT, 24(a1) // non-MVI
sll t4, 16, t4 // non-MVI
MVI Code Examples
http://www.digital.com/semiconductor/alpha/mvi-code-ex.htm
http://www.europe.digital.com/semiconductor/alpha/mvi-code-ex.htm
Digital Semiconductor Reference Library
http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
Issues in Multimedia Authoring
http://www.cs.sfu.ca/cs/CC/365/mark/material/notes/Chap2/Chap2.html
Motion Video Instruction Extensions for Alpha. (White Paper)
http://www.digital.com/semiconductor/papers/pmvi-abstract.htm
Advanced Technology for Visual Computing: Alpha Architecture with MVI (White Paper)
http://www.digital.com/semiconductor/mvi-backgrounder.htm
Parallelized Algorithms ( class notes )
http://ozone.crle.uoguelph.ca/chris/reportC.html
CCITT H.261 - Understanding H.261 Image Compression
http://www.igd.fhg.de/icib/telecom/ccitt/rec_h.261-1990/pvrg-descript/chapter2.5.html
Intro to YUV <> RGB
http://www.webartz.com/fourcc/
Basics of Device Independent Color
http://www.cs.sfu.ca/cs/CC/365/mark/material/notes/Chap3/Chap3.3/sRGB/sRGB.html
Software developers’ homepage with lots of kits. Look for Spike.
http://www.digital.com/semiconductor/alpha/developers.htm
Digital continuous profiling infrastructure
http://www.research.digital.com/SRC/dcpi/
Alpha Architecture Handbook
http://ftp.digital.com/pub/Digital/info/semiconductor/literature/alphahb2.pdf