3dNow! Tutorial

Introduction

This article focuses mainly on explaining how to use AMD's 3dNow! instruction set, but also covers the AMD Athlon extended 3dNow!/MMX instructions. It is targeted at the AMD Athlon processor and assumes you have prior knowledge of MMX instructions. (See MMX Tutorial in Hugi 19 by Rawhed/Sensory Overload.)

Not many of today's compilers support AMD's 3dNow! instructions. You will need to get the AMD Athlon SDK from www.amd.com in order to write inline 3dNow! code in Microsoft Visual C++, or TMT Pascal from www.tmt.com if you wish to write 3dNow! code in Pascal.

The AMD Athlon Processor

The Athlon processor has many new and changed features compared with the AMD K6 and Intel processors. The points of most interest to us are that the Athlon has three integer pipes, each with an integer execution unit and an address generation unit. The Athlon also has three floating-point pipes. The first performs floating-point and 3dNow! addition, and MMX ALU instructions. The second performs MMX and 3dNow! multiplication, and floating-point multiply, divide and square root instructions. The third is the floating-point load/store unit. This allows the Athlon processor to perform both MMX and integer, or 3dNow! and integer instructions simultaneously.

3dNow! Instructions

3dNow! instructions are all pretty much formatted the same way:

instruction dest,source

Where dest is usually an MMX register, and the source an MMX register or a 64bit memory location. Like the MMX instructions, 3dNow! instructions have the prefix p (packed), followed by the 3dNow! prefix f (float). PFADD, for example, is a packed floating-point add.

The 3dNow! instructions cover addition, subtraction, multiplication, reciprocals, square roots, comparisons, and floating-point to integer conversions.

The 3dNow! instructions operate on the MMX registers MM0...MM7, and can be mixed with MMX instructions with no context switching penalty, so packed integer and packed floating-point work can be freely interleaved.

MMX Extensions

The extended MMX instructions include averages, comparisons, prefetching and shuffling.

EMMS Instruction

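On the Athlon processor there is no penalty for switching between MMX/3dNow! code and x87 floating-point code. The MMX registers still share their state with the x87 register stack, however, so EMMS should still be issued before any x87 code runs. The Athlon also supports the FEMMS instruction, which provides a much faster context switch than the EMMS instruction.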

3dNow! Data

3dNow! instructions operate on two 32bit packed floating-point numbers in one 64bit MMX register. The Athlon also supports new move instructions:

MOVNTQ: Streaming (Cache Bypass) store.
MASKMOVQ: Streaming (Cache Bypass) store using a Byte Mask.

It also supports four new prefetching instructions:

PREFETCHT0: Move data into all cache levels
PREFETCHT1: Move data into all cache levels except the 0th level.
PREFETCHT2: Move data into all cache levels except the 0th and 1st levels.
PREFETCHNTA: Move data into processor with minimal L1/L2 cache pollution.

Along with the K6's prefetching instructions:

PREFETCH: Prefetch a cache line into the L1 data cache.
PREFETCHW: Prefetch a cache line into the L1 data cache and mark it for modification.

Converting between floating-point data and integer data:

PF2ID mm0,mm1 ;converts the two floating-point values in mm1 to 32bit integer values in mm0
PI2FD mm0,mm1 ;converts the two 32bit integer values in mm1 to floating-point values in mm0
PF2IW mm0,mm1 ;converts the two floating-point values in mm1 to 16bit integer values (sign-extended to 32 bits) in mm0
PI2FW mm0,mm1 ;converts the two 16bit integer values in mm1 (the low word of each DWORD) to floating-point values in mm0

3dNow! technology always truncates when converting from floating-point to integer values.
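
To make the truncation rule concrete, here is a small scalar C sketch of what one lane of PF2ID and PI2FD computes. It is not 3dNow! code: the function names are made up for illustration, and it assumes the input fits in a 32bit signed integer (out-of-range handling is not modelled).

```c
#include <stdint.h>

/* Scalar model of one PF2ID lane. The fractional part is discarded
   (rounding toward zero), which is also what a C float-to-int cast does.
   Assumption: x fits in a 32bit signed integer. */
static int32_t pf2id_lane(float x)
{
    return (int32_t)x;
}

/* Scalar model of one PI2FD lane. */
static float pi2fd_lane(int32_t n)
{
    return (float)n;
}
```

Note that truncation moves negative values towards zero, so -2.7f becomes -2 and not -3.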

Basic arithmetic using 3dNow!

Addition and subtraction using 3dNow! is very similar to the MMX commands. Assuming for each example that mm0 contains the values 9.0f (low) and 5.0f (high), and mm1 contains 2.0f (low) and 14.0f (high) we then find:

PFADD mm0,mm1 ;add the two floating-point values in mm0 to the two in mm1

This returns (9+2=)11.0f (low) and (5+14=)19.0f (high) in mm0.

PFSUB mm0,mm1 ;subtract the two floating-point values in mm1 from the two in mm0

This returns (9-2=)7.0f (low) and (5-14=)-9.0f (high) in mm0.

PFSUBR mm0,mm1 ;subtract the two floating-point values in mm0 from the two in mm1

This returns (2-9=)-7.0f (low) and (14-5=)9.0f (high) in mm0.

PFACC mm0,mm1 ;add the low floating point value in mm0 to the high floating-point value in mm0 and store it as the low floating-point value in mm0, and add the low floating point value in mm1 to the high floating-point value in mm1 and store it as the high floating-point value in mm0

This returns (9+5=)14.0f (low) and (2+14=)16.0f (high) in mm0.

PFNACC mm0,mm1 ;subtracts the high floating point value in mm0 from the low floating-point value in mm0 and stores the value as the low floating-point value in mm0, and subtracts the high floating-point value in mm1 from the low floating-point value in mm1 and stores the value as the high floating-point value in mm0

This returns (9-5=)4.0f (low) and (2-14=)-12.0f (high) in mm0.

PFMUL mm0,mm1 ;multiplies the two floating-point values in mm0 with the two floating-point values in mm1

This returns (9*2=)18.0f (low) and (14*5=)70.0f (high) in mm0.
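
The lane behaviour of the instructions above can be checked with a scalar C model. This is only a sketch for reasoning about the results, not 3dNow! code: the pf2 type and the function names are hypothetical, and each function returns the value the destination register would hold.

```c
/* Two packed single-precision lanes, as held in one MMX register. */
typedef struct { float lo, hi; } pf2;

/* d is the destination operand, s the source operand. */
static pf2 pfadd(pf2 d, pf2 s)  { pf2 r = { d.lo + s.lo, d.hi + s.hi }; return r; }
static pf2 pfsub(pf2 d, pf2 s)  { pf2 r = { d.lo - s.lo, d.hi - s.hi }; return r; }
static pf2 pfsubr(pf2 d, pf2 s) { pf2 r = { s.lo - d.lo, s.hi - d.hi }; return r; }
static pf2 pfmul(pf2 d, pf2 s)  { pf2 r = { d.lo * s.lo, d.hi * s.hi }; return r; }

/* The accumulate forms work across the two lanes of one register:
   the low result comes from the destination, the high result from the source. */
static pf2 pfacc(pf2 d, pf2 s)  { pf2 r = { d.lo + d.hi, s.lo + s.hi }; return r; }
static pf2 pfnacc(pf2 d, pf2 s) { pf2 r = { d.lo - d.hi, s.lo - s.hi }; return r; }
```

With d = {9.0f, 5.0f} and s = {2.0f, 14.0f}, these functions give the same low and high values as quoted for each instruction above.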

3dNow! does not have a division instruction. Instead, in order to divide one value by another, one must first find the reciprocal of the divisor and then multiply the dividend by it. The reciprocal is first found by using a lookup table approximation accurate to 14 bits, and then using the Newton-Raphson method to find the complete 24bit reciprocal. This is achieved through the following instructions:

PFRCP mm2,mm0 ;uses the lookup table to find 14bit precise reciprocal of the low floating-point value in mm0 and stores the result in both the high and low packed floating-point values in mm2

This returns an approximation for 1/9 in mm2. So, mm2 equals 0.111109f for both its low and high values.

Assuming now that mm2 is still equal to 0.111109f low and high, we get:

PFRCPIT1 mm0,mm2 ;first intermediate step of the Newton Raphson method iteration to find 24bit result
PFRCPIT2 mm0,mm2 ;second step of the Newton Raphson method iteration to find 24bit result

Now this will only return the correct result (0.111111) in the low value of mm0, as the high value of mm0 started as 5.0f as opposed to 9.0f. This is because the PFRCP instruction only operates on the low floating-point value of its source. The following example shows how one can find the reciprocal of both the low and high floating-point values.

Assume mm0 contains 9.0f (low) and 3.0f (high), and mm1 still contains 2.0f (low) and 14.0f (high):

PFRCP mm2,mm0 ;find 1/9 approx and store in mm2 (low&high)
PSWAPD mm0,mm0 ;swap the two DWORDs in mm0 (we now have 3.0f(low) and 9.0f(high))
PFRCP mm3,mm0 ;find 1/3 approx and store in mm3 (low&high)
PSWAPD mm0,mm0 ;swap the two DWORDs back (we now have 9.0f(low) and 3.0f(high) again)
PUNPCKLDQ mm2,mm3 ;this moves the low DWORD in mm3 to the high DWORD in mm2 (so mm2 is now 1/9 approx (low) and 1/3 approx (high))
PFRCPIT1 mm0,mm2 ;first intermediate iteration step
PFRCPIT2 mm0,mm2 ;final iteration step. We now have the completed inverse, 0.111111f (low) and 0.33333f (high)

;If we then wish to find mm1/mm0
PFMUL mm0,mm1 ;giving ((1/9)*2=)0.222222f (low) and ((1/3)*14=)4.66667f (high) in mm0
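
The refinement step can be written out in scalar C to see why one iteration is enough. This is a sketch of the textbook Newton-Raphson formula for a reciprocal, x1 = x0*(2 - b*x0), and is an assumption about the arithmetic rather than a model of how the hardware splits the work between PFRCPIT1 and PFRCPIT2. The function names are made up, and the starting estimate is passed in by hand in place of the lookup table.

```c
/* One Newton-Raphson step for 1/b, starting from the estimate x0.
   Each step roughly doubles the number of correct bits, so a 14bit
   estimate reaches close to single precision after one step. */
static float recip_refine(float b, float x0)
{
    return x0 * (2.0f - b * x0);
}

/* Division as described in the text: n / b computed as n * (1 / b). */
static float div_via_recip(float n, float b, float x0)
{
    return n * recip_refine(b, x0);
}
```

Calling recip_refine(9.0f, 0.111109f) moves the estimate from 0.111109 to about 0.111111, and div_via_recip(2.0f, 9.0f, 0.111109f) then gives about 0.222222.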

Square Root using 3dNow!

Finding the square root also utilises the Newton-Raphson method to find a full 24 bit result, this time starting from an approximation of the reciprocal square root. It makes use of two new instructions:

PFRSQRT mm1,mm0 ;find the approximation for the reciprocal square root (1/sqrt) of the low floating-point value in mm0 and store it in both the low and high values of mm1
PFRSQIT1 mm1,mm0 ;calculate the intermediate step in finding the inverse square root of the two floating point values in mm0

Assuming mm0 contains 13.0f (low):

PFRSQRT mm2,mm0 ;find the approximation of 1/sqrt(13) and store it in mm2. So mm2 now contains 0.277348 in both its low and high values.
MOVQ mm3,mm2 ;save the contents of mm2 into mm3
PFMUL mm2,mm2 ;find the square of the reciprocal approximation
PFRSQIT1 mm2,mm0 ;perform the first intermediate step in the Newton Raphson approximation for the square root.
PFRCPIT2 mm2,mm3 ;perform the final step, leaving the full 24bit accurate value of 1/sqrt(13) (=0.277350) as the low value for mm2
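
As a cross-check of the numbers above, the same refinement can be sketched in scalar C. The code uses the textbook Newton-Raphson formula for a reciprocal square root, x1 = x0*(3 - b*x0*x0)/2. That formula is an assumption about the arithmetic, not a description of the internal steps of PFRSQIT1 and PFRCPIT2, and the function names are hypothetical.

```c
/* One Newton-Raphson step for 1/sqrt(b), starting from the estimate x0. */
static float rsqrt_refine(float b, float x0)
{
    return 0.5f * x0 * (3.0f - b * x0 * x0);
}

/* sqrt(b) follows from b * (1/sqrt(b)), which costs one more multiply. */
static float sqrt_via_rsqrt(float b, float x0)
{
    return b * rsqrt_refine(b, x0);
}
```

With b = 13.0f and the estimate 0.277348f, rsqrt_refine returns about 0.277350, and sqrt_via_rsqrt returns about 3.60555.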

Brief overview of the MMX Extended instructions

The Athlon has 24 extended instructions; assuming mm0 equals FFFF010F0070079Ah and mm1 equals FF00FF100144F7A8h, the following instructions produce the following results:

PAVGB mm0,mm1 ; Average of the eight unsigned bytes in mm0 and mm1, rounded up.

This returns FF808010015A7FA1h (i.e.: (FF+FF+1)/2=FF, (FF+00+1)/2=80, (01+FF+1)/2=80, etc..).

PAVGW mm0,mm1 ; Average of the four unsigned words in mm0 and mm1, rounded up.

This returns FF80801000DA7FA1h (i.e.: (FFFF+FF00+1)/2=FF80, (010F+FF10+1)/2=8010, etc..).
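
The rounding rule is easiest to see lane by lane. Below is a scalar C sketch of the two averaging instructions on a 64bit value. The function names simply mirror the mnemonics; this is a model for checking results, not MMX code.

```c
#include <stdint.h>

/* PAVGB model: unsigned average of each of the eight bytes,
   rounded up, i.e. (a + b + 1) >> 1 per byte. */
static uint64_t pavgb(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint32_t x = (uint32_t)(a >> (8 * i)) & 0xFFu;
        uint32_t y = (uint32_t)(b >> (8 * i)) & 0xFFu;
        r |= (uint64_t)((x + y + 1u) >> 1) << (8 * i);
    }
    return r;
}

/* PAVGW model: the same operation on each of the four words. */
static uint64_t pavgw(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (uint32_t)(a >> (16 * i)) & 0xFFFFu;
        uint32_t y = (uint32_t)(b >> (16 * i)) & 0xFFFFu;
        r |= (uint64_t)((x + y + 1u) >> 1) << (16 * i);
    }
    return r;
}
```

The intermediate sum is kept in a wider type, so FF+FF+1 does not overflow before the shift.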

PEXTRW eax,mm0,2 ; Extracts the third least significant word in mm0 into eax

This returns 010F in EAX.

PMAXSW mm0,mm1 ; Finds the four maximum signed word values from mm0 and mm1.

This returns FFFF010F0144079Ah (i.e. FFFF>FF00 ? FFFF : FF00 , etc..).

PMAXUB mm0,mm1 ; Finds the eight maximum unsigned byte values from mm0 and mm1

This returns FFFFFF100170F7A8h (i.e. FF>FF ? FF:FF, FF>00 ? FF:00, etc..).

PMINSW mm0,mm1 ; Finds the four minimum signed word values from mm0 and mm1.

This returns FF00FF100070F7A8h (i.e. FFFF<FF00 ? FFFF : FF00 , etc..).

PMINUB mm0,mm1 ; Finds the eight minimum unsigned byte values from mm0 and mm1

This returns FF00010F0044079Ah (i.e. FF<FF ? FF:FF, FF<00 ? FF:00, etc..).

PMOVMSKB eax,mm0 ; Takes the most significant bit of each of the eight bytes in mm0 and places them in the least significant byte of eax

This returns C1h (i.e.: FF>>7,FF>>7,01>>7,0F>>7,etc..).
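
The bit gathering can be sketched in scalar C as follows. The function name mirrors the mnemonic and the code is only a model of the result, with byte 0 being the least significant byte.

```c
#include <stdint.h>

/* PMOVMSKB model: bit i of the result is the top bit of byte i. */
static uint32_t pmovmskb(uint64_t m)
{
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++)
        mask |= (uint32_t)((m >> (8 * i + 7)) & 1u) << i;
    return mask;
}
```

For FFFF010F0070079Ah only bytes 7, 6 and 0 have their top bit set, which gives 11000001b.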

PMULHUW mm0,mm1 ; Multiplies the four unsigned words in mm0 with the four unsigned words in mm1 and returns the upper word of the 32bit result.

This returns FEFF010E0000075Ah (i.e. FFFF*FF00=FEFF0100, 010F*FF10=010E01F0, etc..).
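
A scalar C sketch of the same operation, useful for checking the products by hand. The function name mirrors the mnemonic; it is a model of the result only.

```c
#include <stdint.h>

/* PMULHUW model: multiply each pair of unsigned words and keep
   the upper 16 bits of the 32bit product. */
static uint64_t pmulhuw(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (uint32_t)(a >> (16 * i)) & 0xFFFFu;
        uint32_t y = (uint32_t)(b >> (16 * i)) & 0xFFFFu;
        r |= (uint64_t)((x * y) >> 16) << (16 * i);
    }
    return r;
}
```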

PSHUFW mm0,mm1,75 ; Shuffles the four words in mm1 and stores the result in mm0. Each pair of bits in the immediate number selects which word of mm1 is copied to the corresponding word of mm0, giving 256 possible selections (24 of them are pure reorderings). So 01001011 (75) would result in the second, the first, the third and the last word of mm1 (counting from the least significant word) being placed into mm0, from the most significant word down

This returns 0144F7A8FF10FF00h.
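
The way the immediate is decoded can be sketched in scalar C. The function name mirrors the mnemonic and the code only models the result; word 0 is the least significant word.

```c
#include <stdint.h>

/* PSHUFW model: bits 2i+1..2i of imm8 give the index of the source
   word that is copied into result word i. */
static uint64_t pshufw(uint64_t src, unsigned imm8)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2 * i)) & 3u;       /* source word index */
        uint64_t w = (src >> (16 * sel)) & 0xFFFFu;  /* selected word */
        r |= w << (16 * i);
    }
    return r;
}
```

An immediate of E4h (11100100b) leaves the register unchanged, and 4Eh (01001110b) swaps the two DWORDs.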

PSWAPD mm0,mm0 ; Swaps the position of the two DWORDs in mm0

This returns 0070079AFFFF010Fh.

Examples

Example of 3dNow! complex number multiplication

Assume the mm0 register contains 2.3+4.5i, with the real part in the low floating-point value and the imaginary part in the high floating-point value, and mm1 contains 6.7+8.9i in the same format. Then, to find the complex product:

pswapd mm2,mm0 ;mm2 now contains 2.3(high) and 4.5 (low)
pfmul mm0,mm1 ;now we compute the product of the two real parts (2.3*6.7) (low) and the two imaginary parts (4.5*8.9) (high)
pfmul mm1,mm2 ;now we compute the cross products of real and imaginary, (4.5*6.7) (low) and (2.3*8.9) (high)
pfpnacc mm0,mm1 ;subtract within mm0 (15.41-40.05) and add within mm1 (30.15+20.47): now we have the result (-24.64) (low) and (50.62) (high)
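
For reference, here is the same product in scalar C, which can be used to check the values a 3dNow! version should produce. The cplx type and cmul function are made up for this sketch, with re standing for the low lane and im for the high lane.

```c
/* re corresponds to the low lane, im to the high lane. */
typedef struct { float re, im; } cplx;

/* (a + bi)(c + di) = (ac - bd) + (bc + ad)i
   The real part needs a subtraction and the imaginary part an addition. */
static cplx cmul(cplx x, cplx y)
{
    cplx r;
    r.re = x.re * y.re - x.im * y.im;
    r.im = x.im * y.re + x.re * y.im;
    return r;
}
```

For 2.3+4.5i and 6.7+8.9i this gives 15.41-40.05 = -24.64 for the real part and 30.15+20.47 = 50.62 for the imaginary part.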

Example - Blending two 32bit images

Assuming you have two 32bit color images pointed to by vscr1 and vscr2, a 32bit buffersize giving the image size in bytes, and a destination screen dest:

mov edi,[vscr1]
mov esi,[vscr2]
mov edx,[dest]
xor ecx,ecx ;start at byte offset 0
beginning:
movq mm0,[edi+ecx] ;move 2 pixels into mm0
movq mm1,[esi+ecx] ;move 2 pixels into mm1
pavgb mm0,mm1 ;find the averages ((ARGB1+ARGB2)/2)
movq [edx+ecx],mm0 ;move the result to the destination screen
add ecx,8
cmp ecx,[buffersize]
jb beginning
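
A plain C version of the same blend is handy as a reference when checking the output of the loop. The function and parameter names are hypothetical. It works one byte at a time and uses the same round-up average as PAVGB, so each channel is (A+B+1)/2 rather than a truncating (A+B)/2.

```c
#include <stddef.h>
#include <stdint.h>

/* Average two buffers byte by byte, rounding up as PAVGB does.
   nbytes is the buffer size in bytes (4 per 32bit pixel). */
static void blend_average(uint8_t *dest, const uint8_t *a,
                          const uint8_t *b, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dest[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

The assembly loop does the same work eight bytes (two pixels) per iteration.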

Another 3dNow! example:

sqrt( (a-c)*(a-c) + (b-d)*(b-d) )

movd mm0,[a] ;load all the values
movd mm1,[c]
movd mm2,[b]
movd mm3,[d]
pfsub mm0,mm1 ;a-c
pfsub mm2,mm3 ;b-d
punpckldq mm0,mm2 ;move the low floating-point value in mm2 to the high position in mm0
pfmul mm0,mm0 ;square the two terms simultaneously
pfacc mm0,mm0 ;add the two terms in mm0 to each other, and save in both high and low positions.
pfrsqrt mm1,mm0 ;find the approximation of 1/sqrt(mm0) and store it in mm1.
movq mm2,mm1 ;save the contents of mm1 into mm2
pfmul mm1,mm1 ;find the square of the reciprocal approximation
pfrsqit1 mm1,mm0 ;perform the first intermediate step in the Newton Raphson approximation for the square root.
pfrcpit2 mm1,mm2 ;perform the final step, leaving the full 24bit accurate value of 1/sqrt(mm0)
pfmul mm0,mm1 ;calculate final value for sqrt(mm0)

Example: 2d IsoSurface (MetaBlob)

Given two 2d coordinate positions, xp1,yp1 and xp2,yp2, and a position x,y on the 2d grid for which the value must be calculated. The code below handles one grid position at a time: x and y are copied into both halves of a register, so that the distances to both coordinate positions are computed in parallel.

The algorithm:

a=sqrt( (x-xp1)*(x-xp1) + (y-yp1)*(y-yp1) )
b=sqrt( (x-xp2)*(x-xp2) + (y-yp2)*(y-yp2) )
c=sqrt(a*b)

3dNow! Code:

movd mm0,[x]
punpckldq mm0,mm0 ;copy x to the high value of mm0
movd mm1,[y]
punpckldq mm1,mm1 ;copy y to the high value of mm1
movd mm2,[xp1]
movd mm3,[xp2]
movd mm4,[yp1]
movd mm5,[yp2]
punpckldq mm2,mm3 ;place xp2 in the high value of mm2
punpckldq mm4,mm5 ;place yp2 in the high value of mm4
pfsub mm0,mm2 ;subtract: x-xp2,x-xp1
pfsub mm1,mm4 ;subtract: y-yp2,y-yp1
pfmul mm0,mm0 ;square both x results
pfmul mm1,mm1 ;square both y results
pfadd mm0,mm1 ;add x results to y results. (i.e. mm0=(x-xp2)*(x-xp2) + (y-yp2)*(y-yp2), (x-xp1)*(x-xp1) + (y-yp1)*(y-yp1))
pfrsqrt mm1,mm0 ;find the approximation of 1/sqrt for the low value and store it in mm1
pswapd mm0,mm0 ;swap the positions inside mm0
pfrsqrt mm2,mm0 ;find the approximation of 1/sqrt for the old high value and store it in mm2
pswapd mm0,mm0 ;restore mm0 to its original state
punpckldq mm1,mm2 ;move the high value estimate to the high position in mm1
movq mm3,mm1 ;save a copy of mm1
pfmul mm1,mm1 ;find the square of the reciprocal approximation
pfrsqit1 mm1,mm0 ;perform the first intermediate step in the Newton Raphson approximation for the square root.
pfrcpit2 mm1,mm3 ;perform the final step, leaving the full 24bit accurate values of 1/sqrt (low and high) in mm1
pfmul mm0,mm1 ;we have now found a (low) and b (high)
pswapd mm1,mm0 ;flip the two floating-point values in mm0 and store them in mm1
pfmul mm0,mm1 ;multiply a by b, giving a*b in both halves of mm0; c is the square root of this value

Closing words

The examples given in this article were not fully tested and so may contain some errors. They are not the fastest way of doing things, as they are intended to be easily understood. For future versions and corrections of the article, along with other tidbits, check www.rmmdesign.com.au.

Good luck!

Adrian Boeing
