SIMD-FP Optimisation Techniques Revised
Chris Dragan

This article is a continuation of my previous article from Hugi#20 about optimising SIMD code on x86 processors.

3DNow!

When writing 3DNow! code it is important to remember that the older K6-2 and K6-III processors don't have the new useful instructions introduced with the Athlon. It is recommended to optimise for the K6-2 first, and later consider whether additional Athlon-specific routines are necessary.

The K6 family, as well as all older non-Intel processors, had very weak FPUs. The weakness of the competitors' FPUs resulted from the lack of pipelining - only one FPU instruction could be executed at a time. However, the K6-2's 3DNow! provides an excellent workaround for this problem: 3DNow! code performs much better than any FPU code on any x86 processor. This is because all 3DNow! instructions have a two-cycle latency and are pipelined. What's more, they process two FP numbers at a time (well, that's obvious... *grin*). A division can be performed in 4-6 cycles or in 8-10 cycles, depending on the required precision (15-bit or full 24-bit) and on the number of divisors (1 or 2). Of course two divisions can be performed simultaneously, thanks to the two-cycle latency of the instructions used, which gives an even better result. Square root is also very fast with 3DNow! - it takes 4 cycles more than division (only 2 cycles more if we need the reciprocal square root). And of course we must remember that 3DNow! supports only single precision - if we want to work with greater precision, we are still confined to the FPU. However, 3DNow!'s single precision is enough for all real-time calculations, including 3D graphics, raytracing and DSP. In these areas we get code 2 to 4 times faster than the usual FPU code (on Intel P5 and P6 family processors and on the AMD Athlon).
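The fast division mentioned above goes through the reciprocal lookup instructions. A minimal sketch, assuming the dividend is in mm0 and the divisor in mm1 (low elements):

	pfrcp	 mm2, mm1	; mm2 = ~1/divisor, 15-bit approximation
				; (broadcast to both halves)
	pfrcpit1 mm1, mm2	; first Newton-Raphson refinement step
	pfrcpit2 mm1, mm2	; mm1 = 1/divisor, full 24-bit precision
	pfmul	 mm0, mm1	; mm0 = dividend/divisor

For the 15-bit case the two pfrcpit* refinement steps are simply skipped. A similar pattern works for square root, with pfrsqrt as the lookup and pfrsqit1 in the refinement.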

Indeed, all of the above is meant to convince you to use 3DNow!, to make a version of your routines for the AMD processors too *grin*. Note that K6-2 processors are very cheap, yet they can provide high performance when treated properly.

A simple guideline for the K6-2 is to optimise as for the Pentium, even though the K6 architecture is much more advanced than the P5 and has some level of out-of-order execution. Practice shows that hard-optimised 3DNow! code (i.e. taking into account the K6-2's decoding, issuing, operand fetching and execution) is usually not faster than simply paired code. The only noticeable difference is that we have to sweat much more when writing K6-2-specific DIOX schemes. Of course, when optimising code by pairing, we have to consider instruction latencies, which sometimes differ from those on the Pentium.
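For example, the two-cycle latency is easy to hide by interleaving two independent dependency chains, just as one would pair code for the Pentium. A sketch (the register allocation is arbitrary):

	pfmul	mm0, mm4	; chain 1 - result ready two cycles later
	pfadd	mm2, mm5	; chain 2 - can issue alongside the pfmul
	pfmul	mm1, mm6	; chain 3 - fills the latency gap
	pfadd	mm0, mm3	; chain 1 - mm0 is ready by now, no stall

No instruction uses a result produced in the immediately preceding cycle, so the pipelines stay full.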

As you probably remember from my previous article (or know from some other sources *grin*), the K6 processors suffer from a two-cycle load latency. However, this is not a big penalty when we consider the following:

- the loads aren't performed in the execution pipelines, but by a separate load unit;

- if a load is preceded by a slow instruction or by a long dependency chain, it is performed earlier than it is needed, out of order;

- there are no AGI stalls - the address generation phase is included in the two-cycle latency of loads;

- stores to memory usually have NO latency - they are performed as if they didn't exist.

From the K6 architecture we also get very fast immediate loads. For example the instruction "mov edx, 5" is not executed at all - its result is ready right after the decoding stage.

The K6's MMX is superior to Intel's MMX in some respects (note that the Pentium II and friends have the same MMX execution units as the Pentium MMX). The latency of MMX multiplications is 2 instead of 3. If the code is not tight, e.g. it has long dependency chains, loads from memory are performed earlier, before they are needed. There is also no pairing limitation on loads (on the Pentium MMX, MMX loads could only go to the U pipe and couldn't go in parallel with integer code). Another nice feature of the K6's MMX is that pack*/punpck* instructions are executed in the ALUs and not in the shifter, so two unpacks can execute simultaneously.
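So on the K6 a pair of independent unpacks like the following can execute in one cycle, while on the Pentium MMX they would serialise in the single shifter unit (a sketch):

	punpcklbw mm0, mm2	; goes to one ALU
	punpckhbw mm1, mm2	; goes to the other ALU - same cycle on K6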

Things change a little with the Athlon, mainly because of the longer latencies of instructions. Usually this doesn't hurt much, since this processor has very nice instruction decoding facilities and other goodies. 3DNow! code optimised for the K6-2 runs well on the Athlon, and usually cannot be tuned much better.

SSE

The P6 family of processors (Pentium Pro, Pentium II, Pentium III and all clones like the Celeron) splits instructions into simpler ones, called micro-operations (abbreviated uops). Simple instructions generate one uop, while complex instructions generate 2 or more uops. A few examples make things clearer:



	add	eax, ebx	; 1 uop
	add	eax, [ebx]	; 2 uops
	add	[ebx], eax	; 4 uops

Addition from register to register is a simple instruction. Addition from memory to a register generates two micro-operations: one loading the data from memory and one doing the actual addition. The third example is also easy to understand: 4 uops are generated - one for the load, one for the add and two for the store. A simple store operation always generates two uops.

How does out-of-order execution work? The simplified explanation is as follows: instructions are split into uops. The uops are stored in a reservation station and executed whenever possible. For example, when a uop depends on the result of an operation that is still executing, the two cannot be executed simultaneously. Example:



	add	eax, ebx	; 1st cycle
	add	ecx, eax	; 2nd cycle
	add	ebx, esi	; 1st cycle

The first and the third instructions can be executed simultaneously, and they will be. The second instruction will be executed after them, as it depends on the result of the first instruction.

The conclusion is that the processor optimises the code for you! But things aren't as bright as they might look. The first and most important limitation we meet is in the decoding of instructions into uops. The P6 family of microprocessors can decode a maximum of 3 instructions at a time, provided that the first instruction generates no more than 4 uops and the two succeeding instructions generate no more than 1 uop each and are no longer than 7 bytes. This is called the 4-1-1 decoding scheme. The instruction decoders are called d0, d1 and d2, and they decode three subsequent instructions simultaneously. If the above conditions aren't met, only two instructions, or in the worst case one, are decoded per cycle. In extreme cases a single instruction takes multiple cycles to decode. Examples:



	add	eax, ebx	; 1st cycle, d0 (1 uop)
	add	ecx, edx	; 1st cycle, d1 (1 uop)
	add	esi, [eax]	; 2nd cycle, d0 (2 uops)
	cmp	esi, edi	; 2nd cycle, d1 (1 uop)
	sbb	ebp, ebp	; 2nd cycle, d2 (1 uop)
	neg	ecx		; 3rd cycle, d0 (1 uop)
	pop	[edi]		; 4th and 5th cycle, d0 (8 uops)

The last instruction takes two cycles to decode, because it generates 8 uops, and no more than 6 uops can be generated per cycle. There are instruction decoding tables which specify how many uops each instruction generates - see the Intel manuals or Agner Fog's optimisation guide.

There are two kinds of SIMD-FP SSE instructions - packed (which process four numbers at a time) and scalar (which process one number at a time). The eight SSE registers are logically 128-bit wide, i.e. the user sees them as if they contained four single-precision numbers. But in fact each SSE register is a pair of smaller 64-bit registers. This can be illustrated by the following example:



	movups	xmm0, [temp1]	; xmm0 = [ 1.0 | 2.0 | 3.0 | 4.0 ]
	movups	xmm1, [temp2]	; xmm1 = [ 5.0 | 6.0 | 7.0 | 8.0 ]
	movhlps xmm0, xmm1	; xmm0 = [ 1.0 | 2.0 | 5.0 | 6.0 ]
	movlhps xmm0, xmm1	; xmm0 = [ 7.0 | 8.0 | 5.0 | 6.0 ]

The last two instructions both write to the xmm0 register. But there is no write-after-write dependency, since the high and low parts of xmm0 are physically two separate registers, so both instructions can execute in parallel.

Now, let's consider two instructions: addss and addps.



	movups	xmm0, [temp1]	; xmm0 = [ 1.0 | 2.0 | 3.0 | 4.0 ]
	xorps	xmm1, xmm1	; xmm1 = [ 0.0 | 0.0 | 0.0 | 0.0 ]
	xorps	xmm2, xmm2	; xmm2 = [ 0.0 | 0.0 | 0.0 | 0.0 ]
	addss	xmm1, xmm0	; xmm1 = [ 0.0 | 0.0 | 0.0 | 4.0 ]
	addps	xmm2, xmm0	; xmm2 = [ 1.0 | 2.0 | 3.0 | 4.0 ]

The difference is noticeable. The addss instruction touches only the low part of xmm1 and doesn't create any dependency on the high part. The addss instruction generates one uop. The addps instruction, however, accesses both sub-registers, high and low, and generates two uops, one processing the low part and one the high part. The addss instruction has a latency of three cycles - this is the real latency of the SSE SIMD-FP addition unit. Since this unit is pipelined, the two uops of the addps instruction are issued one after the other, so the entire result is ready after four cycles. But since there is no dependency between the sub-registers, the first result, regardless of whether it is the high or the low part of the register, can be used by another instruction after three cycles. The same applies to load and store instructions, so addps usually behaves as if it had a 3-cycle latency.

You have probably noticed that the packed versions of instructions (except movhlps, movlhps and movmskps) can only be decoded by the d0 decoder, since they generate at least two uops. Therefore it is a good idea to use these instructions with a memory operand, instead of issuing the loads earlier, as long as the source operands are aligned. Another thing is that a chain of packed instructions performs very slowly, since only one instruction is decoded per cycle - all of them go to d0. Therefore packed instructions should be intermixed with simple instructions to achieve optimal decoding speed.
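A sketch of such intermixing - each packed instruction lands in d0 and two simple instructions fill d1 and d2, so three instructions are decoded per cycle (the loop bookkeeping registers are arbitrary):

	mulps	xmm0, [esi]	; d0 (2 uops) - packed, with memory operand
	add	esi, 16		; d1 (1 uop)
	dec	ecx		; d2 (1 uop)
	addps	xmm1, xmm2	; d0 (2 uops) - next cycle
	mov	eax, [edi]	; d1 (1 uop)
	add	edi, 4		; d2 (1 uop)

A chain of plain mulps/addps instructions covering the same work would decode at only one instruction per cycle.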

At this point it is worth mentioning that the Athlon uses a more symmetric 2-2-2 decoding scheme instead, so the placement restrictions of the P6's 4-1-1 scheme do not apply there.

Latencies

The following table shows the latencies of the actual micro-operations. The K6-2 also uses micro-operations internally, though in a different way than the P6 and Athlon.

The MMX instructions concerned are all MMX instructions that don't perform loads or stores. Loads and stores are performed differently on each processor and are not presented here. The MMX non-mul instructions are all MMX instructions that don't perform multiplication; the MMX mul instructions are: pmulhw, pmullw, pmaddwd, pmulhrw (3DNow!) and pmulhuw (SSE, Athlon).

3DNow! instructions are divided into two categories: table lookup (pfrcp and pfrsqrt) and regular (all the others). The 3DNow! MMX instructions (pavgusb, pmulhrw, pswapd (Athlon)) are treated as regular MMX instructions.

SSE instructions are divided into four categories: addition (addss, subss, maxss, minss, cmpss), multiplication (mulss), logical (xorps, andps, orps) and table lookup (rcpss, rsqrtss). Division and square root (div*s and sqrt*s) are typically very slow, and aren't presented here. The SSE MMX instructions (pavg*, pmin*, pmax*, pmulhuw, psadbw), also supported on the Athlon, are treated as regular MMX instructions.
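Since divss is so slow, a common trick is to replace a division by rcpss followed by one Newton-Raphson step, which restores nearly full single precision. A sketch, assuming x is in xmm0 and `two` is a memory constant holding 2.0:

	movss	xmm2, xmm0	; keep a copy of x
	rcpss	xmm1, xmm0	; r0 = ~1/x (about 12 bits of precision)
	mulss	xmm2, xmm1	; x*r0
	movss	xmm3, [two]	; load the constant 2.0
	subss	xmm3, xmm2	; 2 - x*r0
	mulss	xmm1, xmm3	; r1 = r0*(2 - x*r0), close to 24 bits

Multiplying the dividend by r1 then completes the division using only the fast addition and multiplication units.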

Other instructions not presented here include: pshufw, pinsrw, pextrw, maskmovq, pmovmskb, movntq, the SSE conversion instructions, emms, femms, prefetch(w), prefetcht0/t1/t2, prefetchnta, sfence and other special instructions.

Again, the table presents the latencies of uops, not of whole instructions.

	                     P6    K6-2   Athlon
	MMX non-mul           1     1      2
	MMX mul               3     2      3
	3DNow! table lookup   -     2      3
	3DNow! regular        -     2      4
	SSE addition          3     -      -
	SSE multiplication    4     -      -
	SSE logical           2     -      -
	SSE table lookup      1     -      -

As a side note, the Athlon treats all "merged unit" MMX/SSE instructions, i.e. those which access both MMX and integer resources, as vector-decodable. Vector-decodable instructions are typically decoded one per cycle. This concerns movd to/from an integer register, pinsrw, pextrw, maskmovq and pmovmskb. There is also a small drawback on the Athlon - passing a result from an MMX instruction to a SIMD-FP instruction, or vice versa, takes one additional cycle.

Chris Dragan