Adding 16bpp pixels using MMX

Last time I presented a reasonably fast way of adding 16bpp (bit-per-pixel) pixels with integer unit. "Adding with saturation" took 13.5 clocks per pixel and "pixel average" took 5 clocks per pixel.

Nowadays all (new) computers have processors supporting MMX instruction set. What can one do with it? Let's see.

The workaround

BRAD presented a better way of adding 16bpp pixels with the integer unit. As MMX registers are 64-bit, we can take another step forward by performing on 4 16-bit pixels exclusively, which will double the efficiency of the algorithm.

Routines

The first routine does "adding with saturation" and the other "pixel average". Both routines read pixels from buffer/stream at [esi], mix them with pixels at [edi] and store results also at [edi]. The MMX specifics, i.e. lack of immediates, requires us to keep masks in memory or in unused MMX registers. It is better to keep constants in registers, since instructions that access memory can be paired only in U-pipe.

And like before, by right of each instruction I marked the cycle in which it is issued. '-' means that the instruction is paired in the V-pipe, i.e. it is issued in the same cycle as the previous instruction. Note that pairing concerns only Pentium MMX.


	; Load unused registers with constants (before the loop)
		movq	mm6, [_7BEF_7BEF_7BEF_7BEFh] ; not-MSB mask
		movq	mm7, [_8410_8410_8410_8410h] ; MSB mask

	; Load 4 pixels from each buffer
		movq	mm0, [esi]	; 1
		movq	mm1, [edi]	; 2
		movq	mm2, mm0	; -
		movq	mm3, mm1	; 3

	; Isolate MSBs and the rest of bits
		pand	mm0, mm6	; -  mm6 == not-MSB mask
		pand	mm1, mm6	; 4
		pand	mm2, mm7	; -  mm7 == MSB mask
		pand	mm3, mm7	; 5

	; Do the addition
		paddw	mm0, mm1	; -

	; Create mask
		movq	mm1, mm2	; 6
		por	mm2, mm3	; -  mm2 = MSB1 | MSB2
		pand	mm1, mm3	; 7  mm1 = MSB1 & MSB2
		movq	mm3, mm2	; -
		pand	mm3, mm0	; 8  mm3 = (MSB1|MSB2)&newMSB

	; Compose result with MSBs
		por	mm0, mm2	; -

	; Finish mask creation
		por	mm1, mm3	; 9
		psrlw	mm1, 4		; 10
		paddw	mm1, mm6	; 11  mm6 == not-MSB mask
		pxor	mm1, mm6	; 12

	; Store result ORed with mask
		por	mm0, mm1	; 13
		movq	[edi], mm0	; 15 (was pipeline stall)

And pixel average:


	; Load unused registers with constants (before the loop)
		movq	mm6, [_7BEF_7BEF_7BEF_7BEFh] ; not-MSB mask
		movq	mm7, [_0821_0821_0821_0821h] ; LSB mask

	; Load 4 pixels from each buffer
		movq	mm0, [esi]	; 1
		movq	mm1, [edi]	; 2

	; Create carry mask of LSB addition
		movq	mm2, mm0	; -
		pand	mm2, mm1	; 3
		psrlq	mm0, 1		; -
		pand	mm2, mm7	; 4  mm7 == LSB mask

	; Calculate average
		psrlq	mm1, 1		; -
		pand	mm0, mm6	; 5  mm6 == not-MSB mask
		pand	mm1, mm6	; -
		paddw	mm0, mm1	; 6
		paddw	mm0, mm2	; 7

	; Store result
		movq	[edi], mm0	; 9 (was pipeline stall)

Now we see, that we can do adding with saturation in 15 cycles, and pixel average in 9 cycles. Both routines process four pixels in a pass, so the first takes 3.75 cycles per pixel and the other 2.25 cycles per pixel. Compare this to BRAD's integer case! And there is even a greater speed improvement over my own integer routines from Hugi #17!

There is also a matter of cache misses. As the routines access new and new pixels, they often incur cache miss penalty. If we use the prefetch instruction (introduced in Pentium III and K6-2), the routines will be really as fast as we estimate.

Both prefetch and loop management can be realized using the remaining gaps in the pipeline, so the final routines will take about two clocks more (0.5 clocks per pixel extra).

How it works

If you are interested in detailed description of the algorithm, refer to BRAD's article - he is the author of the original method.

In opposite to BRAD I used not-MSB mask to create carry masks in the pixel average routine, and thus reduced the number of constants needed. Will it work? The table below explains it showing step by step what happens to the isolated MSB.


ﾚﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾂﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄﾄｿ
ｳ    red and blue   ｳ	   green       ｳ
ﾃﾄﾄﾄﾄﾄﾄﾄﾄﾄﾂﾄﾄﾄﾄﾄﾄﾄﾄﾄﾅﾄﾄﾄﾄﾄﾄﾄﾄﾄﾂﾄﾄﾄﾄﾄﾄﾄﾄｴ
ｳ no carryｳcarry    ｳ no carryｳcarry   ｳ
ﾃﾄﾄﾄﾄﾄﾄﾄﾄﾄﾅﾄﾄﾄﾄﾄﾄﾄﾄﾄﾅﾄﾄﾄﾄﾄﾄﾄﾄﾄﾅﾄﾄﾄﾄﾄﾄﾄﾄｴ
ｳ 00000b  ｳ10000b   ｳ 000000b ｳ100000b ｳ
ﾃﾄﾄﾄﾄﾄﾄﾄﾄﾄﾁﾄﾄﾄﾄﾄﾄﾄﾄﾄﾅﾄﾄﾄﾄﾄﾄﾄﾄﾄﾁﾄﾄﾄﾄﾄﾄﾄﾄｴ
ｳ shift right by 4  ｳ shift right by 4 ｳ
ﾃﾄﾄﾄﾄﾄﾄﾄﾄﾄﾂﾄﾄﾄﾄﾄﾄﾄﾄﾄﾅﾄﾄﾄﾄﾄﾄﾄﾄﾄﾂﾄﾄﾄﾄﾄﾄﾄﾄｴ
ｳ 00000b  ｳ00001b   ｳ 000000b ｳ000010b ｳ
ﾃﾄﾄﾄﾄﾄﾄﾄﾄﾄﾁﾄﾄﾄﾄﾄﾄﾄﾄﾄﾅﾄﾄﾄﾄﾄﾄﾄﾄﾄﾁﾄﾄﾄﾄﾄﾄﾄﾄｴ
ｳ add 01111b	    ｳ add 011111b      ｳ
ﾃﾄﾄﾄﾄﾄﾄﾄﾄﾄﾂﾄﾄﾄﾄﾄﾄﾄﾄﾄﾅﾄﾄﾄﾄﾄﾄﾄﾄﾄﾂﾄﾄﾄﾄﾄﾄﾄﾄｴ
ｳ 01111b  ｳ10000b   ｳ 011111b ｳ100001b ｳ
ﾃﾄﾄﾄﾄﾄﾄﾄﾄﾄﾁﾄﾄﾄﾄﾄﾄﾄﾄﾄﾅﾄﾄﾄﾄﾄﾄﾄﾄﾄﾁﾄﾄﾄﾄﾄﾄﾄﾄｴ
ｳ xor 01111b	    ｳ xor 011111b      ｳ
ﾃﾄﾄﾄﾄﾄﾄﾄﾄﾄﾂﾄﾄﾄﾄﾄﾄﾄﾄﾄﾅﾄﾄﾄﾄﾄﾄﾄﾄﾄﾂﾄﾄﾄﾄﾄﾄﾄﾄｴ
ｳ 00000b  ｳ11111b   ｳ 000000b ｳ111110b ｳ
ﾀﾄﾄﾄﾄﾄﾄﾄﾄﾄﾁﾄﾄﾄﾄﾄﾄﾄﾄﾄﾁﾄﾄﾄﾄﾄﾄﾄﾄﾄﾁﾄﾄﾄﾄﾄﾄﾄﾄﾙ

If you are not familiar with MMX, refer to Intel manuals.

Credits

Special thanks to Rawhed - he has written the original article about the topic. Also thanks to TAD. If not him, I would not write about the MMX issue. Further thanks to BRAD for his better algorithm.

Chris Dragan