Adding 16bpp pixels using MMX

Chris Dragan

Last time I presented a reasonably fast way of adding 16bpp (bit-per-pixel) pixels with integer unit. "Adding with saturation" took 13.5 clocks per pixel and "pixel average" took 5 clocks per pixel.

Nowadays all (new) computers have processors supporting MMX instruction set. What can one do with it? Let's see.

The workaround

BRAD presented a better way of adding 16bpp pixels with the integer unit. As MMX registers are 64-bit, we can take another step forward by performing on 4 16-bit pixels exclusively, which will double the efficiency of the algorithm.


The first routine does "adding with saturation" and the other "pixel average". Both routines read pixels from buffer/stream at [esi], mix them with pixels at [edi] and store results also at [edi]. The MMX specifics, i.e. lack of immediates, requires us to keep masks in memory or in unused MMX registers. It is better to keep constants in registers, since instructions that access memory can be paired only in U-pipe.

And like before, by right of each instruction I marked the cycle in which it is issued. '-' means that the instruction is paired in the V-pipe, i.e. it is issued in the same cycle as the previous instruction. Note that pairing concerns only Pentium MMX.

	; Load unused registers with constants (before the loop)
		movq	mm6, [_7BEF_7BEF_7BEF_7BEFh] ; not-MSB mask
		movq	mm7, [_8410_8410_8410_8410h] ; MSB mask

	; Load 4 pixels from each buffer
		movq	mm0, [esi]	; 1
		movq	mm1, [edi]	; 2
		movq	mm2, mm0	; -
		movq	mm3, mm1	; 3

	; Isolate MSBs and the rest of bits
		pand	mm0, mm6	; -  mm6 == not-MSB mask
		pand	mm1, mm6	; 4
		pand	mm2, mm7	; -  mm7 == MSB mask
		pand	mm3, mm7	; 5

	; Do the addition
		paddw	mm0, mm1	; -

	; Create mask
		movq	mm1, mm2	; 6
		por	mm2, mm3	; -  mm2 = MSB1 | MSB2
		pand	mm1, mm3	; 7  mm1 = MSB1 & MSB2
		movq	mm3, mm2	; -
		pand	mm3, mm0	; 8  mm3 = (MSB1|MSB2)&newMSB

	; Compose result with MSBs
		por	mm0, mm2	; -

	; Finish mask creation
		por	mm1, mm3	; 9
		psrlw	mm1, 4		; 10
		paddw	mm1, mm6	; 11  mm6 == not-MSB mask
		pxor	mm1, mm6	; 12

	; Store result ORed with mask
		por	mm0, mm1	; 13
		movq	[edi], mm0	; 15 (was pipeline stall)

And pixel average:

	; Load unused registers with constants (before the loop)
		movq	mm6, [_7BEF_7BEF_7BEF_7BEFh] ; not-MSB mask
		movq	mm7, [_0821_0821_0821_0821h] ; LSB mask

	; Load 4 pixels from each buffer
		movq	mm0, [esi]	; 1
		movq	mm1, [edi]	; 2

	; Create carry mask of LSB addition
		movq	mm2, mm0	; -
		pand	mm2, mm1	; 3
		psrlq	mm0, 1		; -
		pand	mm2, mm7	; 4  mm7 == LSB mask

	; Calculate average
		psrlq	mm1, 1		; -
		pand	mm0, mm6	; 5  mm6 == not-MSB mask
		pand	mm1, mm6	; -
		paddw	mm0, mm1	; 6
		paddw	mm0, mm2	; 7

	; Store result
		movq	[edi], mm0	; 9 (was pipeline stall)

Now we see, that we can do adding with saturation in 15 cycles, and pixel average in 9 cycles. Both routines process four pixels in a pass, so the first takes 3.75 cycles per pixel and the other 2.25 cycles per pixel. Compare this to BRAD's integer case! And there is even a greater speed improvement over my own integer routines from Hugi #17!

There is also a matter of cache misses. As the routines access new and new pixels, they often incur cache miss penalty. If we use the prefetch instruction (introduced in Pentium III and K6-2), the routines will be really as fast as we estimate.

Both prefetch and loop management can be realized using the remaining gaps in the pipeline, so the final routines will take about two clocks more (0.5 clocks per pixel extra).

How it works

If you are interested in detailed description of the algorithm, refer to BRAD's article - he is the author of the original method.

In opposite to BRAD I used not-MSB mask to create carry masks in the pixel average routine, and thus reduced the number of constants needed. Will it work? The table below explains it showing step by step what happens to the isolated MSB.

    red and blue   	   green       
 no carry쿬arry     no carry쿬arry   
 00000b  10000b    000000b 100000b 
 shift right by 4   shift right by 4 
 00000b  00001b    000000b 000010b 
 add 01111b	     add 011111b      
 01111b  10000b    011111b 100001b 
 xor 01111b	     xor 011111b      
 00000b  11111b    000000b 111110b 

If you are not familiar with MMX, refer to Intel manuals.


Special thanks to Rawhed - he has written the original article about the topic. Also thanks to TAD. If not him, I would not write about the MMX issue. Further thanks to BRAD for his better algorithm.

Chris Dragan