Introduction

I am writing this article in response to Chris Dragan's article who wrote his in response to Rawhed's one who... Wow, this 16 bits must surely have something special!

Anyway, I read those articles and I thought that I might try a different approach, and since I've just got into assembly (I mean serious & optimized assembly stuff) posting this article to Hugi might just show me if I'm going the right way about this.

So, as you might know from the above articles everything's about adding 16 bit pixels together, and there are (mainly) two ways to do that:

1. Averaged add - when you average each RGB component of the pixel with the other pixel's one.

2. Saturated add - when you sum up each RGB component and then you do a cut-off for each one's maximum value (to avoid overflows).

Notes

a. The 16 bit pixel is assumed to be a word with the RGB components represented as "rrrrrggggggbbbbb" so a 5-6-5 bit configuration.

b. In the rest of the article the MSB/LSB notation will refer to the most/least significant bit of each RGB component of the pixel NOT to that of the word (or those of the bytes) where the pixel is stored.

c. Constants used (as you will see we will process two pixels at the same time so we need 32 bit constants):

the mask for the most signficant bit for each component;

the inverted mask for the most signficant bit for each component;

the mask for the least signficant bit for each component;

the inverted mask for the least significant bit for each component;

SAT_CT1 == 7BDF7BDFh

SAT_CT2 == 7BFF7BFFh

d. The code samples are optimized for a Pentium machine.

To begin with, today's commandment is: "Thou shalt not split the pixel into components!", and we shall respect it in what will follow.

Well this is the simplest case, because all that has to be done is a simple average. Of course we have a commandment to keep, so averaging might seem more tricky. In fact it's not difficult at all. The thing is that we have to avoid the components from overflowing into each other. To do this we will first right shift the pixel by one, then mask out the MSB of each component, and just after that add them together. It's just like (a+b)/2 becomes a/2+b/2, and that won't overflow for sure.

How about the last bit of each component? Well we must notice that only if both components have the last bit set those bits would make a difference. So, first we save only those last bits that are set in both components, and then we add them to the final result.

In assembly:

```
CPU
ticks
loop1:
1	mov eax, [esi]			; load the pixels
-	mov ebx, [edi]
2	mov ecx, eax			; save LSB's from eax
-	and eax, LSB_COMP_IMASK32	; clear them
3	shr eax, 1			; eax/2
-	and ecx, ebx			; select only common LSB's
4	shr ebx, 1			; ebx/2
-	and ecx, LSB_COMP_MASK32	; isolate common LSB's
5	and ebx, MSB_COMP_IMASK32	; clear ebx MSB's
7	mov [edi], eax			; store
8	cmp edi, MAX
-	jnz loop1
9

```
So we get 4.5 ticks per pixel (since we do two at the same time), loop management included, pairing 100%, edx freed up.

There also exists a very important tradeoff one can do, and that is to forget about the last bits. Yeah, I know it sounds like some kind of murder and you would expect the results to be completely different. In theory, on average 50% of the pixels will be darker by approx. 3.125%.

In the real world it's hardly distinguishable from the correct result; if you show someone the initial pictures and the results, he won't be able to tell which one is the 'arithmetically accurate' one - unless he's working for the company that set the Pantone color standard :) The benefits are important - only 3 ticks per pixel and two free registers, so if you also unroll...

This is a little bit more complicated, but we still can keep the commandment. The thing we must do first is ensure that the components won't overflow, by clearing their MSBs (which we have to save first).

Now, which conditions have to be met so that an overflow may occur? There are two such conditions:
a) Both components have their MSBs set;
b) After the addition the MSB is set (although we have cleared it first);

So we must do the following:
- Save the MSBs
- Clear them
- Restore cleared bits
- Check for MSBs that have been set again
- Check for common bits in the saved MSBs
- Create a mask that would saturate the 'guilty' components, using the common bits and the cleared & re-set bits.

Since we work only with the MSBs, after doing a bitwise 'or' between the common MSBs and the cleared & re-set MSBs, we shall get something like -

```
x0000x00000x0000
|    |	   |
red  green blue

x = 1 if overflow, else 0

```
- that is one MSB will be set only if that component overflows.

The method I figured for getting the mask (that would be for example 1111100000000000 for a red overflow, etc.) is to shift right four bits, add 0111101111011111b, then xor with 0111101111111111b (just try it and you will figure it out).

However, this method isn't 100% accurate - it erroneously computes the LSB of the green component in 50% of cases. But since green has the most bits and we are talking about the least significant bit - I bet not even a Pantone guy could tell the difference!

Anyway if you prefer a precise result the solution is to shift the 'middle' bit one more position to the right and to use 0111101111101111b for the add and for the xor, and by pairing you won't waste too many cycles.

Let's see my version in assembly:

```
CPU
ticks
loop1:
1	mov eax, [edi]			; load the pixels
-	mov ebx, [esi]
2	mov ecx, eax			; save MSBs
-	mov edx, ebx
3	and eax, MSB_COMP_IMASK32	; clear MSBs
4	and edx, MSB_COMP_MASK32	; isolate saved MSBs
5	and ecx, MSB_COMP_MASK32	; isolate the other MSBs
-	mov ebx, edx
6	or edx, ecx			; ecx <- all the saved MSBs
-	and ebx, ecx			; ebx <- the common MSBs
7	mov ecx, edx
-	and edx, eax			; ecx <- cleared & re-set MSBs
8	or eax, ecx			; restore the cleared MSBs
-	or edx, ebx			; create the mask
9	shr edx, 4
11	xor edx, SAT_CT2
12	or eax, edx			; saturate the pixels
-	cmp edi, MAX
13	mov [edi-4], eax		; store
-	jnz loop1
14

```
As you can see, that's 7 ticks per pixel, loop management included, 96% pairing.

Conclusions

First of all, both methods presented above have been tested and they work. :) They are very fast (suitable for realtime 'multimedia applications' :)). They have high pairing percentages, so they would make a Pentium happy, and they don't have branches so the P6 can't complain either. They have quality/speed tradeoffs to satisfy whatever necessities one might have. Since they don't rely on carries and the register pressure is low, an MMX implementation would be straightforward and highly efficient (about twice as fast) - as you should better see from Chris Dragan's article.

Closing words

I hope someone will find this stuff useful, bla, bla... Let me know if you find some (relevant) mistakes, if you do a better implementation, or if you have other (relevant) ideas. :)

Bests,