0 - Intro

Well, although hardware rendering is now commonly used and has hopefully relieved coders of the most time-consuming part of 3D coding (polyfillers), software rendering still has, I think, its own appealing charm.

Actually, the more one uses hardware, the more one gets back to software with pleasure. Sometimes. :)

Anyway, here are some thoughts/methods about texture mapping. Not exhaustive, not even the best-optimized. Just some hints coming from experience.

1 - Definitions

When one enters the polyfiller stage of a 3D pipeline, all the necessary computations (gradients, rasterization, ...) are supposed to be done. Whether you use spans, s-buffers, AELs, or anything else, you're now down to drawing the texels as fast as possible. Simple as that.

So, we suppose you've set up the following:

int Scan_Start: integer giving the first row to draw, at the top of the poly.

int Scan_Height: number of scanlines to draw down to the bottom.

int Scan_Left[]: an array of size equal to the screen's height, giving the left x-coordinate of each scanline; only the entries between Scan_Start and Scan_Start+Scan_Height-1 are used.

int Scan_Right[]: Same thing, but with the right x-coordinate.

float Scan_UW[], Scan_VW[], Scan_W[]: for each scanline, these arrays give the left-edge value of any field you want to interpolate (with perspective correction) along the scanline. Here, they'll be the U.w, V.w, and w texture-mapping quantities, for instance.

float dxUW, dxVW, dxW: The associated horizontal gradients for the corresponding fields. They are constant all over the poly.

Now, we're up to drawing this poly on screen. We will talk about general convex n-gons instead of just triangles, because it doesn't change anything, and doing some clipping will require being able to deal with n-gons...

2 - Some useful formulae

Just to clarify/recall the above concepts and notations, let's get back to the related maths behind gradients and polys:

Suppose that we have a field f, defined only at vertices of the polygon, and linearly interpolated everywhere else. This field can be anything you want, but linear in space: U-V mapping coordinates, Gouraud intensity, normal, transparency, etc...

When we talk about a 'linearly interpolated' field f on the poly's surface, it means that:

f( t.P1 + (1-t).P2 ) = t.f1 + (1-t).f2

with P1 and P2 being any two vertices of the poly where f respectively takes the values f1 [=f(P1)] and f2 [=f(P2)]. t is a float, in [0,1], say.

Stating that f is linear on the poly is equivalent to saying that f(x,y,z) has constant gradient. By definition, the gradient of the function is just the vector whose coordinates are df/dx, df/dy, and df/dz.

Well, it's vital that the values f1, f2, ..., fn you've assigned to the n vertices of your polygon stay 'coherent' with this assumed linearity. 'Coherent' means that if you have more than 3 vertices, the value of f deduced (by linearity) from the first 3 vertices, at the position of the fourth, must be equal to the f4 value you've assigned there. [Put otherwise, you must have a real hyperplane in the 4-dimensional (x,y,z,f) space.] It's not always a trivial assumption when one deals, for instance, with Gouraud illumination: computing the scalar product Normal.Light_Direction at each vertex to find its illumination has a 99.9% chance of not being coherent on a 4-gon. The same thing holds for env-mapping, reflection-mapping, or even a badly clipped polygon...

Well, bad news about linearity will come once you've projected your polygon on screen. Perspective transform isn't linear, and your f quantity won't stay linear according to screen coordinates. Just have a look at the (equi-spaced) stripes on a straight road...

Projecting a (x,y,z) space point on screen amounts to something like doing X=x/z and Y=y/z (up to a scaling and translation of the z-axis). We'll use the notation w=1/z, as it's only that quantity that matters...

Let's define the function F(X,Y,z) = f(zX,zY,z). F is just the value on screen (at "pixel" X,Y) associated with the value of f at *space* point (x,y,z). The relation between triplets (X,Y,z) and (x,y,z) is just x=z.X and y=z.Y, as stated above; or X=w.x and Y=w.y if you prefer. Let's then compute the gradient of F() and try to relate it to gradient of f() (which is constant). It gives:

dF/dX = df/dx . d(zX)/dX = df/dx . z
dF/dY = df/dy . d(zY)/dY = df/dy . z

Ugh, grad(F) is just not constant, even if grad(f) is. Too bad.

But there's only an annoying 'z' factor. We can try to remove it. Let's then compute the gradient of the function G() = 1/z.F(X,Y,z) = w.F(X,Y,z). From the usual derivation rules (at fixed z, hence fixed w):

dG/dX = d(w.F)/dX = w . dF/dX = w.z . df/dx = df/dx
dG/dY = d(w.F)/dY = w . dF/dY = w.z . df/dy = df/dy

Bingo! It's a constant gradient, on screen coordinates. As a bonus, G() even has the same gradient as f().

Anyway, this last property isn't very useful: Even if a constant gradient is easy to compute with something like

df/dx = [ f(x+L,y,z) - f(x,y,z) ] / L

we sure won't have the required values of f at the points (x,y,z) and (x+L,y,z), simply because we only know the f values at the poly's vertices, through the given f1, f2, ..., fn. And these vertices have no reason to be axis-aligned. So we must compute G's gradients otherwise, from the given vertex coordinates and the set (f1,...,fn) of values.

Conclusion of all this is just: if f(x,y,z) is linear in space, f(x,y,z)/z will be linear on screen. Period.

Ok, the matter of computing gradients from vertices' values would require a text in itself with maths and hyperplanes, so I will only give one of the possibly useful formulae:

df/dX = ( df2.dY1 - df1.dY2 ) / ( dX2.dY1 - dX1.dY2 )
df/dY = ( df1.dX2 - df2.dX1 ) / ( dX2.dY1 - dX1.dY2 )

giving the screen gradient per pixel of f, once projected. Because of the above coherency in f definition, we can take any 3 vertices we want amongst the n ones of the n-gon. It ought to give the same result *if* f does not violate this required coherency. For notations, the index '1' means: "difference between the 2nd and the 1st selected vertex", whereas the index '2' is the "difference between the 3rd and the 2nd vertex". For instance: dX2 = X3-X2, df1 = f2-f1, etc...

You'll have noticed that:

1. On the denominator, you have, up to a factor 2, the (oriented) area, on screen, of the triangle (P1,P2,P3). This only needs to be computed once, and can be a very useful quantity for backface culling, mipmap selection, or even shader selection (would you like to do full-perspective-correct-Phong-bumped-texture-mapping on a poly whose area is less than 1 pixel? :).

2. These formulae are rather stable numerically. Even for ill-defined polygons. Contrary to other 'old' formulae using 'greatest scanline length' for instance...

3. It uses (X,Y) screen coordinates, but we could also use real (x,y,z) space coordinates. It's just a matter of choice. I rather prefer dealing with (x.w,y.w,w) projected coordinates as soon as possible. Even before 3D clipping (since this one can be done with projected coordinates as easily as with real space coordinates). You're not even obliged to clip the poly prior to gradient computing. Just keep using the (xw,yw,w) values, even if w is negative, till the clipping stage... Anyway, beware of very small z values... [Mathematically, we're using the RP2 projective space.]

4. You can rearrange the whole thing to only use U and V instead of U/z and V/z. At the expense of additional fmuls, it can save some 1/z operations, and thus some fdivs...

Now that we have our gradients (dxUW, dxVW, dxW, etc.), we'll need, while drawing the poly, to step the current UW and VW values with these gradients. But remember we're only interested in the final U and V coordinates, not UW and VW.

Deducing U and V while stepping UW, VW and W along the scanline is just a matter of computing UW/W and VW/W. This fdiv with W is too costly to be performed at each pixel, so we usually deal with spans of L pixels (L=8,16,32...) where correct U and V are computed at both ends, and interpolated in between.

For instance, the gradient in U over a distance of L pixels is given by:

`(1)      dU = L. (dUW - U.dw) / (w + L.dw)`

What a strange expression! U itself appears in the formula!

In fact, it just comes from simplifying the raw definition of dU. Compute U at point x, compute U at point x+L, subtract, divide by L: you have the gradient of U over L pixels starting from point x. Explicitly:

```(2)      dU = ( w.U + L.dUW ) / ( w + L.dw )   // <= U-value at point x+L
            - ( w.U         ) / ( w        )      // <= current U-value at point x```

Reducing (2) gives (1)... You'll have noticed that I now use the notation dUW instead of dxUW as in the previous paragraphs, just because we'll only need x-gradients from now on. You could also try perspective correction on an LxL block basis (i.e., vertically as well as horizontally) instead. Oh well...

Indeed, (1) can be more useful than (2) for implementation. Here's why:

The 'classical' algo for doing L-pixels perspective correction reads something like:

1. I have, at point x, the values of Uw and w. Therefore I know the real U value by using: U = Uw/w.

2. Step w to point x+L: w -> w+dw

3. Launch the next 1/w fdiv in parallel while drawing previous span from x-L to x-1, whose dU gradient is known.

4. Once the fdiv is done, use (2) to compute the new dU for the next-to-draw span [x, x+L], stepping Uw too...

That's perfect. Except that you need to store the 2 successive values of Uw, compute the gradient each time, "divide" by L, etc...

Let's then have a closer look at (1). We _define_:

DwU = dUw - U.dw

What's interesting about the DwU quantity? Well, the U gradient *per pixel* (not per L pixels) is just, following (1):

dU/L = DwU / (w+L.dw)

When we are finished with the drawing of previous span, the U coordinate will have, because of linear interpolation, changed according to:

U -> U + L.(dU/L)

How will the DwU quantity have changed? Simply substitute the above variation into the definition of DwU. It gives:

DwU -> DwU - (dU/L).(L.dw)

Aha!

As seen, updating DwU *only* requires the gradient (dU/L) [the one you've just computed in order to draw the span] and L.dw [the actual increment of w, constant over the whole poly if the same L-correction is applied on each scanline].

DwU now appears as a "w-density of U-gradient" and can advantageously be used instead of (2). The algorithm reads:

Initialization:

- With the FPU, compute initial U (=Uw/w) and V.

- From U, compute the initial DwU (=dUw-U.dw) and DwV.

- Store U and V in the CPU as fixed point mapping coords. This U and V won't show anymore in the FPU. How cool.

- Only keep in the FPU the values for DwU, DwV, L.dw and current w, of course.

While drawing the previous L-span, perform the following:

- w' = w + (L.dw) (step w)

- compute z' = 1/w' (fdiv in parallel, of course)

- dU/L = DwU*z' (next span's U-gradient)

- DwU -= (dU/L).(L.dw) (step DwU)
(same for dV/L and DwV... or anything else...)

I favor this algo because:

1. The actual value of L only appears in the (L.dw) FPU quantity. It's a relief for maintaining loops for various L (2,4,8,16,32...). You can even choose a different, best-fit L for each scanline. Or even worse: dynamically change the value of L (by rescaling L.dw) along the scanline (a small L when z is small, a larger L when z is farther...)

2. You only need to compute/store the actual U and V at start, in the CPU. Anyway, beware of the last span: you'll have to compute the exact (not linearly interpolated) U and V in order to avoid overflowing the texture's limits with incorrect interpolation of the 1/z hyperbola.

3. As a matter of fact, compared to the classical way, there are fewer FPU manipulations during drawing.

4. Various fixed-point formats for storing the integer gradients are supported, just by multiplying by the proper constant at initialization. You can even use different formats in the CPU and FPU. I personally use 8:32 in the FPU (even if this #@#@^# 'fist qword' instruction does not exist), and 8:16 in the CPU.

5. It just is mine. :)

4 - Basic code example

Ok, 'nuff with L-pixel perspective-correct mapping. We'll now just deal with so-called 'super-linear' mapping in 16-bit color. 'Super-linear' just means computing the correct Uw/w and Vw/w mapping coordinates at both ends of the scanline, and interpolating along the whole scanline instead of every L pixels. This is a good practice example when ASM coding is at stake...

In C, it should look something like:

```void Draw_Poly( void )
{
    /* set up elsewhere, as described in section 1 */
    extern int   Scan_Start, Scan_Height;
    extern int   Scan_Left[], Scan_Right[];
    extern float Scan_UW[], Scan_VW[], Scan_W[];
    extern float dxUW, dxVW, dxW;
    extern unsigned short *Screen_Start, *Texture;
    extern int   Screen_Width;

    int y, dy;
    unsigned short *Dst;

    y   = Scan_Start;
    Dst = Screen_Start + y*Screen_Width;

    for( dy = Scan_Height; dy>0; dy--, y++, Dst+=Screen_Width )
    {
        float U,V, dU, dV, w, z;
        int ui, dui, vi, dvi;
        unsigned short *Ptr;

        int Length = Scan_Right[y] - Scan_Left[y];
        // If you perform some sub-pixel correction, zero-Length
        // scanlines *can* happen. Hence the test...
        if ( Length<=0 ) continue;

        w = Scan_W[y];
        z = 1.0/w;        // this is the left z value. 1st fdiv here.
        U = Scan_UW[y] * z;
        V = Scan_VW[y] * z;

        w += dxW*Length;  // Right point
        z = 1.0/w;        // 2nd fdiv
        dU = ( dxUW - U*dxW ) * z;  // this is the U gradient
                                    // for this scanline
        dV = ( dxVW - V*dxW ) * z;  // ...V gradient...

        // get'em in 8:16 fixed point
        ui  = (int)(  U*65536.0 );
        dui = (int)( dU*65536.0 );
        vi  = (int)(  V*65536.0 );
        dvi = (int)( dV*65536.0 );

        // ...and map'em
        Ptr = Dst + Scan_Right[y];
        Length = -Length;         // we're drawing from left to right
        while ( Length<0 )
        {
            // 256x256 texture: U in the low byte, V in the high one
            Ptr[Length] = Texture[ ((ui>>16)&0xFF) | (((vi>>16)&0xFF)<<8) ];
            ui += dui; vi += dvi;
            Length++;
        }
    }
}```

NOTE: I don't know if the previous compiles or something. It's just an example, recalled from memory...

Texture should be a pointer to an unsigned short 256x256 texture (16-bit RGB colors).

All right. This indeed needs optimization and *then* ASM-optimization, not the reverse. :) 'Optimization' here stands for "saving little cycles by dealing in priority with degenerate cases before going for the general inner loop". Just a matter of philosophy (or taste, say), huh. I prefer dealing with little polys first, before handling big ones. Well...

5 - General optimizations

I advise you to take care of spurious cases first. It's tedious but, from experience, they *always* happen one time or another. :)

Some examples:

- Even if we have a special case for Length=0, what about a special case for Length=1? This case doesn't require any gradient computation, and will happen, on average, twice per poly. And even more for thin polys... Let's save some fdivs, then.

- While we're talking about small Length values, what about using simple linear mapping for Length<4, e.g.? The same applies to scanlines whose 1/z values are not very different at both ends... Let's then use DwU/w instead of the 'real' dU... It will hardly be noticed.

- Anyhow, you still have to perform the 1st fdiv for the first pixel, no matter how early you bail out. Well, what about some cache warm-up during this unavoidable fdiv? Something like a 'cmp [Screen],reg' won't hurt during this fdiv and will preload the destination cache line.

- With the first fdiv completed, you now have the real U and V coords, and are heading for the second fdiv at the end of the scanline. What about warming up the texture cache during this fdiv, by mapping the first texel, huh? Even with mipmapping, texture cache misses are painful, and this makes one less to bear...

- Don't increment U and V in the inner loop for the scanline's last pixel. 2 ticks less.

- etc... etc...

Enough said. Now it's ASM show time, the root of all evil. :)

6 - Inner loop

Cycle hunting in the inner loop is important, but not as vital as it seems, since span setup (gradients, stepping, etc.) is also rather time-consuming in perspective-correct mapping. Anyway, it sure deserves a bit of care.

So, we are about to perform something like:

```        while( Span_Len<0 )
        {
            Texel = Texture[ Ui, Vi ]
            Uf += dUf            // 16-bit fractional part
            Ui += dUi            // 8-bit integer part (add with carry)
            Vf += dVf            // 16-bit fractional part
            Vi += dVi            // 8-bit integer part (add with carry)
            Screen[ Span_Len++ ] = Texel
        }```

We ought to aim for instruction pairing on Intel's processors. Here is a (good?) ASM compromise for 16-bit mapping (rather difficult to handle, compared to 8- or 32-bit texels):

``` /////////////////////////////////////////////////////////////////////
;    registers:
; eax:   dUf ||  texel
; ebx:    0  || Vi  | Ui
; ecx:    Vf ||  junk
; edx:   dVf || dVi | dUi
; ebp:     counter (negative)
; esi:         Uf  ||  junk
; edi:   screen pointer

.Loop:        ; 6 ticks (without cache-misses and mis-alignments)
mov ax,[0x12345678+ebx*2]    ; 2 ticks       (ax=Texel)
add ecx,edx                  ; (Vf+=dVf)

adc bh,dh                    ; 1 tick        (Vi+=dVi)
add esi,eax                  ; (Uf+=dUf)

adc bl,dl                    ; 1 tick        (Ui+=dUi)
inc ebp

mov [edi+ebp*2],ax           ; 2 ticks       (plot pixel)
jl .Loop

/////////////////////////////////////////////////////////////////////```

Intelish constraints are rather painful here. Namely:

- adc is only pairable in the U-pipe. Beuuargh...

- Increment should preferably be stored in registers instead of memory, because 'add reg,[mem]' costs 2 cycles instead of 1 for 'add reg,reg'. Same goes for 'adc reg,[mem]'.

- Oddly enough, even if a 'mov reg,[mem]' (with reg 8 or 32 bits) only costs 1 tick with proper alignment, 'mov reg16,[mem]' does not pair here, just because prefixed (0x66) 16-bit instructions are only pairable in the U-pipe. And also because of the 1-tick penalty due to prefix decoding. Nevertheless, this decoding can be done while executing the previous instruction, provided it's a 2-tick one. Not the case here. Re-beuuuargh.

- The last 'mov [edi+2*ebp]' is prefixed too. It therefore deserves the U-pipe. That's why the loop doesn't end with the expected pair 'inc ebp / jl'. And now, with the 'inc ebp' before the 'mov', we get an AGI stall. But this AGI will then merge with the following prefix decoding. Let's call it a deal. [Don't forget to subtract 2 bytes from edi before the loop, just to keep the pointer right...]

- I personally patch the texture address directly into the inner loop's first instruction (boooowww, self-modifying code) at the very beginning of drawing. You then always have U=0 and V=0 for the first mapped texel. And no need for 64k-aligned textures...

- Code for PII should be designed very differently, just to take into account the 3-decoding pipelines, instruction fetches, and other branch-prediction goodies...

- Unrolling/remastering of instructions could speed this inner loop up to something like 5 clocks/pixel. Left as a (boring) exercise... :)

7 - Conclusion

Why spend so much time on 'super-linear' mapping instead of L-pixel perspective-correct mapping? Well, most of the time, this method is just perfect and won't require more. As an example, here are some statistics giving the correction rate needed to render 2 rather different scenes without the lack of proper perspective correction being noticeable (i.e., the textures don't look like they're 'floating' and 'sliding'):

``` <exploram.3ds> (902542 polys drawn):
Lin:71.0%  64p:14.3%  32p:6.5%  16p:3.0%  8p:1.2%  4p:1.9%  2p:2.1%
<ssalle4.3ds> (61669 polys drawn)
Lin:12.1%  64p:18.3%  32p:12.3% 16p:18.4% 8p:20.2% 4p:8.6%  2p:10.0%```

These figures are just the percentages of polys drawn with super-linear mapping, and with correction every 64, 32, 16, 8, 4, and eventually 2 pixels of perspective-correct mapping (respectively). The first scene has lots of small polys whereas the second has (Quake-like) big polys. The latter does require more correction, but not that much. It's up to you to choose the magic constants which make you switch between the different methods. Some tuning is needed, but it's worth the reward regarding cycle saving.

Skal / BomB!
Pascal.Massimino@ens.fr
Pascal.Massimino@ptm.u-cergy.fr
http://www.eleves.ens.fr:8080/home/massimin/
http://bomb.citeweb.net/
Fri Feb 19 08:14:39 GMT 1999