Optimizing For The Coprocessor

Dario Phong

01 - Introduction
02 - Pairing
03 - Overlapping
04 - An example
05 - Loading
06 - Fmul
07 - Fdiv, fsqrt
08 - Fstp
09 - Another example
10 - Control Word
11 - Dependency chains
12 - Inverses
13 - Counting clocks
14 - Last example
15 - Concepts
16 - Closing words
17 - Contacting

01 - Introduction

Nowadays almost every coder knows how the coprocessor works. If there are any people who still don't know how the coprocessor works, then contact me and maybe I will do an article about it. Now it's time to optimize.

If you got interested in this subject one day but didn't find any document that could enlight the way, then this is yours.

If you have already read about this topic and experimented with it, then better don't read this doc 'coz it would be like reading the same newspaper twice. ;-D

At first I want to remind you that improving the algorithm is usually better than optimizing all the code. Let's imagine we have a formula for rotating two points in 2D:

        U = x*cos(alpha) - y*sin(alpha)
        V = y*cos(alpha) + x*sin(alpha)

We could do first the sin and cos and then the rest:

        - Do the sin(alpha)
        - Do the cos(alpha)
        - Multiply
        - Add and sub

And if we have a look to the timing of every instruction we will see that the add, the sub and the mul take only 3 clock cycles while the cos and sin take 65! So doing them twice isn't very funny! E-)

I will not explain any more here 'coz it will be explained in section 04. If you don't understand something at some passage, then go to section 15. Oh! And all the examples are in ASM. If you still think that your 32-bit C++ compiler can do the best job, then be disappointed, use in-line ASM or only pure ASM.

02 - Pairing

I suppose that you have already optimized for the Pentium, so you know what pairing means. Pairing is when two instructions are executed in the same clock cycle. It's the work of two pipelines, the U pipeline (the better one) and the V one (the other one ;-). (They are called U and V like the coordinates in a texture. ;-)

If you expect doing a lot of pairing, then be disappointed. With the FPU, pairing is little but good.

Those are the pairable instructions in the U-pipe:

 - Fld fadd fsub fmul fdiv fcom fchs fabs

And this one, only ONE, is pairable in the V-pipe:

 - Fxch (What a dangerous one! ;-)

Well, from this point on I will assume that you have the guide for optimizing for Pentium by Agner Fog. If you don't have it, download it from the Internet. Let's take the fadd - it takes 3 clock cycles - and the fxch (1 clock cycle):

        fadd            ; 1-3
        fxch            ; 1
        ****            ; COPRO inst.

If the next instruction to fxch isn't a FPU instruction, pairing is not possible:

        fsub            ;1-3
        fxch            ;2 No pairing
        mov ax,1        ;3

That's roughly all about pairing. Well there are some exceptions, but we will get to see them in other sections. If you ask why fxch is so quick and can start after the first instruction has finished: This is so 'coz it doesn't really swap the contents of the two registers, it only renames them.

03 - Overlapping

Maybe pairing wasn't very good at all, but here we have something better, the overlapping. Don't you remember that the FPU is independent of the CPU? In this way the two of them can work at the same time.

Overlapping is when one instruction is still being executed and another one that does not depend on the result of the first one starts.

Here are some instructions so that you can understand the examples:

        Instr.          Clocks.         Ov. Int         Ov. Fpu Pairable
        fadd            3               2               2       yes
        fsub            3               2               2       yes
        fmul            3               2               2       yes
        fdiv            39              38              2       yes
        fimul           6               2               2       no
        fld             1               0               0       yes
        fild            3               2               2       no
        fnop            1               0               0       no
        fstp            2               0               0       no
        ...
 Ov. Int -> It can overlap with a CPU instruction (integer inst.).
 Ov. Fpu -> It can overlap with a FPU instruction.

For example, fadd takes 3 clock cycles. In the first cycle it starts, and the others can be overlapped with other FPU and integer instructions, but if the next instruction needs the result of the add, there will be a stall till fadd has finished.

        fadd            ;1-3 starts in the 1th clock and has 2 clocks of ov.
        fsub            ;2-4 the fsub starts after the fadd has finished

Another example:

        fmul            ;1-3 two ov. clocks
        fxch            ;1 pairing
        fsub            ;2-4 ov.
        mov ax,1        ;3 int inst. ov.
        fmul            ;4-6 FPU ov.

Those examples were only for showing the basics of overlapping. In fact you have to care about stalls:

        fld  [inc]      ;1 we load a 32-bit floating point data
        fadd [x]        ;2-4 we add something to it
        fmul [y]        ;5-7 oh! no overlapping

When the fmul has to be executed, it has to wait for the fadd result.

        fld  [inc]      ;1
        fadd [x]        ;2-4
        fstp [result]   ;5-6 (to mem32)

The store needs the result of the fadd, so it has to wait. Take a look at section 08 for serious info about stores.

04 - An example

Now some practice with the formulas we used in the introduction:

        U = x*cos(alpha) - y*sin(alpha)
        V = y*cos(alpha) + x*sin(alpha)

First let's talk about methodology. That's when we do the same every day, so we will do the same on the next day, too, for example going to the university. For coding with the FPU, I strongly recommend to do something like that:

        Inst.   operands       ; timing - fpu state - explanation

It may seem a little bit time consuming, but if you have a bug in an one-page code, it will be very useful.

fld     alpha           ;1 - alpha - the angle (alpha=alpha)
fsincos                 ;2-90 - cos alpha sin alpha - we do both
fst     pre_cos         ;91-92 - cos alpha sin alpha - we need it twice
fxch                    ;91 - sin alpha cos alpha - pairing
fst     pre_sin         ;92-93 - sin alpha cos alpha - we save it
fmul    y               ;94-96 - y*sin alpha cos alpha - multiply
fxch                    ;94 - cos alpha y*sin alpha - more pairing
fld     pre_cos         ;95 - cos alpha cos alpha y*sin alpha - we restore it
fxch                    ;95 - cos alpha cos alpha y*sin alpha - we get the already
                        ;loaded
fmul    x               ;96-98 - x*cos alpha cos alpha y*sin alpha -
fld     pre_sin         ;96 - sin alpha x*cos alpha cos alpha y*sin alpha - the sin
fxch    st(1)           ;96 - cos alpha x*cos alpha sin alpha y*sin alpha - pairing
fmul    y               ;97-99 - y*cos alpha x*cos alpha sin alpha y*sin alpha -
fxch    st(3)           ;97 - y*sin alpha x*cos alpha sin alpha y*cos alpha -
fsub                    ;99-101 - (x*cos alpha-y*sin alpha ) sin alpha y*cos alpha
                        ;- stall in clock 98 - there still was the fmul
fxch                    ;99 - sin alpha (x*cos alpha-y*sin alpha ) y*cos alpha
fmul    x               ;100-102 - x*sin alpha (x*cos alpha-y*sin alpha ) y*cos alpha
fxch                    ;100 - (x*cos alpha-y*sin alpha ) x*sin alpha y*cos alpha
fstp    u               ;103-104 - x*sin alpha y*cos alpha - stall! from fsub
fadd                    ;105-107 - (x*sin alpha+y*cos alpha) - no stall
fstp    v               ;108-109 - - stall from fadd

Someone said that the code would become more complicated? ;-D Maybe it has some bugs but I never tried to compile that nor debug it, so it may work or not (grin). See how the fxchs are always free. I used 7 of them. That's 'coz a lot of instructions use st(0) and st(1).

Also see that I neither load x nor y - they are read directly from memory. Another thing is that we only calculated the sin and cos once.

And now a very bad idea: storing the sin and cos and then reloading them. See section 06 for more about that.

The code isn't totally optimized. Although we saved some stalls, new ones happened, and thus we 'lost' 5 clock cycles.

05 - Loading

This section is a bit short because it only deals with the format of the data. Integer instructions usually take 6 clock cycles, compared to the 3 clocks of the floating point ones, and they are neither pairable nor can be overlapped. So that's why I recommend to use floating point. In the previous example the last stores were FP ones. This couldn't be possible in 'real' code, 'coz you will need them for any register.

Remember: It's better for speed (and that's the subject of this article ;-) to have all the data in fp-32bits and when needed (but try to avoid it) you convert it to integer-16bits so that you can use them.

06 - Fmul

Did I warn you from some exceptions? ;-D Here's one of them.

When you have 2 consecutive fmuls, they can't overlap the first clock cycle of the first fmul:

        fmul            ;1-3
        fmul            ;2-4 stall: 3-5

To avoid that, you may put any instruction in between, but be careful, 'coz the fxch pairs with the first fmul and so you will have a stall:

        fmul            fmul             fmul
        fxch            fnop             mov ax,1
        fmul stall!     fmul no stall    fmul no stall

In that way you can use those clock cycles for other instructions or preloading data in the cache, be imaginative! E-)

Maybe you are asking why they can't overlap the first ov.-legal clock. That's why the hardware (the FPU) isn't as good as it could be, just like the V-pipe restrictions. We expect solutions for both problems from Intel.

Now be careful. I will now talk about something which I didn't know until some days ago: If you put "fmul" in your source, the assembler will usually generate "fmulp st(1),st(0)" (opcode: DEC9h), so st(0) will be popped. That is a bad idea if you need it again. If you don't want st(0) to be popped, write write the following instead: "fmul st(1),st(0)" (opcode: DCC9h), and no data will be popped (result in st(0)).

An interesting detail: Everybody knows that x*y = y*x. OK? Well, Intel's wonderful engineers apparently DON'T know that. If you write "fmul st(1),st(0)", the opcode DCC9h will be created, but if you write "fmul st(0),st(1)", it will produce opcode D8C9h!

Well, now remembering the problem when you need a data twice, you may do something like writing the full instruction, so it doesn't pop anything. Or you can load the same data again:

        fld y           ;1 - y x
        fxch            ;1 - x y -
        fld st(1)       ;2 - y x y - we load y again, but from a register

As easy as that, you could store it (fst) to another register, but this will be slower.

07 - Fdiv, fsqrt

Hey, we have two exceptions. E-) Well, let's remember the timing of fdiv. It takes 39 clocks, 38 integer overlapping, 2 FPU overlapping.

Wait a moment, this needs a little explanation. Fdiv can take 19, 33, or 39 clocks. This is based on the precision: 24, 53, or 64 bits respectively. You can change that with the control word (see section 10).

Well, if you take a look at the instruction fdiv, you see that it has all the clocks except the first for int-overlapping. But on the other hand it has only 2 clocks for FPU-overlapping. However, which of them? The first? Random ones? No, the last ones.

So there will be a stall, look:

        fdiv            ;1-39
        fxch            ;1
        fsub            ;38-40 stall

So let's fill our clock cycles:

        fdiv            ;1-39
        fxch            ;1 pairing r0oLz!
        mov ax,1        ;2 Int inst.
        bsr bx,dx       ;3-26 approx. it's so slow (makes log2(dx))
        fadd            ;38-40 stall (we only lost 11 clocks in the stall)

The trick lies in the fact that FPU instructions will cause a stall but int instructions won't, so we use as many instructions as possible. Remember that if you don't have anything else to do you can read some variables only for having them in the level 1 cache.

Fdiv, fidiv, fsqrt, and fptan can't overlap with the mul (CPU) instruction.

Instructions that take a lot of clock cycles like ftan, fsqrt, etc. will cause the same exception as fdiv.

08 - Fstp

This section deals with stores. Forget what I said about stores before; the real info comes now. OK? I do that 'coz I think it isn't a good idea to put all the theory first and then some code. I prefer first explaining the basics and then advancing to more complicated things.

Fact: the FPU is quicker at loading than storing. Why? Because of some weirds things with the hardware. But what's the matter? Making loading quicker, so that you don't have to load all your data before. Example:

        fld     x       ;1
        fadd    y       ;2-4

This is better than:

        fld     x       ;1
        fld     y       ;2
        fadd            ;3-5

But everything has a drawback. What's the drawback here? Storing is a tad slow. Fst needs the data to be stored, one clock after it's been written, or it gets a 1-clock stall, time for loading the data internally. Example:

        fld     x       ;1
        fadd    y       ;2-4
        fstp    result  ;6-8 (if fistp 6-11!)

Theorically this fstp has to start in the clock cycle 5, but in this cycle the data has recently been computed and thus fstp has a stall, so it really starts in the clock cycle 6. A solution for that would be:

        fld     x       ;1
        fadd    y       ;2-4
        cmp     data,eax ;3-4
        mov     eax,ebx ;5
        fstp    result  ;6-8 now no stall

So we loaded in the cache (data) and did some int instructions. We used all the clocks that would normally be lost due to stalls.

09 - Another example

In this example we will use everything learned till now. But the most important thing is that you understand how to optimize for FPU. What you have to do is to fill the time between an instruction and stall with other instructions so that you never lose clocks cycles due to stalls. Another concept is pairing fxch, which is free for us and helps a lot with overlapping. Rembemeber, if any instruction needs the result of the previous one, they can't overlap.

Now let's take a look at a simple formula: x=xý y=yý => result=ûx+y

I don't know if that formula is used anywhere, but hey! This is only an example. ;-)

        fld     x       ;1-x- we get x
        fmul    st(0),st(0);2-4-xý-remember: x*x = xý Ok?
        fld     y       ;3-y xý-now y                           [0]
        fmul    st(0),st(0);4-6-yý xý-the same                  [1]
        fxch    st(1)   ;4-xý yý- Exchange it now               [2]
        fld     st(0)   ;5-xý xý yý- copy xý in st(0)           [3]
        fxch    st(1)   ;5-xý xý yý- Exchange them (free)       [4]
        fstp    x       ;6-7-xý yý-store xý                     [5]
        fadd    st(0),st(1);8-10-(xý+yý) yý- add them           [6]
        fxch    st(1)   ;8-yý (xý+yý)-for saving it
        fstp    y       ;9-10-(xý+yý)-save yý                   [7]
        fsqrt           ;11-80-û(xý+yý)- very slow! E-(
        ...                                                     [8]
        fstp    result  ;81-82--stall                           [9]

 [0] Here we solutioned the two consecutive fmuls with a fld.
 [1] We  can  store it right now, but we  don't 'coz it will create a BIG stall
     (stall: 8-9. Not a good idea - losing 3 clocks.)
 [2] We  exchange it now, so that we don't have  to do it in [5], 'coz there it
     will not pair (fstp isn't pairable) and we would lose 1 clock.
 [3] The xý is now stall-free 'coz fmul finished in clock 4.
 [4] We  could start the fadd now but the fmul[1] is still working and we could
     have  a  stall, in that way no. Maybe you  are asking why we do an fxch to
     data  that  is  equal. That's 'coz the  fstp  needs the one clock after it
     stores it, so the recently loaded st(0) could produce a stall but st(1) is
     just in clock 6 free from fstp's stalls.
 [5] Remember  fstp neither pairs nor overlaps.  Some people prefer putting the
     fstp  at  the end of the code, so  that they don't suffer from any stalls,
     but  I put it here 'coz the data was 'free', and so the last fmuls finish.
     Good idea, don't you think so?
 [6] In  this  clock the fmul[1] has finished,  so  there is no stall. We could
     store it with something like a fst (without pop) now.
 [7] Remember  that  fstp doesn't have a  stall, nor the next instruction, 'coz
     the previous fadd finished in clock 10, and it starts in clock 11.
 [8] Remember  that  you  have  69 (random  number  ;-)  clocks for putting all
     possible INT instructions there. (Take advantage of this!)
 [9] I  suppose that you couldn't profit from the 69 clocks, so there's a stall
     that make us lose 2 clocks.

Wow, what a piece of code. I'm proud of it. OE-) Only one stall at the end. The rest works perfectly. In theory this code takes 82 clocks, but a testing program (see 'Counting clocks') displays that it takes 105 clocks. Misaligned data? Memory access slow? Interrupts? Who cares! (If anybody knows why please let me know, thanx [email me].) The problem of this code is the terribly slow fsqrt, it takes 87,8% of it! Solutions? Should one emulate the sqrt? Do we have to use precalculated tables? As I use to say, if anybody knows the solution, let us know.

10 - Control Word

'Coz I wanted to write a good article (do you think I've succeeded?) I was playing with the control word, 'coz I didn't have a lot of info about it.

The purpose of the control word is to change some things. You can read it, so that you know what the configuration of the FPU is, and write to it in order to update the configuration. There is also the status word, which can only be read and reports the status of the FPU. It's something like the flags in the CPU. If you have Turbo Debugger, you will see that the debugger allows you to change the status word! "Why?" you may be asking. This is 'coz when you want to read the status word, you don't really do so. Remember that you are under a debugger. It returns the data to you that it wants to return. Don't care too much about it. That's how you read the status word:

        fstsw   ax      ;must be ax
        fstsw   word_   ;to memory

Now you can interpret the results of the last FPU instruction and if any exception happened. Let's play a little with the control word.

        fstcw   word_   ;only to memory
        and     word_,x ;modify it
        or      word_,y ;modify it more
        fldcw   word_   ;now the fpu will work with the new options

Here's a little description; if you need more info, then check Intel's documentation.

 Bit Name   Description
 ------------------------- LSB --------------------------------------
 0 - IM     Invalid instruction (M means mask)
 1 - DM     Invalid number (denormal)
 2 - ZM     Zero division
 3 - OM     Result too big (overflow)
 4 - UM     Result too little (underflow)
 5 - PM     Result needs more precision than currently selected
 6 -        Reserved, always 1
 7 -        Reserved, always 0 (I deduced both, so I can be wrong)
 ------------------------- MSB --------------------------------------
 0-1 - Pc   Precision,  0=24 (single)  1=reserved
                       10=53 (double) 11=63 (extended)
 2-3 - Rc   Rounding control (truncate, etc.)
 4 - Ic     Infinite control
 5-7 -      Reserved, 0

Once I read (in an article from Astharoth^Tlotb in PhyMosys mag #7) that if you masked the FPU instructions with exceptions, that could make exceptions take less time. I tried it, set all flags to 0, and ran it... The used code was pseudo-random E-), prepared for stack overflow.

With all the flags to 1 it took 964 clocks.
With the flags to 0 it took 1461 clocks!

I don't know if, when he said 'mask it', he meant setting them to 1, or was that 'coz this trick doesn't work with stack overflow. E-? If anybody knows it, please contact me.

11 - Dependency chains

This idea is easy; it's when some instruction needs the result of the previous one. This makes problems with the CPU, with the FPU it will make some problems with the overlapping, 'coz all the instructions need the result, and there are lots of stalls:

        fld     x       ;1- Formula: (x+y)*b*c
        fadd    y       ;2-4
        fmul    b       ;5-7-stall: 2 clocks
        fmul    c       ;8-10-stall: 2 clocks

Each instruction produces a stall. Not a good idea.

A solution would be to use more registers (st(0),st(1)) if needed, so that you can avoid some stalls.

        fld     x       ;1- chain 1
        fadd    y       ;2-4-c1
        fld     b       ;3-c2
        fmul    c       ;4-6-c2
        ...             ;insert any inst. if possible
        fmul            ;7-9-we join them

Another example would be to write a result to the memory (fstp) and in the next instruction try to read from it. This would cause a 1-clock stall.

12 - Inverses

I think the inverses are interesting and useful. This is a kind of algorithm optimization that is very useful with the hardware limitations.

The idea is replace a div with a mul. How? First you do the inverse of the number that is to be divided, and then you can do the mul instead of the div:

        fld     z       ;1-z-the number that divides
        fld1            ;2-3-1 z-load 1
        fdiv            ;4-42-inverse-stall, div-> 1/z
        ...             ;you had better fill that ;-D
        fld     x       ;41-x inverse-the number to be divided
        fmul            ;43-45-(x/z)-then the 'div'

Maybe you wonder how this works? Let's imagine we have to calculate 10/2. Then we get to do the inverse: 1/2=0.5. We emulate the div: 10*0.5=5 = 10/2=5. OK? Maybe a good idea would be to create a precalculated table of fp32 (not INT!) and then instead of div just to do mul. Isn't that funny? Remember the clocks: fmul-3, fdiv-39, mul-11, div-41.

13 - Counting clocks

In all the examples we assumed the clocks to be constant, no matter if the data was misaligned, or something wrong happened to the processor. E-) Well, the best thing we can do is to check the 'real' clocks. I found a solution in Agner Fog's document about optimzing. Go for it. Testing you code is easy, just follow the comments there.

14 - Last example

I didn't plan to do this, but I saw that by writing this article, I've learned a lot, so I tried it.

An article by tryx (Hugi #12, bonus) contained a little 'introduction' to FPU optimizing. He optimized a large piece of code and said that he would pay a beer to anyone who beats his code. This was the C++ code:

        void V3D::Add (V3D &a, V3D &b, V3D &c) {
                c.x = a.x + b.x;
                c.y = a.y + b.y;
                c.z = a.z + b.z;
        }

It was originally (by Watcom) 21 clocks, he optimized it to 13 clocks. Here is his code (explanation of sx follows at the end of my code):

         command        cycles  fpu-stack   comment

 fld  dword ptr [eax+0] 1       ax
 fadd dword ptr [ebx+0] 2-4     sx
 fld  dword ptr [eax+4] 3       ay sx       Fld runs parallel to the fadd.
                                            This is called overlapping.
 fadd dword ptr [ebx+4] 4-6     sy sx
 fxch                   4       sx sy       Fchg is the only pairing
                                            FPU-command in the V-pipe.
 fld  dword ptr [eax+8] 5       az sx sy
 fadd dword ptr [ebx+8] 6-8     sz sx sy
 fxch                   6       sx sz sy
 fstp dword ptr [ecx+0] 7-8     sz sy       Fstp takes 2 cycles, no overlap.
 fstp dword ptr [ecx+8] 10-11   sy          This fstp has to wait 1 cycle
                                            (stall) because it can only store
                                            values that where ready in the
                                            preceeding cycle! And sz was
                                            ready in cycle 9.
 fstp dword ptr [ecx+4] 12-13   empty

                  above code:                   13 cycles

I tried it and did a good job but couldn't beat him. I tried this only once, but I don't think it can be still better. I got 13 clocks, too, and the implementation changed a little bit as well. Here's my version:

        fld word ptr [eax+0]      ;1    ax
        fadd word ptr [ebx+0]     ;2-4  sx
        fld word ptr [eax+4]      ;3    ay sx
        fadd word ptr [ebx+4]     ;4-6  sy sx
        fxch                      ;4    sx sy
        fld word ptr [eax+8]      ;5    az sx sy
        fxch                      ;5    sx az sy
        fstp word ptr [ecx+0]     ;6-7  az sy   <- no stall
        fadd word ptr [ebx+8]     ;8-10 sz sy
        fxch                      ;8    sy sz
        fstp word ptr [ecx+4]     ;9-10 sz      <- no stall
        fstp word ptr [ecx+8]     ;12-13        <- 1 clock stall E-(

sx means ax+bx, ax the x of the first vector, bx the x from the second. This code isn't commented, 'coz you must now know everything. If you still don't understand something, mail me and I'll explain it.

15 - Concepts

Here's a little dictionary:

 Fpu     - Floating Point Unit
 pro     - The processor (CPU)
 int     - Integer, inst. from the pro
 fp      - Floating point, 32bits
 inst    - Instructions
 ov      - Overlapping
 stall   - When any inst. must wait for another one to finish
 fstsw   - Float Set Status Word
 fldcw   - Float Load Code Word (or something like that OE-)
 opt     - Optimize
 E-)     - The same as a usual smiley but with glasses like mines.

16 - Closing words

Phew! The article is over. If you have any question or idea, let me know. I hope everybody already knows how the FPU works, if you don't, write me, and I will write an article.

Excuse me for my bad English, but I'm not at university yet (I'm 16).

Some greetings go to:

Astharoth^Tlotb - For the description of the cw and sw (I read them too late)
Agner Fog - For his GREAT article about optimizing
Tryx^Xography - For his little article, it enlightened me (Next time I'll beat you! ;-)

  "Blind we have been and blind we remain.
   They call that the mediatic influence."
                                 Fred^Calodox

17 - Contacting

You can contact me via e-mail. Sorry for possible delays, but I only check it once a week unless it's necessary (here in Spain Internet is very EXPENSIVE):

dario_phong@mixmail.com

And if you prefer snail:

        Arturo San Emeterio Campos
        c/Rosal n§ 48  4§ 1¦
        cp: 08004 BARCELONA (SPAIN)

- DAriO PhONG, 7/01/99 Barcelona