Fixed Point Maths

TAD

Introduction

After explaining what fixed-point maths is, and writing about three or four tutorials for ViKtOr, I realised that there are still plenty of new coders who don't understand fixed-point maths. So I hope to help these people without confusing them too much.

I will be using mainly hex (base 16) numbers and binary (base 2) numbers to help explain things, so if you don't know about these, then please, please go and learn about them!! You only need to remember 16 hex numbers and their binary values. Once you know these well you can convert from hex-to-binary faster than a calculator and far beyond their 10 digit display range.

A basic understanding of bytes, words, ADD, SUB and carry will also make this article far easier to digest.

What is "Fixed-Point" maths?

It is a method to describe real numbers (ones with an INTEGER part and a FRACTION part, e.g. 10.5 or 5.8 or 0.3 or 156.2736) using only INTEGER values.

Hey, where has the FRACTION part gone?

Well, because our 80x86 registers and instructions only work with integer values (like 0, 1, 2, 3, 4 ... 65533, 65534, 65535 for a word), we must get rid of the fraction part.

So 10.5 must either be described as 10 or 11, we can have NO fraction.

We can describe 10 and 4 easily enough, but if we try to divide 10 by 4 we get 2.5 and this cannot be described in integers, so we must choose 2 or 3.

Suppose we scaled up our answer (real number 2.5) by 2, then our answer would be 5, which can be stored in an integer value. This is basically how fixed-point works, we scale up a number to that the fraction becomes an integer.

I will use a scale factor of 256, which is easy to perform on most CPUs.

      Real Number           Fixed Point         hex
     -------------         -------------       -----
          0 decimal               0 decimal     0000
          1                     256             0100
          2                     512             0200
          3                     768             0300
          4                    1024             0400
        ...                   .....             ....
        254                   65024             FE00
        255                   65280             FF00

Okay, but how does that help with FRACTIONS?? Let's take a look at 0.5 (which is 1/2) and 1.5, 2.5 and 3.5:

      Real Number           Fixed Point         hex
     -------------         -------------       -----
          0.5 decimal           128 decimal     0080
          1.5                   384             0180
          2.5                   640             0280
          3.5                   896             0380

So simply by scaling up the real number (which has an integer and fraction) by 256 we end up with an integer fixed-point number.

Looking at the "hex" column shows the fixed-point number written in hex (base 16). You have probably noticed already that the integer part (0,1,2,3) is unchanged and has only been moved two hex digits to the left. This is because 256 decimal = 0100 hex.

Why do we have 0080 hex for 0.5 and not 0050 hex?

Because 0100 hex represents the real number 1.0 (remember our scale is 256) and 0.5 means 1/2 (a half of one), so with a scale of 256 this means 256/2 which is 128 = 0080 hex.

If we wanted to describe 0.25 (which is 1/4) in fixed-point then this would equal 256/4 = 0040 hex.

How can we convert a real number to fixed point?

First of all we need a "scale" factor, this is usually 256 or 65536 because it makes programming so much easier. This scale tells us how many bits we can use for the pseudo-fraction part.

Secondly we need to decide how many bits we need for the integer part. This is usually 8 or 16 bits.

Now whenever you see 8.8 or 16.16 fixed-point format, this means tells us how many bits have been used for the integer and fraction. So 9.16 means an 9-bit integer and a 16-bit fraction, and 2.14 means a 2-bit integer and a 14-bit fraction.

Here are some common fixed-point formats and their scale factors.

   x.y       Integer     Fraction          Scale factor
 -------    ---------   ----------       ---------------
   4.8         4 bits      8 bits         2^8   =  256
   8.8         8           8              2^8   =  256
   2.14        2          14              2^14  =  16384
   1.15        1          15              2^15  =  32768
   8.16        8          16              2^16  =  65536
  16.8        16           8              2^8   =  256
  16.16       16          16              2^16  =  65536

Once we have the number of pseudo-fraction bits we can calculate the scale factor by using a power of 2.

Say we have the real number 3.5 and we used a fixed-point format of 4.8 then our scale factor is 2^8 = 256 and 3.5 * 256 = 896 (or 380 hex).

So to convert from a real number into fixed-point:

                  fixed point value  =  real number  *  scale

and to convert from fixed-point back into a real number:

                                       fixed point value
                     real number  =  ---------------------
                                             scale

For example, the 8.8 fixed point number 1840 hex is the real number 1840 hex / 256 = 24.25.

Fixed Point Addition

There is no difference between normal integer addition and fixed-point addition. The exact same code can be used for both.

For example, say we had the real numbers 9.5 and 4.25, then in 8.8 format they would be:

   Real number        * scale factor           8.8 format
  -------------       ----------------        ------------
      9.5             9.5 * 256 = 2432           0980 hex
      4.25            4.25 * 256 = 1088          0440 hex     +
                                                ÍÍÍÍÍÍÍÍÍÍÍ
                                                 0DC0 hex

If we convert 0DC0 hex back into a real number (divide it by 256) we get:

                                 0DC0 hex     <--- 8.8 fixed point
  real number     13.75   =    -------------
                                  256         <--- scale factor

And this is the correct answer, 9.5 + 4.25 = 13.25, easy huh?

Fixed Point Subtraction

Again there is no difference between fixed-point and normal integer subtraction. The normal SUB or SBB instructions can be used.

   Real number        * scale factor           8.8 format
  -------------       ----------------        ------------
      9.5             9.5 * 256 = 2432           0980 hex
      4.25            4.25 * 256 = 1088          0440 hex     -
                                                ÍÍÍÍÍÍÍÍÍÍÍ
                                                 0540 hex

Now converting 0540 hex back into a real number, we get:

                                 0540 hex     <--- 8.8 fixed point
  real number      5.25   =    -------------
                                  256         <--- scale factor

Which again is correct.

Fixed-Point Multiplication

It has been very easy so far, now it gets a little more difficult.

We can use the normal MUL and IMUL instructions to perform unsigned and signed multiplication, but we need to take into account the scale-factor.

Let's take a really easy calculation, 10 * 2 = 20. In fixed-point 8.8 format these numbers would be:

   Real number        * scale factor           8.8 format
  -------------       ----------------        ------------
     10.0            10.0 * 256 = 2560          0A00 hex
      2.0             2.0 * 256 = 512           0200 hex     * (times)
                                         ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
                                              140000 hex

But now if we convert back to a real number:

                                140000 hex    <--- fixed point
  real number     5120   =    -------------
                                  256         <--- scale factor

This is wrong! We should get 20 as our real number!!!

Well the reason is we are performing the normal A = B * C multiply, but both B and C have been scaled up by 256 (our scale factor). So in fact we are performing this:

                         A  =  (B * 256)  *  (C * 256)

We can lose the (..) brackets to get:

                           A  =  B * 256  *  C * 256

Or this:

                            A  =  B * C * 256 * 256

Or this:

                              A  =  B * C * 65536

The answer A is 65536 times too big!! (I.e. scale factor * scale factor.) What we really want is A * 256 so it is already in 8.8 fixed-point format. A simple divide by scale factor (256) gives us the correct answer.

So the correct way to perform fixed-point multiplication is:

                                   B    *    C
                        A  =  --------------------------
                                   scale factor

Remember that A, B and C are fixed-point numbers and NOT real numbers.

Fixed-Point Division

Just like multiplication, division also needs the answer to be scaled by the fixed-point factor (256 for example) to give the correct result.

But this time we scale up one fixed-point number before division, instead of scaling down the answer after multiplication.

   Real number        * scale factor           8.8 format
  -------------       ----------------        ------------
     20.0            20.0 * 256 = 5120          1400 hex
      2.0             2.0 * 256 = 512           0200 hex     / (divide)
                                         ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
                                                000A hex

Which is wrong, the answer should be 0A00 hex.

The correct formula for division is this:

                                 B    *   scale factor
                        A  =  --------------------------
                                       C

Trying this with real numbers 20.0 and 2.0 gives us:

                                 1400 hex  *   scale factor
                    0A00 hex  =  --------------------------
                                          0200 hex

So the answer as a real number is 0A00 hex / 256 = 10.0

Some words about SIN and COS

Way back in 1988/1989 I wrote a simple 3d engine on the Amiga and Atari ST. It had some filled spaceships and fast, filled circles as planets. It was at this time when I discovered fixed-point maths for the first time and used SIN and COS for many calculations (rotation, movement etc...). There is a very important thing to remember when using fixed-point maths for SIN and COS calculations which I had forgot until trying to teach ViKtOr how to draw a a circle using them.

Okay, both functions can be performed using a look-up table, where each entry in the table describes a SIN/COS value for a particular angle (0 degrees, 1 degree, 2 degrees ... 359 degrees).

We know that we need to use fixed-point maths so we can store and use these SIN and COS values. Perhaps the most obvious scale-factor is 65536 so that each SIN value is stored in a word, or perhaps 256 so a byte can be used.

But there is a problem, a SIN/COS value has a maximum value of +1.00 if we scaled it up by 256 we would get 256 (0100 hex) and this is TOO big to fit inside a single byte. Likewise with a scale-factor of 65536 the number +1.00 would become 65536 (10000 hex) which is again TOO big, for a word.

So let's use a scale value of 128 for a byte, this means +1.00 would become 128 (80 hex) which fits neatly into a byte. Or we could use 32768 for a word so +1.00 would become 32768 (8000 hex).

But....

SIN and COS values can be negative as well as positive, so we have a possible range of -1.0 to +1.0 so we need to use SIGNED numbers.

If you are clever you may have already noticed that 128 and 32768 are both negative numbers in a signed byte and word. Some people avoid this by using 127 and 32767 to represent +1.00 in a byte and word, but this is wrong.

In fact you need to use a minimum of two integer bits. One for the sign bit and one for the integer value. So for a byte we would use a 2.6 format and for a word we use a 2.14 format.

                      (scale = 64)       (scale = 16384)
   Real number         2.6 format          2.14 format
  -------------       ------------        -------------
     -1.0 decimal       C0 hex              C000 hex
     -0.5               E0                  E000
      0.0               00                  0000
     +0.5               20                  2000
     +1.0               40                  4000

If you were using the 2.14 format to store a SIN or COS value and wanted to multiply it by an integer (a radius for example), then try this:

              MOV     AX, <radius>        <--- integer value
              MOV     DX, <sin value>     <--- 2.14 fixed-point format

              IMUL    DX
              SHL     AX, 1
              RCL     DX, 1
              SHL     AX, 1
              RCL     DX, 1

Gives: DX = <radius> * <sin value> / 16384

Remember that is answer in the DX register is ONLY an integer value and NOT a fixed-point value.

The SHL and RCL instructions can also be done using ADD and ADC instead:

              MOV     AX, <radius>        <--- integer value
              MOV     DX, <sin value>     <--- 2.14 fixed-point format

              IMUL    DX
              ADD     AX, AX
              ADC     DX, DX
              ADD     AX, AX
              ADC     DX, DX

On a 80386+ CPU you may want to use the SHLD instruction instead.

              MOV     AX, <radius>        <--- integer value
              MOV     DX, <sin value>     <--- 2.14 fixed-point format

              IMUL    DX
              SHLD    DX, AX, 2

And now finally, some 80x86 code!!

I will use some 32-bit (80386+) instructions and a fixed-point format of 16.16 bits for the calculations. This means a scale factor of 65536.

The operands marked <a> and <b> are assumed to be 1.15.16 fixed point numbers with 1 bit sign and 15-bit integer and 16-bit fraction.

Addition:

              MOV    EAX, <a>
              ADD    <b>, EAX

Subtraction:

              MOV    EAX, <a>
              SUB    <b>, EAX

Multiplication:

              MOV    EAX, <a>
              MOV    EDX, <b>
              IMUL   EDX
              SHRD   EAX, EDX, 16     ; EAX = (<a> * <b>) / 65536

Division:

              MOV    EAX, <a>
              MOV    EDX, EAX
              SAR    EDX, 16
              SHL    EAX, 16
              IDIV   <b>              ; EAX = (<a> * 65536) / <b>

And now some integer <n> & fixed-point <a> 80x86 code:

Multiplication: (fixed = fixed * integer)

              MOV    EAX, <a>
              MOV    EDX, <n>
              IMUL   EDX              ; EAX = <a> * <n>

Multiplication: (fixed = fixed / integer)

              MOV    EAX, <a>
              CDQ
              IDIV   <n>              ; EAX = <a> / <n>

Closing Words

Be very careful when using fixed-point maths, make sure that the numbers you will be using will fit easily into the format you are using, and don't forget to allow for a sign-bit (if you need it).

Also make sure you are performing the correct multiplication or division technique on your numbers. An operation using two fixed-point numbers is NOT the same as an operation on one fixed-point and one integer number.

And above all else, remember that fixed-point numbers are just scaled up real numbers!!

Good luck.

Regards

TAD #:o)