Optimizing The Putbits

Dario Phong

Table of contents
  - Introduction
  - The 'proc'
  - First optimization
  - Second optimization
  - The same in pmode
  - Get bits
  - Limiting the buffer
  - Improving this doc
  - Closing words
  - Contacting the author

Introduction

This article is aimed to a coder who has already implemented a compressor, knows how ASM works and how to optimize for Pentium. If you have never done any of those things, then better do them, and when it's done, read this.

Everybody who has even learned ASSEMBLER has also 'studied' the putpixel. A very optimized routine. Why? That's for three reasons: first is the way for understanding how the VRAM works, second it's used always in video routines, and third, everybody showed his putpixel.

Now think about the putbits. It's almost the most used routine from compressors. A lot of clocks cycles are spent in your putbits. Do you really think you can optimize you compressor/parser/... enough? I hope so, but a fast putbits is always needed, so here we talk about it.

I know the best for optimizing is pmode, but I still don't work on it (soon I'll do so ;-), so we will use real mode, and we will optimize for a plain Pentium (586), so specific optimizations for other procesors will not be put here, unless someone really feels that's necessary. And I don't think so. E-)

This doc isn't intended to be a plain source-code, so you better understand how everything works.

The 'proc'

The routine is simple: Every time it's called it needs a byte of source, the bits to put into the output buffer, and the number of bits to write.

Then the routine puts the bits, and if the byte is full (= we wrote 8 bits) we update the output buffer. For putting the bit in the right position we shift it and then 'or' the output byte.

Here it is the putbits that I use, at least till now: ;-)

 ;--------------------------------------------------------------------
 ; Putbits
 ; Input: cx-> number of bits to write  bl->bits to be written
 ; Output: nothing
 ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com
 ;--------------------------------------------------------------------

 put_bits:

   test            cx,cx          ;I used that. Put it ONLY if you know
   jz short        @@ptb_fin      ;that maybe you will try to write 0 bits
                                  ;but better avoid writing 0 bits ;-D
   push ax es di                  ;Save only the registers that you know
                                  ;are currently used.
   mov             es,buf_com     ;ds is faster than any other segment
                                  ;register. But it's needed for some data
 @@ptb_more_bits_:                     ;in my 'main' loop OE-)
   push            cx             ;We need cx for the shl/cl
   xor             ax,ax          ;clear it
   mov             bh,ptb_byte    ;get the byte to finally write
   mov             al,bl          ;get the input byte (where the bits are)
   shr             al,1           ;put the first bit in the carry flag
   adc             ah,0           ;Then to ah. I know there's salc
                                  ;but I think it's an undocumented inst.
                                  ;from Intel's Pentium, so...
   mov             bl,al          ;save the shifted byte
   mov             cl,ptb_totalbits ;how many bits we wrote
   shl             ah,cl          ;Put the bit to its position
   or              bh,ah          ;and then update the output byte
   mov             ptb_byte,bh    ;save the output byte
   ;now let's see if we must update the output buffer
   inc             ptb_totalbits  ;we wrote one bit
   cmp             ptb_totalbits,8 ;did we wrote the full byte?
   jb short        @@ptb_no_byte  ;90h
   mov             ptb_totalbits,0 ;start again to write the bits
   mov             di,ptr_com     ;the pointer to the input buffer
   mov             es:[di],bh     ;the output byte
   inc             di             ;next position
   mov             ptr_com,di     ;save it for next time
   inc             bytes_com      ;the number of written bits
   mov             ptb_byte,0     ;erase the output byte
 @@ptb_no_byte:
   pop             cx             ;the number of bits to write
   loop short      @@ptb_more_bits_ ;are we done?
   pop     di es ax               ;restore the used registers
 @@ptb_fin:
   ret

Problems? Here? A lot. There's no pairing, never. The shl/cl takes 4 clocks, push&pop isn't very funny, of course everything can be worse, the data may be misaligned, or in the same dword-boundary...

Let's count the clocks: 6+(16*number of bits)+6(if we update out. buffer). This is approximately the clocks for that piece of code, of course assuming that all the data is in the level 1 cache (that's like assuming Bill Gates is documenting every secret of Windows ;-).

To write one bit and update the buffer, this will take 28 clocks. I tried it with a program that reads the counters, and it took 65 clocks!

Writing 8 bits, so writing the byte to the output buffer too: 252 clocks.

First optimization

We will try to do more pairing, and loading, and saving the data only when it starts and when ends:

 ;--------------------------------------------------------------------
 ; Putbits
 ; Input: dx-> number of bits to write  al->bits to be written
 ; Output: nothing
 ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com
 ;--------------------------------------------------------------------

 put_bits:

        mov             bh,ptb_byte
        mov             cl,ptb_totalbits
        mov             es,buf_com      ;yes, still using es }E-)
        mov             di,ptr_com
        push            cx di           ;this pairs

 @@ptb_more_bits_:                      ;the 'main' loop

        xor             ah,ah           ;1 u
        shr             al,1            ;2 u
        adc             ah,0            ;3 u
        shl             ah,cl           ;4-7 u - the bits written
        or              bh,ah           ;8 u
        inc             cl              ;1 v

        cmp             cl,8            ;9 u
        jb short        @@ptb_no_byte   ;9 v

        xor             cx,cx           ;1 u - the counter of written bits
        mov             es:[di],bh      ;2 u
        inc             di              ;2 v - next byte
        mov             ptr_com,di      ;3 u - save only if incremented
        inc             bytes_com       ;3 v
        xor             bh,bh           ;4 u - output byte

 @@ptb_no_byte:

        dec             dx              ;10 u - the counter
        jnz short       @@ptb_more_bits_ ;10 v

        mov             ptb_byte,bh
        mov             ptb_totalbits,cl
        pop             di cx           ;this pairs too
        ret

This version has been debugged at least twice or more. It works. And it's better than the first.

7+(10*number of bits)+4(if we update out. buffer)

It's ok, don't you think? We improved the pairing, and sure that writing for example 5 bits is far better than in the first attempt.

The real timing for this is 51 clocks for one bit. For 8 bits 145! We have reduced the number of clock cycles from 252 to 145. Quite good. E-)

The bit organization is the following:

        0000 0000<- first bit
              ||--- second bit
              |---- third bit ...

Second optimization

One of the fucking things in this loop is the 4-clock shl/cl. How can we avoid it? Easy, every time we put a new bit we shift the byte to the left. This implies two things: we only spend 1 clock with the shift, BUT the order of the bits in the byte is inverted, so:

 first bit->0000 0000
 second bit--||     |- last bit
 third bit----|

If you choose this putbits, then you MUST change your getbits, but not a lot. Your first getbits shifted to the left. This one will shift to the right.

 ;--------------------------------------------------------------------
 ; Putbits
 ; Input: cl-> number of bits to write  al->bits to be written
 ; Output: nothing
 ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com
 ;--------------------------------------------------------------------

 put_bits:

        push            di
        mov             es,buf_com      ;some initalizations
        mov             di,ptr_com
        mov             ah,ptb_byte
        mov             ch,ptb_totalbits

 @@ptb_more_bits_:                      ;the 'main' loop

        shr             al,1            ;u 1 - put bit in the carry flag
        adc             ah,0            ;u 2 - put it in output byte
        inc             ch              ;v 2 - the count of written bits

        cmp             ch,8            ;3 u
        jb short        @@ptb_no_byte   ;3 v

        xor             ch,ch           ;1 u - the counter of written bits
        mov             es:[di],ah      ;2 u
        inc             di              ;2 v
        mov             ptr_com,di      ;3 u
        inc             bytes_com       ;3 v
        xor             ah,ah           ;4 u - clear output byte

 @@ptb_no_byte:

        shl             ah,1             ;u 4 - put the bit in its position
        dec             cl               ;5 u - the counter
        jnz short       @@ptb_more_bits_ ;5 v

        mov             ptb_byte,ah
        mov             ptb_totalbits,ch
        pop             di
        ret

There are some things which have to be mentioned: Why is the 'shl ah,1' after the comparison and not just after the 'inc ch'? Easy, when you have written 8 bits, you shift again to the left, then bit #7 will go out. In that way they are only shifted if the byte isn't full. Once the updating of the buffer 'shl ah,1' works, but don't worry, the content of ah is 0, so there's no problem with it. As you see, the body of the loop is smaller and all the instructions are simple and only one clock. I will not care more for 'theory' timing, so here it's the real timing: One bit: 41 - Eight bits: 110 It is better than the previous version, and it's more than 50% faster than the first. Hey, I'm proud of it. OE-) I must say that the first was very poorly coded, so it wasn't a very difficult job. E-) The same in pmode Well, since I'm now learning pmode I decided to do the same version but in pmode. This is not really very different. Anyway here it goes:

 ;--------------------------------------------------------------------
 ; Putbits. Pmode.
 ; Input: cl-> number of bits to write  al->bits to be written
 ; Output: nothing
 ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com
 ;--------------------------------------------------------------------


 put_bits:

          mov             edi,ptr_com     ;I like pmode E-)
          mov             ah,ptb_byte
          mov             ch,ptb_totalbits
          push            edi

 @@ptb_more_bits_:                        ;the 'main' loop

          shr             al,1            ;u 1 - put bit in the carry flag
          adc             ah,0            ;u 2 - put it in output byte
          inc             ch              ;v 2 - the counter of written bits

          cmp             ch,8            ;3 u
          jb short        @@ptb_no_byte   ;3 v

          xor             ch,ch           ;1 u - the counter of written bits
          mov             [edi],ah        ;2 v
          inc             edi             ;3 u
          inc             bytes_com       ;3 v - I reduced 2 clocks putting
          mov             ptr_com,edi     ;3 u - <- this there
          xor             ah,ah           ;4 u - clear output byte

 @@ptb_no_byte:

          shl             ah,1            ;u 4 - put the bit in his position
          dec             cl              ;5 u - the counter
          jnz short       @@ptb_more_bits_ ;5 v

          mov             ptb_byte,ah
          mov             ptb_totalbits,ch
          pop             edi
          ret

This is, at least till someone publishes a better one, the best put_bits out there. Putting one bit: 39 clocks. Putting 8: 105 clocks.

Get bits

This document isn't a class of "How to cut'n'paste", this is a "Learn and implement" class, or at least I wanted it to be in that way, but optimization requires examples, so the source code is here. But there's still the get bits. Once you've learned all this stuff, write your own. If you get good results, show us your implementation.

Did you really think you would get no homework? X-D

Remember that if you use the last version for getbits you must use shr, and if you use one of the others, shl. Or something like that. OE-)

Limiting the buffer

Today I was planning the memory needed for my new compressor (really a lot E-). And as memory is limited, you have to choose well how you will distribute it. Let's think. A lot of memory for the input file. More memory for the data structures needed by the compressor... So we now have almost no free mem, and we still need memory for the output buffer.

And I came to this solution:

The output buffer will be 5k long or something like that, the put_bits will have a counter with all the bytes written in this buffer (this is not bytes_com), and when it's 5k (so the buffer is full) we write it to the output file, and start again writing in the 5k buffer.

This has not been tested yet, so maybe 5k is not enough and spends a lot of time writing.

Improving this doc

Of course this doc isn't finished, it still needs YOUR help. If you improve the put_bits, or do a good get_bits, send them to me, and a new version of that doc will be released. You will be properly credited. Let's do a good piece of optimization. Of course other ideas around that will be accepted, making the put_bits get the source bits from a word, etc. If you read something that doesn't make sense or something, let me know. If you have any idea, email me: dario_phong@mixmail.com

Closing words

My words have arrived at the end, at least till someone else wants to put his work in favour of everybody. Nothing else to say. Well, something else: If I don't get any feedback I will not update this doc. E-(

Today I've written one article and finished it, and I still haven't coded my new compressor in pmode. }E-)

Contacting the author

Contacting me is as simple as sending an email. E-)

dario_phong@mixmail.com

I check it almost every week, so don't worry if my email is a little bit late. And check my page, and join my email list (I'll send you an email, when I've updated my page with new things to download).

WWW: http://www.geocities.com/SiliconValley/Byte/6789/

- DAriO PhONG^PhyMosys, Barcelona 20-29/1/1999